huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Cannot reproduce example token classification GermEval 2014 (German NER) dataset #7419

Closed GarrettLee closed 3 years ago

GarrettLee commented 3 years ago

Environment info

Who can help

@stefan-it Please help.

Information

Model I am using (Bert, XLNet ...): bert-base-multilingual-cased

The problem arises when using:

I am running the pytorch version: transformers/examples/token-classification/run_ner.py

The tasks I am working on is:

GermEval 2014 (German NER) dataset

To reproduce

Steps to reproduce the behavior:

  1. download dataset: https://drive.google.com/drive/folders/1kC0I2UGl2ltrluI9NqDjaQJGw5iliw_J?usp=sharing
  2. Because our training machine cannot access the Internet, I downloaded the pretrained model from https://huggingface.co/bert-base-multilingual-cased and put all the downloaded files at transformers/examples/token-classification/bert-base-multilingual-cased
  3. 
    cat NER-de-train.tsv | grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
    cat NER-de-dev.tsv | grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
    cat NER-de-test.tsv | grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
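For reference, the shell pipeline above drops comment lines, keeps the token and label columns of the GermEval TSV, and joins them with a space. A minimal Python sketch of the same transformation (a hypothetical helper, not part of the repo):

```python
def convert_tsv(lines):
    """Equivalent of: grep -v "^#" | cut -f 2,3 | tr '\t' ' '.

    Keeps columns 2 (token) and 3 (label) of each TSV line, skips
    comment lines, and preserves blank lines as sentence boundaries.
    """
    out = []
    for line in lines:
        if line.startswith("#"):
            continue  # drop GermEval comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            out.append(f"{cols[1]} {cols[2]}")
        else:
            out.append("")  # blank line = sentence boundary
    return out
```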

    export MAX_LENGTH=128
    export BERT_MODEL=./bert-base-multilingual-cased

    python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
    python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
    python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
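preprocess.py exists to break up sentences whose subword count would exceed the tokenizer's budget, so nothing gets silently truncated at training time. A simplified sketch of that idea (not the actual script; `subword_len` stands in for a real tokenizer's per-token subword count, and the real script also reserves room for special tokens like [CLS]/[SEP]):

```python
def split_long_sentences(sentences, subword_len, max_len):
    """Start a new sentence whenever the next token would push the
    running subword count past max_len (simplified preprocess.py logic).

    sentences: list of sentences, each a list of (token, label) pairs.
    subword_len: callable returning the subword count for one token.
    """
    out = []
    for sent in sentences:
        current, count = [], 0
        for token, label in sent:
            n = subword_len(token)
            if count + n > max_len and current:
                out.append(current)  # flush the full chunk
                current, count = [], 0
            current.append((token, label))
            count += n
        if current:
            out.append(current)
    return out
```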

cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
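The labels.txt pipeline collects the set of distinct NER tags seen across all three splits. The same thing in Python, as a hypothetical helper (sorted output, mirroring `sort | uniq`):

```python
def collect_labels(lines):
    """Equivalent of: cut -d " " -f 2 | grep -v "^$" | sort | uniq.

    Each non-blank line is "token label"; return the sorted set of labels.
    """
    labels = set()
    for line in lines:
        parts = line.strip().split(" ")
        if len(parts) == 2:
            labels.add(parts[1])
    return sorted(labels)
```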

    export OUTPUT_DIR=germeval-model
    export BATCH_SIZE=32
    export NUM_EPOCHS=3
    export SAVE_STEPS=750
    export SEED=1

    python3 run_ner.py --data_dir ./ \
        --labels ./labels.txt \
        --model_name_or_path $BERT_MODEL \
        --output_dir $OUTPUT_DIR \
        --max_seq_length $MAX_LENGTH \
        --num_train_epochs $NUM_EPOCHS \
        --per_device_train_batch_size $BATCH_SIZE \
        --save_steps $SAVE_STEPS \
        --seed $SEED \
        --do_train \
        --do_eval \
        --do_predict




## Expected behavior

The F1 score on evaluation and test should be `0.8784592370979806` and `0.8624150210424085`, as the README states. However, running the script above on one V100 GPU, I get `0.83919` on evaluation and `0.81673` on test, much lower than expected.

GarrettLee commented 3 years ago

I found that after deleting the cache, the results can be reproduced. I guess that in my first attempt I used the wrong argument settings and the preprocessed features were cached; although I fixed the settings later, the code always loaded from the stale cache.
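For anyone hitting the same issue: the token-classification example caches the tokenized features in the data directory (the cache file name encodes the tokenizer and sequence length, so a changed `--max_seq_length` may still reuse a stale file). A hedged cleanup sketch, assuming the default `cached_*` naming of the example; alternatively, if your version of run_ner.py supports it, passing `--overwrite_cache` should have the same effect:

```shell
# Remove cached feature files in the data directory so the next run
# re-tokenizes with the current arguments (file names are assumed to
# match the example's default cached_* pattern).
rm -f ./cached_*
```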