huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Cannot reproduce example token classification GermEval 2014 (German NER) dataset #7419

Closed GarrettLee closed 3 years ago

GarrettLee commented 3 years ago

Environment info

Who can help

@stefan-it Please help.

Information

Model I am using (Bert, XLNet ...): bert-base-multilingual-cased

The problem arises when using:

I am running the pytorch version: transformers/examples/token-classification/run_ner.py

The tasks I am working on is:

GermEval 2014 (German NER) dataset

To reproduce

Steps to reproduce the behavior:

  1. download dataset: https://drive.google.com/drive/folders/1kC0I2UGl2ltrluI9NqDjaQJGw5iliw_J?usp=sharing
  2. Because our training machine cannot access the Internet, I downloaded the pretrained model from https://huggingface.co/bert-base-multilingual-cased and put all the downloaded files at transformers/examples/token-classification/bert-base-multilingual-cased
  3. 
    cat NER-de-train.tsv | grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
    cat NER-de-dev.tsv | grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
    cat NER-de-test.tsv | grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
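For reference, the shell pipeline above drops comment lines, keeps the token and label columns of the GermEval TSV, and joins them with a space. A minimal Python sketch of the same transformation (a hypothetical helper, not part of the repo):

```python
def convert_tsv(lines):
    """Equivalent of: grep -v "^#" | cut -f 2,3 | tr '\t' ' '.

    Keeps columns 2 (token) and 3 (label) of each TSV line, skips
    comment lines, and preserves blank lines as sentence boundaries.
    """
    out = []
    for line in lines:
        if line.startswith("#"):
            continue  # drop GermEval comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            out.append(f"{cols[1]} {cols[2]}")
        else:
            out.append("")  # blank line = sentence boundary
    return out
```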

    export MAX_LENGTH=128
    export BERT_MODEL=./bert-base-multilingual-cased

    python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
    python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
    python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
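preprocess.py exists to break up sentences whose subword count would exceed the tokenizer's budget, so nothing gets silently truncated at training time. A simplified sketch of that idea (not the actual script; `subword_len` stands in for a real tokenizer's per-token subword count, and the real script also reserves room for special tokens like [CLS]/[SEP]):

```python
def split_long_sentences(sentences, subword_len, max_len):
    """Start a new sentence whenever the next token would push the
    running subword count past max_len (simplified preprocess.py logic).

    sentences: list of sentences, each a list of (token, label) pairs.
    subword_len: callable returning the subword count for one token.
    """
    out = []
    for sent in sentences:
        current, count = [], 0
        for token, label in sent:
            n = subword_len(token)
            if count + n > max_len and current:
                out.append(current)  # flush the full chunk
                current, count = [], 0
            current.append((token, label))
            count += n
        if current:
            out.append(current)
    return out
```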

cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
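The labels.txt pipeline collects the set of distinct NER tags seen across all three splits. The same thing in Python, as a hypothetical helper (sorted output, mirroring `sort | uniq`):

```python
def collect_labels(lines):
    """Equivalent of: cut -d " " -f 2 | grep -v "^$" | sort | uniq.

    Each non-blank line is "token label"; return the sorted set of labels.
    """
    labels = set()
    for line in lines:
        parts = line.strip().split(" ")
        if len(parts) == 2:
            labels.add(parts[1])
    return sorted(labels)
```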

    export OUTPUT_DIR=germeval-model
    export BATCH_SIZE=32
    export NUM_EPOCHS=3
    export SAVE_STEPS=750
    export SEED=1

    python3 run_ner.py --data_dir ./ \
        --labels ./labels.txt \
        --model_name_or_path $BERT_MODEL \
        --output_dir $OUTPUT_DIR \
        --max_seq_length $MAX_LENGTH \
        --num_train_epochs $NUM_EPOCHS \
        --per_device_train_batch_size $BATCH_SIZE \
        --save_steps $SAVE_STEPS \
        --seed $SEED \
        --do_train \
        --do_eval \
        --do_predict




## Expected behavior

The F1 score on evaluation and test should be `0.8784592370979806` and `0.8624150210424085`, as the README states. However, running the script above on one V100 GPU, I get `0.83919` on evaluation and `0.81673` on test, much lower than expected.

GarrettLee commented 3 years ago

I found that after deleting the cache, the results can be reproduced. I guess that in my first attempt I used the wrong argument settings and the preprocessed features were cached; although I fixed the settings later, the code always loaded from the stale cache.
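For anyone hitting the same issue: the token-classification example caches the tokenized features in the data directory (the cache file name encodes the tokenizer and sequence length, so a changed `--max_seq_length` may still reuse a stale file). A hedged cleanup sketch, assuming the default `cached_*` naming of the example; alternatively, if your version of run_ner.py supports it, passing `--overwrite_cache` should have the same effect:

```shell
# Remove cached feature files in the data directory so the next run
# re-tokenizes with the current arguments (file names are assumed to
# match the example's default cached_* pattern).
rm -f ./cached_*
```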