jeniyat / StackOverflowNER

Source Code and Data for Software Domain NER
MIT License

BERT pretraining details #4

Closed; usuyama closed this issue 3 years ago

usuyama commented 3 years ago

Thanks for sharing your great work.

Some quick questions about the BERT pretraining:

- Could you share the pretraining setup (command and hyperparameters) you used?
- How did you decide on 64,000 as the size of the WordPiece vocabulary?
- Have you tried continual pretraining from bert-base using the unlabeled data (152 million sentences from StackOverflow)?

Thank you, Naoto

jeniyat commented 3 years ago

Hi @usuyama , you can find the details here: https://github.com/lanwuwei/BERTOverflow

usuyama commented 3 years ago

Thank you.

For others' reference, here's the command I found in https://github.com/lanwuwei/BERTOverflow:

python3 run_pretraining.py \
 --input_file=gs://softbert_data/processed_data/*.tfrecord \
 --output_dir=gs://softbert_data/model_base/ \
 --do_train=True \
 --do_eval=True \
 --bert_config_file=gs://softbert_data/model_base/bert_config.json \
 --train_batch_size=512 \
 --max_seq_length=128 \
 --max_predictions_per_seq=20 \
 --num_train_steps=1500000 \
 --num_warmup_steps=10000 \
 --learning_rate=1e-4 \
 --use_tpu=True \
 --tpu_name=$TPU_NAME \
 --save_checkpoints_steps=100000
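
As a side note, the *.tfrecord files consumed by --input_file come out of a preprocessing step that the command above assumes has already been run. A minimal sketch of that step with google-research/bert's create_pretraining_data.py is below; the raw-text path, the --do_lower_case setting, and the masking/duplication values are assumptions, while the sequence-length flags are chosen to match the run_pretraining.py call.

# Assumed preprocessing step (not quoted in this thread): build MLM/NSP
# TFRecords with google-research/bert's create_pretraining_data.py.
python3 create_pretraining_data.py \
 --input_file=gs://softbert_data/raw_text/*.txt \
 --output_file=gs://softbert_data/processed_data/pretrain_128.tfrecord \
 --vocab_file=gs://softbert_data/model_base/vocab.txt \
 --do_lower_case=False \
 --max_seq_length=128 \
 --max_predictions_per_seq=20 \
 --masked_lm_prob=0.15 \
 --dupe_factor=5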

I'd appreciate it if you could help me with the other two questions when you have time, @jeniyat.

jeniyat commented 3 years ago

Q: How did you decide on 64,000 as the size of the WordPiece vocabulary?
A: We experimented with different vocab sizes, and 64k gave the best results.
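
Neither this thread nor the command above shows how the 64k WordPiece vocabulary itself was built, so the following is only an illustration of where that number enters the pipeline. It uses SentencePiece's spm_train as a stand-in subword trainer (BPE here, not necessarily the authors' WordPiece tooling), and the corpus path is hypothetical:

# Illustration only: train a 64k subword vocabulary with SentencePiece.
# This is a stand-in, not the tooling actually used for BERTOverflow.
spm_train \
 --input=stackoverflow_sentences.txt \
 --model_prefix=softbert_64k \
 --vocab_size=64000 \
 --model_type=bpe \
 --character_coverage=0.9995

Re-running this with different --vocab_size values and comparing downstream NER results is presumably the kind of comparison the answer above describes.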

Q: Have you tried continual pretraining from bert-base using the unlabeled data (152 million sentences from StackOverflow)?
A: No. That would be an interesting experiment to do.
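
For anyone who wants to try it, run_pretraining.py can warm-start from an existing checkpoint via --init_checkpoint, so a continual-pretraining run would look roughly like the sketch below. This is not from the authors: the gs:// paths, step count, and learning rate are illustrative, $BERT_BASE_DIR is assumed to hold a released BERT-base checkpoint, and the TFRecords would need to be regenerated with BERT-base's original vocabulary rather than the 64k one.

# Hypothetical continual-pretraining run: same script as above, but
# initialized from BERT-base instead of trained from scratch.
python3 run_pretraining.py \
 --input_file=gs://softbert_data/processed_data_basevocab/*.tfrecord \
 --output_dir=gs://softbert_data/model_continued/ \
 --do_train=True \
 --do_eval=True \
 --bert_config_file=$BERT_BASE_DIR/bert_config.json \
 --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
 --train_batch_size=512 \
 --max_seq_length=128 \
 --max_predictions_per_seq=20 \
 --num_train_steps=100000 \
 --num_warmup_steps=10000 \
 --learning_rate=2e-5 \
 --use_tpu=True \
 --tpu_name=$TPU_NAME \
 --save_checkpoints_steps=100000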