bad results after pretraining

KavyaGujjala commented 5 years ago

Hi,

I have run the pretraining.py data on domain specific text.

I gave 9 lakh sentences text file, batch size 32, learning rate 2e-5, num of train steps 10000.

masked lm accuracy - 69%

After that I used that model.ckpt-10000 in the init_checkpoint of

python extract_features.py \ --input_file=/tmp/input.txt \ --output_file=/tmp/output.jsonl \ --vocab_file=$BERT_BASE_DIR/vocab.txt \ --bert_config_file=$BERT_BASE_DIR/bert_config.json \ --init_checkpoint=./pretraining_output_10000/model.ckpt-10000 \ --layers=-1,-2,-3,-4 \ --max_seq_length=128 \ --batch_size=8

and got the [CLS] token for sentence representations from last layer to compare similarity between sentences.

cos similarity between dissimilar sentences if very high and vice versa.

What should I do to get good sentence vectors?

KavyaGujjala commented 5 years ago

Hi,

I have run the pretraining.py data on domain specific text.

I gave 9 lakh sentences text file, batch size 32, learning rate 2e-5, num of train steps 10000.

masked lm accuracy - 69%

After that I used that model.ckpt-10000 in the init_checkpoint of

python extract_features.py --input_file=/tmp/input.txt --output_file=/tmp/output.jsonl --vocab_file=$BERT_BASE_DIR/vocab.txt --bert_config_file=$BERT_BASE_DIR/bert_config.json --init_checkpoint=./pretraining_output_10000/model.ckpt-10000 --layers=-1,-2,-3,-4 --max_seq_length=128 --batch_size=8

and got the [CLS] token for sentence representations from last layer to compare similarity between sentences.

cos similarity between dissimilar sentences if very high and vice versa.

What should I do to get good sentence vectors?

I have trained on bert base uncased model

commands used

python create_pretraining_data.py \ --input_file=./sent_text.txt \ --output_file=./tf_examples.tfrecord \ --vocab_file=$BERT_BASE_DIR/vocab.txt \ --do_lower_case=True \ --max_seq_length=64 \ --max_predictions_per_seq=10 \ --masked_lm_prob=0.15 \ --random_seed=12345 \ --dupe_factor=5

python run_pretraining.py \ --input_file=./tf_examples.tfrecord \ --output_dir=./pretraining_output_10000 \ --do_train=True \ --do_eval=True \ --bert_config_file=$BERT_BASE_DIR/bert_config.json \ --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ --train_batch_size=32 \ --max_seq_length=64 \ --max_predictions_per_seq=10 \ --num_train_steps=10000 \ --num_warmup_steps=10 \ --learning_rate=2e-5

jicksonp commented 5 years ago

Hello @KavyaGujjala, Did you find any solution to this problem?

KavyaGujjala commented 5 years ago

Hello @KavyaGujjala, Did you find any solution to this problem?

Hi @jicksonp , increase the training steps the accuracy will increase for masked lm and next sentence prediction.

But still I couldnt get good sentence representations how many pooling methods I used.

ghost commented 5 years ago

We have already fine-tuned a couple of BERT models successfully to domain data. A few things that we have learned and could be worth testing in you case:

BERT is very sensitive to learning rate. Try a few larger + smaller values. Might depend on your training corpus.
Warmup steps are important because they relate to the learning rate schedule. 10 steps in your case seems very low. Try to bump this up.
depending on the task for your sentence embeddings you might wanna derive them from different parts of the trained network using different pooling strategies (not always final CLS token). Check out the QA section here https://github.com/hanxiao/bert-as-service and this great article by Han Xiao https://hanxiao.github.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/

How do your loss curves look like?

KavyaGujjala commented 5 years ago

We have already fine-tuned a couple of BERT models successfully to domain data. A few things that we have learned and could be worth testing in you case:

BERT is very sensitive to learning rate. Try a few larger + smaller values. Might depend on your training corpus.

Warmup steps are important because they relate to the learning rate schedule. 10 steps in your case seems seems very low. Try to bump this up.

depending on the task for your sentence embeddings you might wanna derive them from different parts of the trained network using different pooling strategies (not always final CLS token). Check out the QA section here https://github.com/hanxiao/bert-as-service and this great article by Han Xiao https://hanxiao.github.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/

How do your loss curves look like?

Hi @deepset-ai sorry for the delayed response.

When I trained for 50k steps, loss was 1.708

I never changed warmup steps and didnt know its related to learning rate schedule. Also I increased the dataset size to 10M sentences thinking its the data problem.

If my dataset size is 5L sentences and I want to train it for 50k steps with learning rate 2e-5 what should be the number of warmup steps?

ghost commented 5 years ago

It depends a bit on your target corpus, but if you want to keep the 50k steps you should probably try something between 20-30k warmup steps. Also consider increasing your training steps. You might also want to checkout this article https://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf by Martin Popel and Ondřej Bojar. They report on many interesting experiments for transformer training, incl. effect of warmup steps and learning rate.

KavyaGujjala commented 5 years ago

It depends a bit on your target corpus, but if you want to keep the 50k steps you should probably try something between 20-30k warmup steps. Also consider increasing your training steps. You might also want to checkout this article https://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf by Martin Popel and Ondřej Bojar. They report on many interesting experiments for transformer training, incl. effect of warmup steps and learning rate.

Hi @deepset-ai , I tried giving 3 lakh steps and 50k warmup steps with learning rate 5e-5 for 9L sentences dataset.

These are the results

INFO:tensorflow:***** Eval results *****
INFO:tensorflow:  global_step = 300000
INFO:tensorflow:  loss = 0.75486124
INFO:tensorflow:  masked_lm_accuracy = 0.82143104
INFO:tensorflow:  masked_lm_loss = 0.6522004
INFO:tensorflow:  next_sentence_accuracy = 0.9725
INFO:tensorflow:  next_sentence_loss = 0.1032866

This is the highest accuracy I got till now.

I tried getting sentence embeddings for few sentences and did clustering. CLustering results are not good.

I did hierarchical pooling over last 4 layers of trained model ( excluded [CLS] and [SEP] tokens , considered all the other tokens including ## tokens )

Dont know if its the problem with the embeddings or how I am doing clustering.

Any idea on how to do this?

ghost commented 5 years ago

Glad to hear that tweaking the LR helped!

Extracting meaningful sentence embeddings depends again on your corpus and the cluster you want to obtain. The two things you should experiment with are pooling strategy and used layers. Keep in mind: the deeper the layer, the higher the level of semantic concepts (see this excellent paper for details https://arxiv.org/abs/1905.05950).

Regarding pooling strategies you could try:

reduce_mean (e.g. on second last layer)
using the CLS token only (gave us good results, especially if you fine-tune on another down-stream task before)

If you have some explicit dimensions in mind that should be represented in you clusters, it could also be worth adding a related proxy-task for down-stream training.

Scagin commented 5 years ago

Which device did you use to train your model, GPU or TPU? I found that I cannot train my model by GPU while I don't have TPUs, and the code just running on CPU. How can I run on GPU, or how can I train faster?

wahab4114 commented 3 years ago

@KavyaGujjala @deepsetai I am also trying to pretrain BERT on dataset which contains 150k sentence with batch_size: int = 64 lr: int = 1e-5. After 8 epochs nsm task gives accuracy around 0.77 (overfits if I continue with more epochs) and mlm task starts from 0.01 acc and goes till 0.2 at max. What is wrong here? Can I stop nsm at one point and continue to do mlm for longer period of time? My training iter len is 2486 (2486 training steps per epochs) which means 40*2486=99440.

Timoeller commented 3 years ago

Hey @wahab4114 always nice to see folks training their own Language Models. 150k sentences do not seem like a lot of data, are you sure you want to train a LM from scratch or would finetuning an already pretrained LM be a better choice? LM finetuning you can do in FARM or HF transformers. Also, a batch size of 64 might not be suitable for training a model from scratch. If you can, you should consider increasing it, maybe by lowering your max_seq_len?

Addressing your question: yes, disabling NSM might be a good idea, since it is still debatable if this task improves LM capabilities in the first place, e.g. RoBERTa was trained without NSM alltogether. The accuracy for MLM should increase above 0.2, though this number highly depends on your data. See https://github.com/google-research/bert/issues/557 for some reference accuracies.

wahab4114 commented 3 years ago

@Timoeller Thanks a lot for your response. The reason is that I have not enough resources and time to pretrain it on larger dataset and I want to pretrain two bert models (one with nsm+mlm and one with nsm+mlm+third_objective). Later, I want to compare these two by doing fine-tuning on downstream tasks. Is it okay for me to pretrain bert models on smaller dataset and later do comparison or not?

Timoeller commented 3 years ago

Sure, please do train on the data you have available.

In experiments we made even an untrained BERT reaches decent performance and when you look into Figure 1 of our paper you will see that the LMs reach very good performance early on during pretraining. This early checkpoints should be similar to training with few datapoints. I would be interested to see your downstream task performance.

wahab4114 commented 3 years ago

@Timoeller what you are trying to say is that accuracy of my model regarding mlm task should be good during early phase of pretraining or you are saying that even an early check point of bert performs well on downstream task?

Timoeller commented 3 years ago

even an early check point of bert performs well on downstream task?

This might be the case, depending on a couple of factors (LM config, pretraining data, downstream data).

wahab4114 commented 3 years ago

@Timoeller I have a dataset of size around 100MB. This data is very less as compared to the size of the dataset used in your paper. Do you think this could be the reason for MLM not converging?

Timoeller commented 3 years ago

Of course this might be the reason, but it is hard to pin it down. Why do you not start with an already trained model and finetune/adapt it on your data?

Just to make sure we have the same understanding of terminology: LM finetuning and LM adaptation is used as synonyms in the community, there is also finetuning on downstream task with supervised labels.

You could also finetune/adapt the LM with nsm+mlm and in another run with nsm+mlm+third_objective since it seems you want to find out the influence of the third_objective on downstream tasks.

google-research / bert

bad results after pretraining #529