luyug / Condenser

EMNLP 2021 - Pre-training architectures for dense retrieval
Apache License 2.0

failed to reproduce the condenser pretraining results on V100 #10

Closed: 1024er closed this issue 2 years ago

1024er commented 2 years ago

I am trying to reproduce the Condenser pre-training results. I evaluated the checkpoints on the STS-b task with sentence-transformers, but the results differ.

(1) bert-base-uncased

2022-01-03 17:07:01 - Load pretrained SentenceTransformer: output/training_stsbenchmark_bert-base-uncased-2022-01-03_17-04-06
2022-01-03 17:07:02 - Use pytorch device: cuda
2022-01-03 17:07:02 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 17:07:05 - Cosine-Similarity : Pearson: 0.8484 Spearman: 0.8419
2022-01-03 17:07:05 - Manhattan-Distance: Pearson: 0.8345 Spearman: 0.8322
2022-01-03 17:07:05 - Euclidean-Distance: Pearson: 0.8349 Spearman: 0.8328
2022-01-03 17:07:05 - Dot-Product-Similarity: Pearson: 0.7521 Spearman: 0.7421

(2) Luyu/condenser

2022-01-03 17:12:46 - Load pretrained SentenceTransformer: output/training_stsbenchmark_Luyu-condenser-2022-01-03_17-09-51
2022-01-03 17:12:48 - Use pytorch device: cuda
2022-01-03 17:12:48 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 17:12:50 - Cosine-Similarity : Pearson: 0.8528 Spearman: 0.8504
2022-01-03 17:12:50 - Manhattan-Distance: Pearson: 0.8394 Spearman: 0.8380
2022-01-03 17:12:50 - Euclidean-Distance: Pearson: 0.8396 Spearman: 0.8378
2022-01-03 17:12:50 - Dot-Product-Similarity: Pearson: 0.7942 Spearman: 0.7819

(3) self-trained checkpoint

2022-01-03 17:34:30 - Load pretrained SentenceTransformer: output/training_stsbenchmark_output--2022-01-03_17-31-48
2022-01-03 17:34:32 - Use pytorch device: cuda
2022-01-03 17:34:32 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 17:34:34 - Cosine-Similarity : Pearson: 0.8498 Spearman: 0.8469
2022-01-03 17:34:34 - Manhattan-Distance: Pearson: 0.8415 Spearman: 0.8396
2022-01-03 17:34:34 - Euclidean-Distance: Pearson: 0.8423 Spearman: 0.8402
2022-01-03 17:34:34 - Dot-Product-Similarity: Pearson: 0.7959 Spearman: 0.7826
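For context, these numbers come from the standard sentence-transformers STS-b example. The evaluation step looks roughly like the sketch below; the checkpoint path is a placeholder, and the stock training_stsbenchmark.py script handles the data download itself.

import csv
import gzip
import os

from sentence_transformers import InputExample, SentenceTransformer, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Placeholder path: point this at the fine-tuned STS-b checkpoint to evaluate.
model = SentenceTransformer('output/training_stsbenchmark_Luyu-condenser-...')

# STS benchmark file used by the sentence-transformers example scripts.
sts_path = 'stsbenchmark.tsv.gz'
if not os.path.exists(sts_path):
    util.http_get('https://sbert.net/datasets/stsbenchmark.tsv.gz', sts_path)

test_examples = []
with gzip.open(sts_path, 'rt', encoding='utf8') as f:
    for row in csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE):
        if row['split'] == 'test':
            # Gold scores are on a 0-5 scale; normalize to 0-1.
            test_examples.append(InputExample(
                texts=[row['sentence1'], row['sentence2']],
                label=float(row['score']) / 5.0))

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_examples, name='sts-test')
evaluator(model)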

I run the pretraining on 8x32G V100 with the following settings:

python -m torch.distributed.launch --nproc_per_node 8 run_pre_training.py \
  --output_dir output \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --save_steps 20000 \
  --per_device_train_batch_size 128 \
  --gradient_accumulation_steps 1 \
  --fp16 \
  --warmup_ratio 0.1 \
  --learning_rate 1e-4 \
  --num_train_epochs 8 \
  --overwrite_output_dir \
  --dataloader_num_workers 16 \
  --n_head_layers 2 \
  --skip_from 6 \
  --max_seq_length 128 \
  --train_dir data \
  --weight_decay 0.01 \
  --late_mlm

I use per_device_train_batch_size = 128, so the global batch size is 128 x 8 = 1024. The pre-training data is BookCorpus + Wikipedia, created with the code released by NVIDIA.

raw data:
5.0G bookscorpus_one_book_per_line.txt
13G wikicorpus_en_one_article_per_line.txt

After preprocessing: 24G book_wiki.json, containing 41,420,334 lines with maxlen=128.
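Roughly, the preprocessing just splits every document into spans of at most 128 wordpiece tokens and writes one JSON object per line. A simplified sketch of that step is below; the field name and exact schema are assumptions on my part, and the repo's helper scripts define the actual format expected by run_pre_training.py.

import json

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
MAX_LEN = 128  # matches --max_seq_length above

def to_spans(document, max_len=MAX_LEN):
    # Tokenize the whole document, then cut it into fixed-size spans.
    ids = tokenizer(document, add_special_tokens=False)['input_ids']
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]

with open('book_wiki.json', 'w') as out:
    for path in ['bookscorpus_one_book_per_line.txt', 'wikicorpus_en_one_article_per_line.txt']:
        with open(path) as f:
            for line in f:
                for span in to_spans(line.strip()):
                    # "spans" is an assumed key name, for illustration only.
                    out.write(json.dumps({'spans': span}) + '\n')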

I used the same data to train bert-large and was able to reach F1 = 90% on the SQuAD task, so I think the corpus should be fine.

Could you please give me some suggestions? Thank you.

1024er commented 2 years ago

I also find that the variance of the Spearman correlation on the test set is quite large. Is the result in the paper the average over multiple runs?

I ran the default settings 4 times: python training_stsbenchmark.py Luyu/condenser

2022-01-03 20:58:39 - Load SentenceTransformer from folder: output/training_stsbenchmark_Luyu-condenser-2022-01-03_20-55-43
2022-01-03 20:58:41 - Use pytorch device: cuda
2022-01-03 20:58:41 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 20:58:44 - Cosine-Similarity : Pearson: 0.8549 Spearman: 0.8509
2022-01-03 20:58:44 - Manhattan-Distance: Pearson: 0.8470 Spearman: 0.8447
2022-01-03 20:58:44 - Euclidean-Distance: Pearson: 0.8473 Spearman: 0.8450
2022-01-03 20:58:44 - Dot-Product-Similarity: Pearson: 0.8059 Spearman: 0.7951

2022-01-03 20:59:11 - Load SentenceTransformer from folder: output/training_stsbenchmark_Luyu-condenser-2022-01-03_20-56-02
2022-01-03 20:59:14 - Use pytorch device: cuda
2022-01-03 20:59:14 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 20:59:16 - Cosine-Similarity : Pearson: 0.8570 Spearman: 0.8541
2022-01-03 20:59:16 - Manhattan-Distance: Pearson: 0.8504 Spearman: 0.8497
2022-01-03 20:59:16 - Euclidean-Distance: Pearson: 0.8508 Spearman: 0.8496
2022-01-03 20:59:16 - Dot-Product-Similarity: Pearson: 0.8144 Spearman: 0.8035

2022-01-03 19:27:40 - Load pretrained SentenceTransformer: output/training_stsbenchmark_Luyu-condenser-2022-01-03_19-24-38
2022-01-03 19:27:41 - Use pytorch device: cuda
2022-01-03 19:27:41 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 19:27:43 - Cosine-Similarity : Pearson: 0.8538 Spearman: 0.8493
2022-01-03 19:27:43 - Manhattan-Distance: Pearson: 0.8457 Spearman: 0.8433
2022-01-03 19:27:43 - Euclidean-Distance: Pearson: 0.8462 Spearman: 0.8434
2022-01-03 19:27:43 - Dot-Product-Similarity: Pearson: 0.8088 Spearman: 0.7982

2022-01-03 21:00:00 - Load SentenceTransformer from folder: output/training_stsbenchmark_Luyu-condenser-2022-01-03_20-57-04
2022-01-03 21:00:03 - Use pytorch device: cuda
2022-01-03 21:00:03 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 21:00:05 - Cosine-Similarity : Pearson: 0.8505 Spearman: 0.8468
2022-01-03 21:00:05 - Manhattan-Distance: Pearson: 0.8432 Spearman: 0.8414
2022-01-03 21:00:05 - Euclidean-Distance: Pearson: 0.8440 Spearman: 0.8424
2022-01-03 21:00:05 - Dot-Product-Similarity: Pearson: 0.8108 Spearman: 0.8005
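For reference, a quick back-of-the-envelope check of the run-to-run spread of the four cosine Spearman scores above:

import statistics

# Cosine-similarity Spearman scores from the four runs above.
spearman = [0.8509, 0.8541, 0.8493, 0.8468]
print(f"mean={statistics.mean(spearman):.4f} stdev={statistics.stdev(spearman):.4f}")
# mean=0.8503 stdev=0.0031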

1024er commented 2 years ago

sentence-transformers == 1.2.1, transformers == 4.2.0

luyug commented 2 years ago

I'd need more information on your pre-training/fine-tuning setup/scripts to understand the situation and provide suggestions.

One caveat with the sentence-transformer package is that some example scripts use mean pooling by default while Condenser is designed for CLS pooling; you may need to make some slight code adjustments.
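With the sentence-transformers building blocks, switching to CLS pooling looks roughly like this (a sketch, not the exact example script):

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('Luyu/condenser', max_seq_length=128)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=False,  # the example scripts default to mean pooling
    pooling_mode_cls_token=True,     # Condenser is designed around the [CLS] token
    pooling_mode_max_tokens=False,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])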

1024er commented 2 years ago

> I'd need more information on your pre-training/fine-tuning setup/scripts to understand the situation and provide suggestions.
>
> One caveat with the sentence-transformer package is that some example scripts use mean pooling by default while Condenser is designed for CLS pooling; you may need to make some slight code adjustments.

Thank you, you are right. I modified the pooling type to [CLS] and was able to achieve almost comparable results:

[screenshot of STS-b results omitted]

Thank you for your code and guidance.

Hannibal046 commented 2 years ago

Hi, could you please tell me where to get the training corpus?

5.0G bookscorpus_one_book_per_line.txt

13G wikicorpus_en_one_article_per_line.txt

luyug commented 2 years ago

@Hannibal046 I'd recommend either the NVIDIA Megatron repo as mentioned above, or the wikipedia and bookcorpusopen datasets from the Hugging Face dataset hub.
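For example, with the datasets library, something along these lines (the '20200501.en' config name is an assumption; check the hub for the snapshots currently available):

from datasets import load_dataset

# Dump both corpora as one document per line.
wiki = load_dataset('wikipedia', '20200501.en', split='train')
books = load_dataset('bookcorpusopen', split='train')

with open('corpus_one_doc_per_line.txt', 'w') as f:
    for ds in (wiki, books):
        for example in ds:
            f.write(example['text'].replace('\n', ' ').strip() + '\n')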

Hannibal046 commented 2 years ago

@luyug Hi, thanks for your response. But I find that the NVIDIA repo's download_wikipedia does not work for me. How should I post-process the data after downloading wikipedia and bookcorpus from Hugging Face? I am currently trying to pre-train a BERT model from scratch. Any help would be appreciated. Thanks so much!