luyug / Condenser

EMNLP 2021 - Pre-training architectures for dense retrieval
Apache License 2.0

Resources and time required for pre-training #4

Closed YuLengsen closed 2 years ago

YuLengsen commented 2 years ago

Thank you for your excellent work! Could you share how much compute and how much time you spent pre-training Condenser and coCondenser? And what batch size and number of epochs were used?

luyug commented 2 years ago

Condenser is pre-trained on 4 RTX 2080 Ti GPUs for roughly one week. We used a batch of 1024 x 128 tokens and up to 8 epochs.

coCondenser is further pre-trained on the same hardware for roughly 2~3 days. We used a batch of 2000 x 128 tokens and up to 8 epochs.

I found rapidly diminishing marginal gains from increasing the number of epochs beyond 4.

I can't make a concrete promise at the moment, but I am planning to release code for pre-training on TPU, which should give a decent speedup.
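
For anyone mapping these effective batch sizes onto different hardware, here is a small illustrative helper (not part of this repo; the per-device batch and accumulation values below are assumptions, not the exact launch settings used):

```python
# Illustrative only: compute the effective batch per optimizer step for a
# distributed run, to compare against the 1024 x 128 (Condenser) and
# 2000 x 128 (coCondenser) token budgets quoted above.

def effective_batch(n_gpus: int, per_device_batch: int, grad_accum_steps: int, max_seq_length: int):
    sequences = n_gpus * per_device_batch * grad_accum_steps
    tokens = sequences * max_seq_length
    return sequences, tokens

# e.g. 4 GPUs with per-device batch 128 and 2 accumulation steps
# -> 1024 sequences of 128 tokens per optimizer step.
print(effective_batch(n_gpus=4, per_device_batch=128, grad_accum_steps=2, max_seq_length=128))

# e.g. 8 GPUs with per-device batch 128 and no accumulation -> also 1024 x 128.
print(effective_batch(n_gpus=8, per_device_batch=128, grad_accum_steps=1, max_seq_length=128))
```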

1024er commented 2 years ago


I am trying to reproduce the Condenser pre-training results. I evaluate the checkpoints on the STS-B task with sentence-transformers, but the results are different:

| Checkpoint | Cosine-Similarity (Pearson / Spearman) | Manhattan-Distance (Pearson / Spearman) | Euclidean-Distance (Pearson / Spearman) | Dot-Product-Similarity (Pearson / Spearman) |
| --- | --- | --- | --- | --- |
| (1) bert-base-uncased | 0.8484 / 0.8419 | 0.8345 / 0.8322 | 0.8349 / 0.8328 | 0.7521 / 0.7421 |
| (2) Luyu/condenser | 0.8528 / 0.8504 | 0.8394 / 0.8380 | 0.8396 / 0.8378 | 0.7942 / 0.7819 |
| (3) self-trained checkpoint | 0.8498 / 0.8469 | 0.8415 / 0.8396 | 0.8423 / 0.8402 | 0.7959 / 0.7826 |

All numbers are from the sentence-transformers EmbeddingSimilarityEvaluator on the sts-test split.
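
For reference, this kind of run follows the sentence-transformers STS-B example (training_stsbenchmark.py). A minimal sketch, assuming the checkpoint name, split loading, and hyperparameters below are placeholders rather than the exact settings used above:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Wrap the checkpoint under test (e.g. "bert-base-uncased", "Luyu/condenser",
# or a local pre-training output directory) with mean pooling.
word_embedding_model = models.Transformer("Luyu/condenser", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# STS-B pairs with gold similarity scores normalized to [0, 1]; loading the
# real train/test splits is omitted here for brevity.
train_samples = [InputExample(texts=["A plane is taking off.", "An air plane is taking off."], label=1.0)]
test_samples = [InputExample(texts=["A man is playing a guitar.", "A man plays the piano."], label=0.3)]

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test")

# Fine-tune on the STS-B train split, then report Pearson/Spearman for cosine,
# Manhattan, Euclidean, and dot-product similarity on sts-test, producing logs
# like the ones quoted above.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    warmup_steps=100,
    output_path="output/training_stsbenchmark_condenser",
)
evaluator(model, output_path="output/training_stsbenchmark_condenser")
```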

I ran the pre-training on 8x 32GB V100 GPUs with the following settings:

```
python -m torch.distributed.launch --nproc_per_node 8 run_pre_training.py \
  --output_dir output \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --save_steps 20000 \
  --per_device_train_batch_size 128 \
  --gradient_accumulation_steps 1 \
  --fp16 \
  --warmup_ratio 0.1 \
  --learning_rate 1e-4 \
  --num_train_epochs 8 \
  --overwrite_output_dir \
  --dataloader_num_workers 16 \
  --n_head_layers 2 \
  --skip_from 6 \
  --max_seq_length 128 \
  --train_dir data \
  --weight_decay 0.01 \
  --late_mlm
```

I use per_device_train_batch_size = 128, so the global batch size is 128 x 8 = 1024. Could you please give me some suggestions? Thank you.

luyug commented 2 years ago

@1024er Can you share what pre-training corpus you are using?