Open 41924076 opened 8 months ago
I attempted to replicate the training process of bloom-560m using the following parameters, but my training results were unsatisfactory. I suspect there may be an issue with the training parameters I used.
Params (8 A800 GPUs):
"batch_size": 1024, "negs_per_ins": 8, "max_seq_length": 300,
"model_name": "(mydir)/bloom-560m", "seed": 0, "steps_per_epoch": null,
"epochs": 3, "setting": "spec", "pooling": "lasttoken", "default_type": "query",
"dataset": "allnli,msmarco", "debias_batch": true, "warmup_epoch": 0.25,
"lr": 0.0004, "model_save_path": "(mydir)", "use_amp": false,
"wandb": false, "wandbwatchlog": "all", "local_rank": -1, "freeze": false,
"freezenonbias": true, "unfreezewte": false, "chunksize": 64, "tf32": true,
"devset": "all"
In each epoch, the log shows: dataloader length: 762, CKPT save steps: 96, warmup steps: 24.
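One consistent reading of these logged numbers (an assumption on my part, not something confirmed from `train.py`) is that the 762 batches are split across the 8 GPUs, giving 96 steps per epoch per process, and that warmup is `warmup_epoch` times that:

```python
import math

# Hypothetical reconstruction of the logged values; train.py may compute
# them differently -- this is only an arithmetic consistency check.
dataloader_length = 762   # batches per epoch, from the log
num_gpus = 8              # 8x A800
warmup_epoch = 0.25       # from the training params

steps_per_epoch = math.ceil(dataloader_length / num_gpus)  # 96
warmup_steps = int(steps_per_epoch * warmup_epoch)         # 24

print(steps_per_epoch, warmup_steps)  # 96 24
```

These match the logged "CKPT save steps: 96" and "warmup steps: 24".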
I used args.devset == 'all'; the evaluation results during training are:
epoch 1: msmarco_dev_small ndcg@10=0.26402, mrr@10=0.21343, stsbenchmark cos_sim_spearman_score=0.70745
epoch 2: msmarco_dev_small ndcg@10=0.28632, mrr@10=0.23291, stsbenchmark cos_sim_spearman_score=0.72313
epoch 3: msmarco_dev_small ndcg@10=0.29121, mrr@10=0.23728, stsbenchmark cos_sim_spearman_score=0.72577
Thank you for your interest in our work.
All training logs with args (on A100 GPU):
Thank you so much for sharing your log file!
@izhx Hello, I followed your 560m parameter settings and trained on 4 A800 GPUs, but my first-epoch loss and eval results differ from your log. Could you advise on what the possible causes might be?
My epoch 1: msmarco_dev_small ndcg@10=0.32329, mrr@10=0.27043, stsbenchmark cos_sim_spearman_score=0.74720
Your epoch 1: msmarco_dev_small ndcg@10=0.34639, mrr@10=0.28911, stsbenchmark cos_sim_spearman_score=0.82822
Could this be due to a difference in the accelerate config or in some library version? Could you share your accelerate config and the accelerate launch command you used?
Hello, both the data and the library versions could be factors.
I didn't use any special accelerate config; the launch command is as follows:
accelerate launch train.py \
--model_name bigscience/bloom-560m \
--lr 4e-4 --epochs 5 \
--setting spec --pooling lasttoken \
--default_type query \
--negs_per_ins=8 \
--dataset msmarco,allnli \
--batch_size 256 \ # 4 GPUs used here
--chunksize 128 \
--max_seq_length=300 --warmup_epoch 0.25 \
--freezenonbias --tf32 --debias_batch \
--model_save_path XXXX
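One hedged way to reconcile this command (batch_size 256 on 4 GPUs) with the paper's batch size of 1024 on 8 GPUs is to assume that `--batch_size` in `train.py` is a per-process value; this is only my assumption, not confirmed by the repo:

```python
# Hypothetical check: if --batch_size is the per-process batch size
# (an assumption, not confirmed from train.py), the 4-GPU command
# reproduces a global batch of 1024.
def global_batch(per_process_batch: int, num_gpus: int) -> int:
    return per_process_batch * num_gpus

print(global_batch(256, 4))  # 1024
```

If `--batch_size` is instead already global, the two runs would have different effective batch sizes, which could itself explain part of the gap.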
My main library versions are:
accelerate 0.21.0
mteb 1.0.2
sentence-transformers 2.2.2 # version with the lasttoken fix from this PR: https://github.com/UKPLab/sentence-transformers/pull/2111
torch 2.0.0
transformers 4.29.2
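To rule out version drift when reproducing, the versions listed above could be pinned in a requirements file (pins taken directly from the list; note sentence-transformers additionally needs the lasttoken PR applied):

```
accelerate==0.21.0
mteb==1.0.2
sentence-transformers==2.2.2
torch==2.0.0
transformers==4.29.2
```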
Thank you very much for your reply. The setup looks almost identical to mine; this performance gap is quite mysterious.
@izhx Could you kindly share the training scripts for models of varying scales, or alternatively the complete set of training parameters used at each scale? I saw the following parameters in your paper (AdamW, learning rate=4e-4, warmup period=0.25 epochs, batch size=1024, 8 negative examples, 8 A100-80GB GPUs, TF32). I'd like to know the values of the other parameters: whether all scales of the BLOOM models use epochs=3, chunksize=64, debias_batch, freezenonbias, and the default max_seq_length=300 as written in train.py, and whether any additional parameters were used.