mathamateur opened this issue 3 weeks ago
Hello! I have prepared a training log for SFT of GPT-2, since my SFT results differ as well. Please have a look. Also, I checked my environment and noticed that I have lower versions of deepspeed, torch, and transformers than you recommend. Could that be a problem? train.log requirements.txt
Hi, I've checked your log and I think this is because you didn't load the correct DeepSpeed config file for fp32, i.e., ./configs/deepspeed/ds_config_fp32.json. However, it was our mistake to omit this config from the training scripts. The scripts have now been updated in the repo, so you can retry them and see whether they work as you expected.
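Before relying on the updated scripts, it can help to verify that the config file really requests fp32. Below is a minimal sketch of such a check; the helper name `assert_fp32_config` and the example config contents are my own illustration, but the `"fp16": {"enabled": ...}` key follows DeepSpeed's documented config schema:

```python
import json

def assert_fp32_config(path: str) -> None:
    # Load the DeepSpeed config and fail loudly if it still enables fp16.
    with open(path) as f:
        cfg = json.load(f)
    if cfg.get("fp16", {}).get("enabled", False):
        raise ValueError(f"{path} still enables fp16 training")

# Toy usage with an in-memory example instead of the repo's actual file:
example = {"train_micro_batch_size_per_gpu": 8, "fp16": {"enabled": False}}
with open("ds_config_fp32_example.json", "w") as f:
    json.dump(example, f)
assert_fp32_config("ds_config_fp32_example.json")  # passes silently
```

Running this once before launching training catches the case where the wrong config path is wired into the script.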
Thanks for the clarification! However, I had already noticed this problem in the training script myself and fixed it the same way you did. You can check in my train.log file that the deepspeed config is correct, so I believe I actually ran the experiments in fp32. I guess the problem is somewhere else...
Hi, in your provided log the printed Arguments show deepspeed_config=None, which means the corresponding ds config for fp32 was not loaded successfully.
Moreover, the loss scaler only appears under fp16, so I think the model was trained in fp16.
So I suggest checking your scripts again and ensuring that the correct config has been loaded (maybe you can print the model_dtype before training).
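One way to follow this suggestion is to inspect the dtype of the model's parameters right before training. A minimal sketch with plain torch, using a toy nn.Linear in place of GPT-2 (the helper name `report_model_dtype` is my own):

```python
import torch
from torch import nn

def report_model_dtype(model: nn.Module) -> torch.dtype:
    # Inspect the dtype of the first parameter; under true fp32
    # training this should print torch.float32, not torch.float16.
    dtype = next(model.parameters()).dtype
    print(f"model_dtype: {dtype}")
    return dtype

model = nn.Linear(4, 4)      # torch modules default to fp32
assert report_model_dtype(model) == torch.float32
model = model.half()         # cast to fp16, as an fp16 run would
assert report_model_dtype(model) == torch.float16
```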
I have tried to run GPT-2 SFT in fp16 with your new script and noticed that deepspeed_config=None in this case as well :((( gpt2_sft_fp16_train.log
Can you pull our latest code and check the deepspeed_config in Arguments again? Or may I have a look at your training script for fp32?
Hello! I have pulled the latest version of your repo and tried to run SFT of GPT-2 in fp32 again. Indeed, this time deepspeed_config was logged as ds_config_fp32.json. However, the loss scaler was still activated for this configuration, so I am afraid the model is still trained in fp16. Here is my train log: sft_gpt2_fp32_new_train.log
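A quick way to screen a log for this symptom is to search it for loss-scale messages. A small heuristic sketch (the function name and the marker string are my own assumptions, based on DeepSpeed's dynamic loss scaler being active only under fp16):

```python
def looks_like_fp16_run(log_text: str) -> bool:
    # Heuristic: DeepSpeed's dynamic loss scaler only runs under fp16,
    # so loss-scale messages in a training log suggest fp16 training.
    return "loss scale" in log_text.lower()

assert looks_like_fp16_run("Reducing dynamic loss scale from 65536 to 32768")
assert not looks_like_fp16_run("step 10 | lm loss: 3.21 | lr: 1e-5")
```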
Hi, I think the problem comes from --model_dtype, which is set to fp16 by default. You can pass this parameter explicitly in your training script, e.g., --model_dtype fp32.
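This failure mode is easy to reproduce in isolation. A hypothetical reconstruction of the flag handling with argparse (not the repo's actual code): when the default is "fp16", simply omitting the flag silently trains in fp16 even with an fp32 DeepSpeed config:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_dtype",
                    choices=["fp16", "bf16", "fp32"],
                    default="fp16")  # silent fp16 unless overridden

args = parser.parse_args([])                         # flag omitted
assert args.model_dtype == "fp16"                    # the reported bug
args = parser.parse_args(["--model_dtype", "fp32"])  # explicit fix
assert args.model_dtype == "fp32"
```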
Dear authors! I have tried to reproduce your results on the dolly dataset with Qwen1.5 as the teacher and GPT-2 as the student. Unfortunately, my results differ from yours.
Could you clarify:
All other settings for each experiment were kept the same as in the provided scripts.
I'll be very glad to get your replies.