tfzhou closed this issue 1 year ago.
Just found that the issue stems from flash-attention. After switching to train.py, the pretraining works properly.
@haotian-liu, which version of flash-attention are you using? Mine is 2.0.4; should I use 1.x instead? Beyond this, do you observe performance differences between train.py and train_mem.py?
@tfzhou I have locally tested this again on 2x 3090, per-device batch size 16, with llama-2-7b-chat. train.py and train_mem.py work similarly for me. I am using flash-attention 2.0.4, PyTorch 2.0.1, and CUDA 11.7.
One thing to make sure of is that the CUDA version of PyTorch and the nvcc you use to compile flash-attention are the same. (Please kindly let me know if this is the case, so that other community members can benefit as well :)
You may choose to downgrade to flash-attention 1.x; our codebase currently supports both 1.x and 2.x for A100s.
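For anyone checking this, a quick way to compare the two is a short script like the one below (a minimal sketch; it assumes the nvcc on your PATH is the same toolkit that compiled flash-attention):

```python
# Sanity check: PyTorch's CUDA build vs. the nvcc toolkit on PATH (illustrative sketch).
import subprocess

import torch

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)  # e.g. '11.7'

# Assumption: the nvcc found on PATH is the one used to compile flash-attention locally.
nvcc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
print(nvcc.stdout.strip())
```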
Also, I attached the log of the first 35 training steps on 2x 3090 (total batch size: 16x2=32). It seems that your LR is not decaying correctly, since the warmup should only be 3% of total steps.
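For reference, with a 3% warmup and cosine decay the LR should ramp up briefly and then fall monotonically. Here is a minimal sketch using transformers' cosine scheduler (the step count and peak LR below are made-up placeholders, not the actual pretraining settings):

```python
# Illustrative sketch of a 3% warmup + cosine decay schedule (placeholder numbers).
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 2000                       # hypothetical total number of optimizer steps
warmup_steps = int(0.03 * total_steps)   # 3% warmup, as mentioned above

model = torch.nn.Linear(8, 8)            # dummy module, only needed to build an optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)

for step in range(total_steps):
    optimizer.step()
    scheduler.step()
    if step % 200 == 0:
        # LR should rise linearly for the first ~60 steps, then decay along a cosine curve.
        print(f"step {step:4d}: lr = {scheduler.get_last_lr()[0]:.2e}")
```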
Also, please check the transformers version and the other pinned dependencies:
"deepspeed==0.9.5",
"peft==0.4.0",
"transformers==4.31.0",
"accelerate==0.21.0",
"bitsandbytes==0.41.0",
Thanks @haotian-liu.
One thing to make sure of is that the CUDA version of PyTorch and the nvcc you use to compile flash-attention are the same.
I am pretty sure that different versions are used in my setup. I will try to fix this and let you know.
Btw, after switching to train.py, the training works as expected and LR decay is no longer an issue.
@tfzhou I see. The only drawback of using train.py is that it is slower and uses more memory, which will be more noticeable when you switch to finetune mode.
After recompiling flash-attention with a matching nvcc, the issue is fixed. Thanks @haotian-liu.
@haotian-liu Could you post your full training log for the pretraining stage, for reference?
Describe the issue
Hi Haotian, thanks for your efforts on the project. At the moment I am trying to reproduce the pretraining stage, but I am stuck. I have tried training from various language models (vicuna-7b-v1.3/v1.5, Llama-2-7b-chat-hf) using deepspeed with zero2 or zero3 configurations. Unfortunately, these experiments did not go well: the training loss fails to converge, and the LR schedule did not follow 'cosine' as specified in the command. I am unfamiliar with deepspeed and unsure whether the issue comes from it. More details are provided below; I appreciate your help. Btw, I used 4 A100 GPUs with 40GB memory for the experiments.
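(One way to confirm which LR is actually being applied is to log it from a small callback at each logging step. Below is a minimal sketch, assuming the Hugging Face Trainer; LRMonitor is just an illustrative name.)

```python
# Diagnostic sketch: print the learning rate the Trainer reports at each logging step.
from transformers import TrainerCallback

class LRMonitor(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None and "learning_rate" in logs:
            print(f"step {state.global_step}: lr = {logs['learning_rate']:.2e}")

# Hypothetical usage: trainer.add_callback(LRMonitor()) before calling trainer.train().
```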
Command:
zero2 (not changed)
Screenshots: