Closed: aasthavar closed this issue 3 weeks ago
Are you using the same versions and the same code, or did you change something?
Okay, I first had to install the latest flash-attn library to get rid of this error (when I ran the notebook as-is):
ImportError: /opt/conda/envs/pytorch/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14ExchangeDeviceEa
```python
# Install PyTorch for FSDP and FA/SDPA
%pip install --quiet "torch==2.2.2" tensorboard
# Install Hugging Face libraries
%pip install --upgrade "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"
# I added these two lines:
%pip install flash-attn --no-build-isolation
%pip install "torch==2.3.1"
```
No change in code.
Is there a specific flash-attn version I should be using?
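For context, an `undefined symbol` ImportError from `flash_attn_2_cuda` usually means the flash-attn binary wheel was compiled against a different torch major.minor than the torch you have installed; that would explain why pinning torch 2.2.2 broke and reinstalling torch 2.3.1 fixed it. A hypothetical helper just to illustrate the rule (not part of either library):

```python
def abi_compatible(torch_built_against: str, torch_installed: str) -> bool:
    """flash-attn wheels are compiled against a specific torch major.minor;
    running under a different one typically raises an 'undefined symbol'
    ImportError like the one above."""
    return torch_built_against.split(".")[:2] == torch_installed.split(".")[:2]

print(abi_compatible("2.3.0", "2.2.2"))  # False: expect the ImportError
print(abi_compatible("2.3.0", "2.3.1"))  # True: patch versions usually match ABI
```

So rather than one blessed flash-attn version, the safest bet is to install flash-attn after the final torch version is in place, with `--no-build-isolation` so it builds or resolves against that torch.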
@philschmid No worries, I was able to make it work. I changed tf32 from true to false and did a quick test with max_steps=10; the script ran to completion.
This is weird; the combination of bf16: true and tf32: true usually works, but here it didn't. I wonder why?
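For anyone comparing, these are the two mixed-precision keys in question in the training config YAML (layout assumed from the blog; values shown are the workaround that ran for me):

```yaml
bf16: true   # bfloat16 mixed precision
tf32: false  # changed from true; true triggered the dtype error here
```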
@philschmid I encountered the same issue. However, when I changed bf16: true to bf16: false and tf32: true to tf16: true, it started working. I have another query: I am trying to fine-tune the Llama-3 8B model on GPUs with 15 GB of RAM each, specifically 4 NVIDIA T4s. I was running the same code you provided in the blog, but the entire model was being loaded onto a single GPU, causing a GPU out-of-memory error. Do you have any suggestions?
T4 GPUs don't support bf16 or TF32, so that's expected.
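Background on why: bf16 and TF32 tensor operations require NVIDIA Ampere (compute capability 8.0) or newer, and the T4 is Turing (compute capability 7.5). At runtime you can check with `torch.cuda.is_bf16_supported()`; the capability rule itself, as a hypothetical standalone check:

```python
def supports_bf16_tf32(major: int, minor: int) -> bool:
    # bf16/TF32 tensor ops need compute capability 8.0 (Ampere) or newer.
    return (major, minor) >= (8, 0)

print(supports_bf16_tf32(7, 5))  # False -- T4 (Turing)
print(supports_bf16_tf32(8, 0))  # True  -- A100 (Ampere)
print(supports_bf16_tf32(9, 0))  # True  -- H100 (Hopper)
```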
@philschmid What about the training on the 4 x 15 GB GPUs? I am using a smaller model (8B).
I have the same error on 4 H100 GPUs. Setting tf32 to false doesn't solve anything, and neither does tf16: true as in https://github.com/philschmid/deep-learning-pytorch-huggingface/issues/55#issuecomment-2164420616.
Hi @philschmid! Thank you for the blog, it's very helpful.
I am trying to reproduce the results exactly: I followed the blog and installed the libraries with the same versions.
I'm running into the following issue:
ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
Someone mentioned here that setting FSDP_CPU_RAM_EFFICIENT_LOADING=1 should solve it, but that is already set in the torchrun command as per the blog.
Pretty much clueless. Any suggestions would be really helpful.
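For what it's worth, the error comes from FSDP flattening a group of parameters into one contiguous buffer, which requires every tensor in the group to share a dtype; it typically means some module (often LoRA adapters or norm layers) stayed in float32 while the rest of the model loaded in bfloat16. A toy illustration of the check (hypothetical helper, not the actual FSDP code):

```python
def flatten_dtype(dtypes: list) -> str:
    """Mimic FSDP's uniform-dtype requirement when building a flat parameter."""
    unique = set(dtypes)
    if len(unique) != 1:
        raise ValueError(
            f"Must flatten tensors with uniform dtype but got {sorted(unique)}"
        )
    return unique.pop()

print(flatten_dtype(["torch.bfloat16"] * 4))  # fine: one dtype
# flatten_dtype(["torch.bfloat16", "torch.float32"])  # raises the ValueError above
```

A practical first step is to print `{p.dtype for p in model.parameters()}` right before training starts and see which parameters are still float32.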