philschmid / deep-learning-pytorch-huggingface


Not able to run training/fsdp-qlora-distributed-llama3.ipynb #55

Closed: aasthavar closed this issue 3 weeks ago

aasthavar commented 1 month ago

Hi @philschmid! Thank you for the blog. It's very helpful.

I am trying to reproduce the results as-is. I followed the blog and installed the libraries with the same versions.

I am running into the following issue: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

Someone mentioned that setting FSDP_CPU_RAM_EFFICIENT_LOADING=1 here should solve it, but this is already set in the torchrun command as per the blog.
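
For reference, I am launching roughly like this (script and config names copied from the blog, so adjust them and --nproc_per_node if yours differ):

# Launch command following the blog; FSDP_CPU_RAM_EFFICIENT_LOADING is exported right here
ACCELERATE_USE_FSDP=1 FSDP_CPU_RAM_EFFICIENT_LOADING=1 \
torchrun --nproc_per_node=4 ./scripts/run_fsdp_qlora.py --config llama_3_70b_fsdp_qlora.yaml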

Pretty much clueless. Any suggestions would be really helpful.

philschmid commented 1 month ago

Are you using the same versions? The same code? Or did you change something?

aasthavar commented 1 month ago

Okay, I first had to install the latest flash-attn library to get rid of this error (when I ran the notebook as-is): ImportError: /opt/conda/envs/pytorch/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14ExchangeDeviceEa

# Install Pytorch for FSDP and FA/SDPA
%pip install --quiet "torch==2.2.2" tensorboard

# Install Hugging Face libraries
%pip install --upgrade "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"

# I added these two installs to get past the ImportError
%pip install flash-attn --no-build-isolation
%pip install "torch==2.3.1"

No change in code.

Is there a specific flash-attn version I should be using?
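
In case it helps with debugging, this is how I am checking which versions actually ended up installed (flash-attn has to be built against the same torch it runs with, otherwise the undefined-symbol error shows up):

# Print the installed torch and flash-attn versions
python -c "import torch, flash_attn; print(torch.__version__, flash_attn.__version__)"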

aasthavar commented 1 month ago

@philschmid No worries, I was able to make it work. I changed tf32 from true to false in the config. Did a quick test with max_steps=10, and the script ran to completion.

This is weird; usually the combination of bf16: true and tf32: true works, but here it didn't. I wonder why?
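
For anyone else hitting this, the only change was the precision flags in the YAML passed via --config (file name taken from the blog, so adjust it to yours):

# Show the precision flags in the training config after the fix
grep -E "^(bf16|tf32):" llama_3_70b_fsdp_qlora.yaml
# expected output after the change:
# bf16: true
# tf32: false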

lakshya-B commented 1 month ago

@philschmid I encountered the same issue. However, when I changed bf16: true to bf16: false and tf32: true to tf16: true, it started working.

I have another query: I am trying to fine-tune the Llama 3 8B model on GPUs with 15 GB of RAM each, specifically 4x NVIDIA T4s. I was running the same code from the blog, but the entire model was being loaded onto a single GPU, causing a GPU out-of-memory error. Do you have any suggestions?

philschmid commented 1 month ago

T4 GPUs do not support bf16 or TF32, so that's expected.
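
A quick way to verify this on a given machine (T4s are compute capability 7.5, while bf16 and TF32 need Ampere, i.e. compute capability 8.0 or newer):

# Prints the GPU compute capability and whether bf16 is usable;
# a T4 reports (7, 5) and False, an A100/H100 reports (8, 0)/(9, 0) and True
python -c "import torch; print(torch.cuda.get_device_capability(0), torch.cuda.is_bf16_supported())"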

lakshya-B commented 1 month ago

@philschmid regarding training with the 4x 15 GB GPUs, what do you think? I am using a smaller model (8B).

Oliph commented 2 weeks ago

I have the same error on 4 H100 GPUs. Setting tf32 to false does not solve anything, and neither does tf16: true as in https://github.com/philschmid/deep-learning-pytorch-huggingface/issues/55#issuecomment-2164420616.