philschmid / deep-learning-pytorch-huggingface

Does this work for Llama2 - Fine-tune Falcon 180B with DeepSpeed ZeRO, LoRA & Flash Attention? #37

Open ibicdev opened 10 months ago

ibicdev commented 10 months ago

Thanks Phil for the great post "Fine-tune Falcon 180B with DeepSpeed ZeRO, LoRA & Flash Attention". When I tried to change Falcon to Llama 2 (I tried all 3 model sizes), I always got "CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)". Should there be more changes than just the model name to make it work? Or will you have a follow-up post about fine-tuning Llama 2 with DeepSpeed + LoRA?

philschmid commented 10 months ago

Seems to be a hardware and environment issue unrelated to the code. I used CUDA 11.8.

ibicdev commented 10 months ago

I am also using CUDA 11.8, and PyTorch 2.0.1 built for CUDA 11.8. I also tried the PyTorch nightly and got the same error. --use_flash_attn False didn't make a difference either. The error is RuntimeError: CUDA error: device-side assert triggered, followed by about a hundred lines of

../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [313,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.

This error looks similar to https://github.com/lm-sys/FastChat/issues/199; I tried their suggestions and none of them worked. One explanation on that thread is that the vocab size causes an embedding lookup out-of-bounds error, though the vocab size seems to be fixed already in Llama 2.
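
A minimal sketch, assuming access to the gated Hub repo (the model id is just the one from this thread), of how one might check for that out-of-bounds condition without loading any weights:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # the id used in this thread

tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

print("tokenizer vocab:", len(tokenizer))      # includes any added tokens, e.g. a pad token
print("embedding rows :", config.vocab_size)   # size of the model's embedding table

max_id = max(tokenizer.get_vocab().values())
if max_id >= config.vocab_size:
    # Token ids past the embedding table trigger exactly this kind of device-side assert.
    print(f"token id {max_id} would index past the embedding table -> "
          "call model.resize_token_embeddings(len(tokenizer)) after adding tokens")
```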

philschmid commented 10 months ago

Does the example work without any code changes?

ibicdev commented 10 months ago

Yes, it worked well without any code change.

philschmid commented 9 months ago

What change did you make?

ibicdev commented 9 months ago

The only change I made is --model_id, from tiiuae/falcon-180B to meta-llama/Llama-2-70b-hf. The full command is:

torchrun --nproc_per_node 8 run_ds_lora.py \
  --model_id meta-llama/Llama-2-70b-hf \
  --dataset_path dolly-processed \
  --output_dir falcon-180b-lora-fa \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --learning_rate 4e-3 \
  --gradient_checkpointing True \
  --gradient_accumulation_steps 8 \
  --bf16 True \
  --tf32 True \
  --use_flash_attn True \
  --lr_scheduler_type "constant_with_warmup" \
  --logging_steps 25 \
  --save_steps 100 \
  --save_total_limit 3 \
  --deepspeed configs/ds_falcon_180b_z3.json
philschmid commented 9 months ago

Did you make changes to the flash attention patch? The example only works with Falcon, since it has a custom patch to use flash attention.
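
To make that concrete, here is a hedged sketch (the helper is hypothetical, not code from this repo, and it assumes Hub access to the gated checkpoints) of the assumption the Falcon patch bakes in: it swaps in a Falcon-specific attention forward, so it does not apply to any other architecture.

```python
from transformers import AutoConfig

def supports_falcon_patch(model_id: str) -> bool:
    # Hypothetical guard: the flash attention patch targets Falcon's attention class,
    # so it only makes sense when the checkpoint's model type is "falcon".
    return AutoConfig.from_pretrained(model_id).model_type == "falcon"

print(supports_falcon_patch("tiiuae/falcon-180B"))        # True  -> patch applies
print(supports_falcon_patch("meta-llama/Llama-2-70b-hf")) # False -> Llama needs its own patch
```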

ibicdev commented 9 months ago

Ah, I didn't. I saw your code at https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/utils/peft_utils.py#L38-L41 and thought it was already taken care of.

Also, even when I used --use_flash_attn False, I still got the same error.

ibicdev commented 9 months ago

Excited to see flash-attn 2 natively supported in transformers! Do you plan to update this post to work with this new feature?

philschmid commented 9 months ago

Yes! 👍🏻 I plan to update all my posts and remove those patches once there is an official release.
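
For anyone who lands here later, a rough sketch of what the native support looks like, assuming transformers >= 4.34 with the flash-attn package installed (newer releases spell the flag attn_implementation="flash_attention_2"):

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch only: native Flash Attention 2 support replaces the model-specific patch.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",   # the id from this thread; any supported model works
    torch_dtype=torch.bfloat16,    # flash-attn kernels require fp16 or bf16
    use_flash_attention_2=True,    # newer releases: attn_implementation="flash_attention_2"
)
```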

ibicdev commented 9 months ago

Great! Looking forward to the updates.