huggingface / peft


Help with: LoRA issue in distributed setting #1794

Closed alielfilali01 closed 4 months ago

alielfilali01 commented 4 months ago

System Info

Hello there, I'm trying to follow this tutorial from the documentation to fine-tune a model in a distributed setting (currently testing with a 7B model). I'm running the training in Hugging Face Spaces with a Jupyter Docker image and 4 L4 GPUs (using the terminal, not the notebook).


The error is simply ModuleNotFoundError: No module named 'torch._six'
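
For what it's worth, the failing import can be reproduced on its own, without the training script (an illustrative check, assuming torch and deepspeed are already installed in the environment):

# Reproduces the same ModuleNotFoundError directly, without going through accelerate/deepspeed launch.
python -c "import torch._six"

# Print the versions that are relevant to this error.
python -c "import torch, deepspeed; print(torch.__version__, deepspeed.__version__)"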

Who can help?

@pacman100 and @stevhliu


Reproduction

My script so far:

git clone https://github.com/huggingface/peft.git
cd peft

# trying to make sure everything is installed
pip install -r requirements.txt
pip install -r examples/sft/requirements_colab.txt
pip install -r examples/sft/requirements.txt

accelerate config --config_file deepspeed_config.yaml

accelerate launch --config_file "deepspeed_config.yaml"  examples/sft/train.py \
--seed 100 \
--model_name_or_path "meta-llama/Llama-2-7b-hf" \
--dataset_name "AbderrahmanSkiredj1/moroccan_darija_wikipedia_dataset" \
--chat_template_format "none" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "llama2-7b-wiki-ary-sft-lora-deepspeed" \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 4 \
--gradient_checkpointing True \
--use_reentrant False \
--dataset_text_field "content" \
--use_flash_attn True \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization False
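
For completeness: accelerate config writes deepspeed_config.yaml interactively, so the exact file isn't shown above. Instead of answering the prompts, the file could also be written directly; a minimal sketch for 4 GPUs with ZeRO-3 and bf16 follows (the values here are illustrative assumptions, not the config actually used on my machine):

# Hypothetical deepspeed_config.yaml, written non-interactively for this run.
cat > deepspeed_config.yaml <<'EOF'
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
use_cpu: false
EOF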

Expected behavior

To finish training and push the adapter to the Hub!

BenjaminBossan commented 4 months ago

Based on the error message, it's almost certainly not a PEFT issue; the same error would occur even without PEFT. Instead, it looks like your DeepSpeed and PyTorch versions don't match: torch._six was removed in recent PyTorch releases, and older DeepSpeed versions still try to import it. In this order, try upgrading the following packages and check whether this solves the problem (a rough sketch of the commands follows the list):

  1. deepspeed
  2. torch
  3. accelerate
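
Something along these lines, assuming a pip-based environment (pin versions as needed for your CUDA build):

# Upgrade one package at a time and re-run the accelerate launch command after each step.
pip install -U deepspeed
pip install -U torch
pip install -U accelerate

# Verify which versions ended up in the environment.
python -c "import torch, deepspeed, accelerate; print(torch.__version__, deepspeed.__version__, accelerate.__version__)"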

Oh, and a tip for the future: when writing an issue, paste the error text instead of posting screenshots.

alielfilali01 commented 4 months ago

Thanks dear @BenjaminBossan 🤗