tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0
29.32k stars · 4.03k forks

Problem with finetuning bloom #111

Open raihan0824 opened 1 year ago

raihan0824 commented 1 year ago

What is the fsdp_transformer_layer_cls_to_wrap for BLOOM?

When I tried to fine-tune with bloomz-7b1, the training got stuck at 0%. As the README says, this is most likely because I didn't set the right fsdp_transformer_layer_cls_to_wrap, but I can't find it in the BLOOM config.

Kindly need help with this. Thank you
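The README names the flag but not where the value comes from: it is the class name of the model's repeated transformer layer. A minimal sketch of how one could discover candidate names by counting repeated module classes (the helper and the stand-in model below are illustrative, not from this repo; on a real `transformers` model you would pass the loaded model, whose `named_modules()` has the same shape):

```python
from collections import Counter

def candidate_wrap_classes(model, min_repeats=2):
    """Count how often each module class appears; a class repeated once
    per transformer layer (e.g. BloomBlock x 30) is a wrap candidate."""
    counts = Counter(type(m).__name__ for _, m in model.named_modules())
    return [name for name, n in counts.most_common() if n >= min_repeats]

# Stand-in objects mimicking torch.nn.Module.named_modules(),
# so this example runs without torch or model weights.
class FakeModule:
    def __init__(self, children=()):
        self._children = list(children)
    def named_modules(self):
        yield "", self
        for i, child in enumerate(self._children):
            for name, mod in child.named_modules():
                yield f"{i}.{name}".rstrip("."), mod

class BloomBlock(FakeModule):
    pass

model = FakeModule([BloomBlock() for _ in range(30)])
print(candidate_wrap_classes(model))  # -> ['BloomBlock']
```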

frankzhao112 commented 1 year ago

I have the same question. Does the training code here only support LLaMA or OPT models? Can we finetune BLOOM using its official training framework with the stanford_alpaca data?

raihan0824 commented 1 year ago

any help on this?

frankzhao112 commented 1 year ago

No, I have the same issue. Do you know BELLE? They use BLOOM as the base model instead of LLaMA.

raihan0824 commented 1 year ago

> No, I have the same issue. Do you know BELLE? They use BLOOM as the base model instead of LLaMA.

I've read it, and it's exactly what I'm looking for. However, I can't find the finetuning script. Any help with this?

raihan0824 commented 1 year ago

It seems the finetuning script refers back to this repo, based on https://github.com/LianjiaTech/BELLE/issues/26, which is exactly our problem.

quanliu1991 commented 1 year ago

I have the same issue.

frankzhao112 commented 1 year ago

Are you Chinese? Then let's just speak Chinese.

frankzhao112 commented 1 year ago

You can check the BLOOM training code on the BLOOM GitHub. BLOOM has already open-sourced its training code; I think you can find it there.

floodsung commented 1 year ago

Change to this: --fsdp_transformer_layer_cls_to_wrap 'BloomBlock' and it works.

weberrr commented 1 year ago

> Change to this: --fsdp_transformer_layer_cls_to_wrap 'BloomBlock' and it works.

Thanks, but I still get an error: the size of tensor a (256905216) must match the size of tensor b (1027620864). Is there a hyperparameter that needs to be fixed?
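For what it's worth, the two sizes in that error differ by an exact integer factor, which usually points at a shape or sharding mismatch (e.g. world size or hidden-size scaling) rather than a training hyperparameter. A quick check of the numbers from the error message:

```python
# The two tensor sizes from the mismatch error, divided exactly.
a, b = 256905216, 1027620864
print(b // a, b % a)  # -> 4 0  (b is exactly 4x a)
```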

raihan0824 commented 1 year ago

> Change to this: --fsdp_transformer_layer_cls_to_wrap 'BloomBlock' and it works.

I still get the same error. What type of BLOOM model are you running? Can you please share the training script?

raihan0824 commented 1 year ago

how do you run it?

raihan0824 commented 1 year ago

I used this to run the original training script:

torchrun --nproc_per_node=3 --master_port=5001 train.py \
    --model_name_or_path bigscience/bloomz-7b1 \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./model_trained \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap ‘BloomBlock‘ \
    --tf32 True

and got this error: Exception: Could not find the transformer layer class to wrap in the model.
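One possible cause, not confirmed in the thread: the command as pasted wraps the class name in curly quotes (‘BloomBlock‘). The shell only treats ASCII quotes as quoting characters, so the curly quotes are passed through literally and the trainer looks for a class literally named ‘BloomBlock‘, which never matches. Re-typing the flag with plain ASCII quotes ('BloomBlock') would avoid this. A small illustration:

```python
# Curly quotes are not shell quoting characters, so they reach the
# program as part of the argument string itself.
arg = "‘BloomBlock‘"              # what the shell passes through
print(arg == "BloomBlock")        # -> False: never matches the class name
print(arg.strip("‘’"))            # -> BloomBlock
```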

weberrr commented 1 year ago

> how do you run it?

I can run my code; it loads the model and data, but I still get a memory error like this:

CUDA out of memory. Tried to allocate 770.00 MiB (GPU 0; 79.35 GiB total capacity; 75.33 GiB already allocated; 679.19 MiB free; 77.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management an
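The OOM message itself suggests max_split_size_mb. One low-cost mitigation is to set PyTorch's caching-allocator config in the environment before CUDA is initialised; the value 128 below is illustrative, not tuned:

```python
# Configure the CUDA caching allocator before torch initialises CUDA;
# a smaller max split size can reduce fragmentation at some speed cost.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])  # -> max_split_size_mb:128
```

The same setting can be exported in the shell before torchrun instead of in Python.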

weberrr commented 1 year ago

Upgrade to transformers>=4.23 and try again.

raihan0824 commented 1 year ago

I use transformers 4.27.4
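For reference, 4.27.4 does satisfy the >=4.23 suggestion, so the version is not the problem here. A tiny stdlib comparison (a naive sketch; it ignores pre-release tags that a real version parser would handle):

```python
def version_at_least(installed: str, required: str) -> bool:
    """Naive numeric comparison of plain dotted version strings."""
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) >= parse(required)

print(version_at_least("4.27.4", "4.23"))  # -> True
```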

raihan0824 commented 1 year ago

> > how do you run it?
>
> I can run my code; it loads the model and data, but I still get a memory error like this:
>
> CUDA out of memory. Tried to allocate 770.00 MiB (GPU 0; 79.35 GiB total capacity; 75.33 GiB already allocated; 679.19 MiB free; 77.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management an

It's because you lack GPU memory; try running it with more GPUs.
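A rough back-of-envelope supports this: full finetuning of a ~7B model with Adam is tight even on 80 GiB cards. The parameter count and per-parameter byte costs below are assumptions, and activation memory and allocator overhead are ignored entirely:

```python
# Per-parameter cost: weights (bf16) + gradients (bf16) + Adam m and v (fp32).
params = 7.1e9                  # approx. bloomz-7b1 parameter count
total = params * (2 + 2 + 8)    # bytes for weights + grads + optimizer state
per_gpu = total / 3             # FSDP full_shard across 3 GPUs
print(f"{total / 2**30:.0f} GiB total, {per_gpu / 2**30:.0f} GiB per GPU")
# -> 79 GiB total, 26 GiB per GPU (before activations)
```

Activations at batch size 4 with long sequences can easily consume the remainder, which is consistent with the 75+ GiB already allocated in the error above.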

weberrr commented 1 year ago

I think your code is OK.