lorabit110 opened this issue 1 year ago
It's likely caused by the multi-GPU system (AWS EC2 p3.8xlarge with 4 V100s) I used. I tried using another single-GPU VM (AWS EC2 g3.2xlarge with 1 A10) and it worked: the train loss is no longer zero and the eval loss is no longer NaN.
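In case it helps anyone else who hits this on a multi-GPU box before proper support lands, a possible stopgap (which I have not tested beyond switching VMs) is to hide the extra GPUs from the process so the script only ever sees one device. A minimal sketch, assuming the standard CUDA_VISIBLE_DEVICES mechanism rather than anything specific to finetune.py:

```python
# Stopgap sketch: expose only the first GPU to the process so the training
# script behaves as if it were on a single-GPU machine. This has to run
# before torch (or anything else that initializes CUDA) is imported.
import os
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

import torch
print(torch.cuda.device_count())  # should report 1 even on a 4-GPU p3.8xlarge
```

Setting it on the command line works just as well, e.g. `CUDA_VISIBLE_DEVICES=0 python src/finetune.py ...`.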
To be honest, I built this for running on a single-GPU system, so the code puts the entire model on the first GPU. I will try to get around to implementing multi-GPU support as soon as possible, but I am working on some other things as well. If you don't want to wait, the DeepSpeed library will most likely be the move. I wouldn't recommend Hugging Face's built-in naive approach (device_map="auto", etc.), since it splits the model across GPUs but runs them one after another, which will actually slow down training: each GPU sits idle while another one is working.
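To make that concrete, here is a rough sketch of the two loading strategies. It is not the exact code in finetune.py, and the model name and dtype are just placeholders:

```python
import torch
from transformers import AutoModelForCausalLM

# What the repo currently does, roughly: the whole model lives on cuda:0,
# so any additional GPUs are simply unused.
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda:0")

# The naive multi-GPU alternative: device_map="auto" shards the layers
# across GPUs, but the forward pass still runs those shards one after the
# other, so only one GPU is busy at any moment -- hence the slowdown.
# model = AutoModelForCausalLM.from_pretrained(
#     "mosaicml/mpt-7b-instruct",
#     torch_dtype=torch.float16,
#     trust_remote_code=True,
#     device_map="auto",
# )
```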
Never mind, I ended up deciding to give it a go. Just install DeepSpeed and run with the deepspeed command instead of python and you're set. Example:

```
deepspeed src/finetune.py \
    --base_model 'mosaicml/mpt-7b-instruct' \
    --data_path 'yahma/alpaca-cleaned' \
    --output_dir './lora-mpt' \
    --lora_target_modules '[Wqkv]' \
    --lora_r 8 \
    --cutoff_len 768 \
    --batch_size 128 \
    --micro_batch_size 8
```
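For context on why this helps: the deepspeed launcher starts one worker process per GPU and splits each batch across them (data parallelism), so all four V100s on a p3.8xlarge stay busy instead of taking turns. If you want to limit how many devices it uses, the launcher also takes a `--num_gpus` flag, e.g. `deepspeed --num_gpus=2 src/finetune.py ...`.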
Let me know if you have any further issues.
I tried LoRA tuning mpt-7b and mpt-7b-instruct. I get a run summary like this:

```
wandb: Run summary:
wandb: eval/loss nan
wandb: eval/runtime 37.2157
wandb: eval/samples_per_second 53.741
wandb: eval/steps_per_second 1.693
wandb: train/epoch 0.6
wandb: train/global_step 234
wandb: train/total_flos 2.9642010569107046e+17
wandb: train/train_loss 0.0
wandb: train/train_runtime 1465.8832
wandb: train/train_samples_per_second 20.367
wandb: train/train_steps_per_second 0.16
```
But the train/loss is always 0 and eval/loss is always nan. Also, when I load the model using generate.py, it always generates "&". I have tried both yahma/alpaca-cleaned and a manually created simple dataset.
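If it helps narrow things down, something like the callback below could be dropped into the training script to flag the first step at which the loss goes to 0 or NaN, instead of only seeing it in the final wandb summary. This is just a debugging sketch and assumes finetune.py uses a transformers.Trainer; adapt as needed:

```python
import math

from transformers import TrainerCallback

class BadLossCallback(TrainerCallback):
    """Print a warning the moment a logged loss becomes 0 or NaN."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        for key in ("loss", "eval_loss"):
            value = (logs or {}).get(key)
            if value is not None and (value == 0.0 or math.isnan(value)):
                print(f"[step {state.global_step}] suspicious {key}: {value}")

# Attach it after the Trainer is constructed:
# trainer.add_callback(BadLossCallback())
```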