johnsmith0031 / alpaca_lora_4bit

MIT License

Crashes during finetuning #131

Open gameveloster opened 1 year ago

gameveloster commented 1 year ago

I am trying to do a 30B finetune on 2x3090 using data parallel, and the process always crashes before 3 steps are completed. I run the finetune script in a screen session on a remote computer, and the screen session is gone when I re-establish the SSH connection after the crash.

This is the command I use to start the finetune:

torchrun --nproc_per_node=2 --master_port=1234 finetune.py ./testdocs.txt \
    --ds_type=txt \
    --lora_out_dir=./loras/ \
    --llama_q4_config_dir=./models/Neko-Institute-of-Science_LLaMA-30B-4bit-128g \
    --llama_q4_model=./Neko-Institute-of-Science_LLaMA-30B-4bit-128g/llama-30b-4bit-128g.safetensors \
    --mbatch_size=1 \
    --batch_size=2 \
    --epochs=3 \
    --lr=3e-4 \
    --cutoff_len=256 \
    --lora_r=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --warmup_steps=5 \
    --save_steps=50 \
    --save_total_limit=3 \
    --logging_steps=5 \
    --groupsize=-1 \
    --xformers \
    --backend=cuda \
    --grad_chckpt

Has anyone else gotten the same crashing problem?

ghost commented 1 year ago

Probably because you ran out of VRAM. Try using a batch size of 1.
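
If it is an OOM, one quick way to confirm (a sketch, not something confirmed in this thread; `watch` and `nvidia-smi` are standard tools) is to monitor VRAM in a second terminal while rerunning with the batch size lowered:

    # In a second shell, refresh GPU memory usage every second so an OOM spike is visible
    watch -n 1 nvidia-smi

    # Then rerun the torchrun command above with --batch_size=2 lowered to --batch_size=1
    # (keeping --mbatch_size=1); if the crash goes away, it was almost certainly VRAM.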

johnsmith0031 commented 1 year ago

I think you should keep the SSH session open, or run the task in the background.
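
For example, a minimal sketch of running it in the background so the process survives an SSH disconnect and the crash message is kept in a log (the session and log names here are just examples):

    # Option 1: start the finetune inside a detached, named screen session and log its output;
    # reattach later with `screen -r finetune`, or read finetune.log after a crash.
    # Replace "<torchrun command from above>" with the exact command shown earlier in this issue.
    screen -dmS finetune bash -c '<torchrun command from above> 2>&1 | tee finetune.log'

    # Option 2: detach from the terminal with nohup instead of screen and follow the log:
    # nohup <torchrun command from above> > finetune.log 2>&1 &
    # tail -f finetune.log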