Try using the AdamW 8-bit optimizer from:
https://github.com/TimDettmers/bitsandbytes/tree/ec5fbf4cc44324829307138a4c17fd88dddd9803
After installation, just add this flag to the script call:
`--optim adamw_bnb_8bit`
The current Transformers version natively supports bitsandbytes.
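For anyone setting this up, a minimal sketch assuming a standard pip-based install (the specific commit linked above can instead be built from source):

```bash
# Install bitsandbytes so the Hugging Face Trainer can use its 8-bit AdamW;
# the optimizer is then selected with --optim adamw_bnb_8bit as shown above.
pip install bitsandbytes
```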
With the AdamW 8-bit optimizer I run training on 4x Quadro RTX 8000 48GB. The RTX 8000 isn't an Ampere GPU, so instead of bf16 and tf32 low precision, I use fp16.
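A quick way to check which precision path applies on a given machine (a sketch; `torch.cuda.is_bf16_supported()` is the standard PyTorch check):

```bash
# Prints True on Ampere-or-newer GPUs (keep --bf16/--tf32); prints False on
# older cards such as the Quadro RTX 8000, where --fp16 is the fallback.
python -c "import torch; print(torch.cuda.is_bf16_supported())"
```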
> With the AdamW 8-bit optimizer I run training on 4x Quadro RTX 8000 48GB
That doesn't help us to fine-tune on a single 24GB RTX 3090, no?
Is it possible to fine-tune the 7B model using 8x 3090? I had set the options shown in my script below, but still got OOM:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 0; 23.70 GiB total capacity; 22.21 GiB already allocated; 127.56 MiB free; 22.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
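As the message itself hints, allocator fragmentation can sometimes be reduced by setting `max_split_size_mb` through the `PYTORCH_CUDA_ALLOC_CONF` environment variable; a sketch below (the value 128 is only an example, not a tuned recommendation, and it will not help if the model genuinely does not fit):

```bash
# Limit the size of blocks the CUDA caching allocator will split, to reduce
# fragmentation; export this before launching torchrun.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```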
My script is as follows:
```bash
torchrun --nproc_per_node=4 --master_port=12345 train.py \
    --model_name_or_path ../llama-7b-hf \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True
```
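For what it's worth, combining this script with the 8-bit optimizer suggested earlier would look roughly like the sketch below; the only changes are the process count (to match the 8x 3090 setup) and the added `--optim` flag, and I can't promise this alone is enough to fit the 7B model:

```bash
# Same launch as the script above, with two changes (a sketch, not a guaranteed fix):
#   * --nproc_per_node=8 to use all eight 3090s
#   * --optim adamw_bnb_8bit to switch to the 8-bit AdamW from bitsandbytes
torchrun --nproc_per_node=8 --master_port=12345 train.py \
    --model_name_or_path ../llama-7b-hf \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --optim adamw_bnb_8bit \
    --tf32 True
```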