Try using the AdamW 8-bit optimizer from:
https://github.com/TimDettmers/bitsandbytes/tree/ec5fbf4cc44324829307138a4c17fd88dddd9803
After installation, just add this flag to the script call:
`--optim adamw_bnb_8bit`
The current Transformers version natively supports bitsandbytes.
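For anyone setting this up, a minimal sketch assuming a standard pip-based install (the specific commit linked above can instead be built from source):

```bash
# Install bitsandbytes so the Hugging Face Trainer can use its 8-bit AdamW;
# the optimizer is then selected with --optim adamw_bnb_8bit as shown above.
pip install bitsandbytes
```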
With the AdamW 8-bit optimizer I run training on 4x Quadro RTX 8000 48GB. The RTX 8000 isn't an Ampere GPU, so instead of bf16 and tf32 low precision, I use fp16.
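A quick way to check which precision path applies on a given machine (a sketch; `torch.cuda.is_bf16_supported()` is the standard PyTorch check):

```bash
# Prints True on Ampere-or-newer GPUs (keep --bf16/--tf32); prints False on
# older cards such as the Quadro RTX 8000, where --fp16 is the fallback.
python -c "import torch; print(torch.cuda.is_bf16_supported())"
```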
> With the AdamW 8-bit optimizer I run training on 4x Quadro RTX 8000 48GB
That doesn't help us to fine-tune on a single 24GB RTX 3090, no?
Is it possible to fine-tune the 7B model using 8x 3090? I had set the options shown in my script below, but still got OOM:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 0; 23.70 GiB total capacity; 22.21 GiB already allocated; 127.56 MiB free; 22.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
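As the message itself hints, allocator fragmentation can sometimes be reduced by setting `max_split_size_mb` through the `PYTORCH_CUDA_ALLOC_CONF` environment variable; a sketch below (the value 128 is only an example, not a tuned recommendation, and it will not help if the model genuinely does not fit):

```bash
# Limit the size of blocks the CUDA caching allocator will split, to reduce
# fragmentation; export this before launching torchrun.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```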
My script is as follows:
```bash
torchrun --nproc_per_node=4 --master_port=12345 train.py \
    --model_name_or_path ../llama-7b-hf \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True
```
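For what it's worth, combining this script with the 8-bit optimizer suggested earlier would look roughly like the sketch below; the only changes are the process count (to match the 8x 3090 setup) and the added `--optim` flag, and I can't promise this alone is enough to fit the 7B model:

```bash
# Same launch as the script above, with two changes (a sketch, not a guaranteed fix):
#   * --nproc_per_node=8 to use all eight 3090s
#   * --optim adamw_bnb_8bit to switch to the 8-bit AdamW from bitsandbytes
torchrun --nproc_per_node=8 --master_port=12345 train.py \
    --model_name_or_path ../llama-7b-hf \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --optim adamw_bnb_8bit \
    --tf32 True
```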