Ahtesham00 opened this issue 1 year ago
Getting this error while using a single A100 80GB while loading llama-7b.
I tried reducing the batch size and also changing --gradient_accumulation_steps, but I was not able to work it out.
I was able to run it in one configuration, when I used model().cuda().half(), but when I tested the saved model it output something like "?? ?? ?? ?? ?? ??" instead of generating text.
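For reference, a minimal sketch (not from the original post) of how the saved checkpoint could be reloaded and smoke-tested for this "??" symptom; it assumes the model and tokenizer were saved by the HF Trainer into ./Dmodel7b and that loading in bf16 (rather than calling .half()) is acceptable:

```python
# Hedged sketch: reload the saved checkpoint and run a short generation to see
# whether the "??" output reproduces. Paths follow the training command below;
# the prompt text is only an illustrative placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./Dmodel7b"  # output_dir from the training command
tokenizer = AutoTokenizer.from_pretrained(ckpt)
# bf16 avoids the fp16 overflow that .half() can cause with LLaMA weights,
# which typically shows up as NaN logits and garbage ("??") decodes.
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16).cuda()
model.eval()

prompt = "### Instruction:\nSay hello.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```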
Error
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 774.00 MiB (GPU 0; 80.00 GiB total capacity; 71.96 GiB already allocated; 791.50 MiB free; 72.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
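The error message itself points at PYTORCH_CUDA_ALLOC_CONF / max_split_size_mb. A hedged way to try that is to set the variable before any CUDA allocation, e.g. at the very top of train7b.py (the value 128 below is only an illustrative starting point, not something tested in this thread):

```python
# Must run before the first CUDA tensor is allocated, otherwise the setting
# is ignored by the caching allocator.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

Note that fragmentation relief alone may not be enough here, since usable memory is already at ~72 GiB of the 80 GiB card.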
I am using the following command to execute the script:
torchrun --nproc_per_node=1 --master_port=5050 train7b.py --model_name_or_path ./7bWeights/llama-7b --data_path ./alpaca_data_few.json --bf16 True --output_dir ./Dmodel7b --num_train_epochs 3 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' --tf32 True
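Assuming train7b.py is built on Hugging Face TrainingArguments like the stock Alpaca train.py (an assumption, since the script is not shown in this issue), two commonly used memory savers are activation checkpointing and FSDP CPU offload; a hedged variant of the same command would be:

```bash
torchrun --nproc_per_node=1 --master_port=5050 train7b.py \
    --model_name_or_path ./7bWeights/llama-7b \
    --data_path ./alpaca_data_few.json \
    --bf16 True \
    --output_dir ./Dmodel7b \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" --save_steps 2000 --save_total_limit 1 \
    --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" --logging_steps 1 \
    --fsdp "full_shard auto_wrap offload" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --gradient_checkpointing True \
    --tf32 True
```

Whether --gradient_checkpointing plays well with FSDP depends on the transformers version, so treat this as a sketch rather than a verified fix.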
[Screenshot: testing saved model results]
Did you solve this error while using a single A100 80GB?
No, not yet. I did not find any solution. Now I am looking to get more GPUs to train it.
Regarding usage of the saved model: it seems like I am getting a corrupted saved model.
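One quick, hedged way to confirm that suspicion is to load the checkpoint on CPU and scan the weights for NaN/Inf, which is the usual aftermath of an fp16 (.half()) training run:

```python
# Hedged sketch: check the saved weights in ./Dmodel7b for non-finite values.
# Loads on CPU; a 7B model needs roughly 14-28 GB of RAM depending on dtype.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./Dmodel7b", torch_dtype=torch.float16)
bad = [name for name, p in model.named_parameters() if not torch.isfinite(p).all()]
print("parameters containing NaN/Inf:", bad or "none")
```

If any layer shows up here, the checkpoint itself is corrupted and no amount of generation-time tweaking will fix the "??" output.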