hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs

Out of Memory Error on Sagemaker while training LLava on 93000 images #4562

Closed · Hassaan68 closed this 2 days ago

Hassaan68 commented 2 days ago

System Info

Out of memory error during tokenization. I tried streaming and hit the same issue with streaming: true and max_steps: 10000. I am fine-tuning LLaVA on 93,000 images, and the tokenizer reports a "No space left on device" error after tokenizing around 52,000 images. At that point my SageMaker cache has grown to 75 GB, which fills the disk. How can I work around this issue?
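A possible workaround (not from the thread, just a sketch): point the Hugging Face caches at a larger volume before launching training. HF_HOME and HF_DATASETS_CACHE are standard Hugging Face environment variables; the target path below is only an example and depends on how your SageMaker instance is mounted.

# Example paths only; adjust to whatever large volume your instance actually mounts.
export HF_HOME=/home/sagemaker-user/SageMaker/hf_cache
export HF_DATASETS_CACHE=$HF_HOME/datasets
df -h $HF_HOME    # confirm the target volume has room for the preprocessed dataset
# then launch llamafactory-cli train as shown below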

The command

llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path llava-hf/llava-1.5-7b-hf \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template vicuna \
    --flash_attn fa2 \
    --visual_inputs True \
    --dataset_dir data \
    --dataset icentia11k \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 10.0 \
    --max_steps 10000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves/LLaVA1.5-7B-Chat/lora/train_2024-06-26-11-09-00 \
    --fp16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout 0 \
    --use_dora True \
    --lora_target all \
    --streaming True

Reproduction

There should not be a memory issue; the model should tokenize each image right before it is used rather than tokenizing all images up front.

Expected behavior

Should be able to tokenize a large number of images.

Others

No response

hiyouga commented 2 days ago

Remove --include_num_input_tokens_seen True.

Hassaan68 commented 1 day ago

@hiyouga, thank you :) It has improved, but it still does not completely avoid the cache. I am also using --overwrite_cache True in my command, but the datasets library is still using a huge amount of cache space, as you can see below:

17G     /home/sagemaker-user/.cache/huggingface/hub
43G     /home/sagemaker-user/.cache/huggingface/datasets
60G     /home/sagemaker-user/.cache/huggingface
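For reference (not part of the original exchange), the hub and datasets caches listed above can be inspected and cleared between runs with standard Hugging Face tooling; note that huggingface-cli scan-cache and delete-cache only manage the hub cache, so the datasets cache has to be removed by hand:

du -sh ~/.cache/huggingface/*           # see which cache is growing
huggingface-cli scan-cache              # list cached hub repos (models, tokenizers)
huggingface-cli delete-cache            # interactively free hub cache entries
rm -rf ~/.cache/huggingface/datasets    # drop the Arrow files written during preprocessing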