Closed Hassaan68 closed 4 days ago
Try dataset streaming: `streaming: true`
@hiyouga I am still facing the issue with `streaming: true` and `max_steps: 10000`. I am fine-tuning LLaVA on 93,000 images, and the tokenizer reports a "No space left on device" error after tokenizing around 52,000 images. I can see that my SageMaker cache has grown to 75 GB by that point, filling the disk. How can I counter this issue?
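One general workaround for the disk filling up (not specific to LLaMA-Factory) is to redirect the Hugging Face datasets cache to a volume with more free space before launching training. `HF_DATASETS_CACHE` is a standard Hugging Face environment variable; the path below is only an example and should be replaced with a mount that has enough room:

```shell
# Point the Hugging Face datasets cache at a larger volume.
# /opt/ml is where SageMaker typically mounts attached storage;
# the exact path here is an example, not a requirement.
export HF_DATASETS_CACHE=/opt/ml/data/hf_cache
mkdir -p "$HF_DATASETS_CACHE"
```

Periodically clearing old cache entries under that directory between runs can also keep the footprint bounded.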
Full Command:
```shell
llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path llava-hf/llava-1.5-7b-hf \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template vicuna \
    --flash_attn fa2 \
    --visual_inputs True \
    --dataset_dir data \
    --dataset icentia11k \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 10.0 \
    --max_steps 10000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves/LLaVA1.5-7B-Chat/lora/train_2024-06-26-11-09-00 \
    --fp16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout 0 \
    --use_dora True \
    --lora_target all \
    --streaming True
```
Reminder
System Info
I am using 8 GPUs to fine-tune LLaVA-1.5-7B-Chat on more than 8,000 images, but the tokenizer tries to tokenize all of the images at once, causing a memory error. 8,300 is the maximum number of images I am able to train on.
Reproduction
Finetune LLava on more than 8000 images
Expected behavior
There should be a distributed way to tokenize and load the images one by one.
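The requested behavior is essentially lazy, on-the-fly tokenization. As a plain-Python sketch (not LLaMA-Factory's actual pipeline; the loader and tokenizer below are placeholders), a generator chain processes one sample at a time instead of materializing the whole tokenized corpus up front:

```python
def read_samples(paths):
    """Yield raw samples one at a time instead of loading them all."""
    for path in paths:
        # Placeholder loader: a real pipeline would read the image
        # and its caption from disk here.
        yield {"path": path, "text": f"caption for {path}"}

def tokenize(sample):
    """Stand-in tokenizer: split the caption on whitespace."""
    sample["input_ids"] = sample["text"].split()
    return sample

def lazy_tokenized(paths):
    # Nothing is tokenized until the consumer iterates, so peak
    # memory/disk use is one sample, not the whole dataset.
    for sample in read_samples(paths):
        yield tokenize(sample)

batch = list(lazy_tokenized(["img_0.png", "img_1.png"]))
print(len(batch))              # 2
print(batch[0]["input_ids"])   # ['caption', 'for', 'img_0.png']
```

This is the same principle behind `streaming: true` with lazily applied `map()` in Hugging Face `datasets`; the difference in practice is where intermediate results are cached.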
Others
No response