modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

OOM when tokenizing datasets #1971

Closed: SparrowZheyuan18 closed this issue 1 month ago

SparrowZheyuan18 commented 2 months ago

Describe the bug

I encountered an OOM error when trying to run DPO on MiniCPM-Llama3-V-2.5 with my own dataset on 4 RTX 6000 Ada GPUs. The OOM seems to occur in the following part of the code:

train_dataset, val_dataset = get_preprocessed_rlhf_dataset(
    train_dataset,
    val_dataset,
    template=template,
    rlhf_type=args.rlhf_type,
    vision_keys=vision_keys,
    max_length=args.max_length,
    max_prompt_length=args.max_prompt_length,
    truncation_mode=args.truncation_mode,
    streaming=streaming,
    is_encoder_decoder=is_encoder_decoder,
    **preprocess_kwargs)

Why is this happening? Do you have any solution to this? Thanks!

My training script:

nproc_per_node=2

CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
swift rlhf \
    --rlhf_type dpo \
    --model_type minicpm-v-v2_5-chat \
    --model_id_or_path workspace/MiniCPM-V/merged_MiniCPM-Llama3-V-2_5 \
    --ref_model_type minicpm-v-v2_5-chat \
    --ref_model_id_or_path workspace/MiniCPM-V/merged_MiniCPM-Llama3-V-2_5 \
    --sft_type lora \
    --tuner_backend swift \
    --dtype AUTO \
    --output_dir output/minicpm_dpo \
    --dataset /workspace/DPO/data/dpo_data.jsonl \
    --beta 0.1 \
    --sft_beta 0.1 \
    --num_train_epochs 4 \
    --max_length 1200 \
    --max_prompt_length 512 \
    --check_dataset_strategy none \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --lora_target_modules DEFAULT \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.1 \
    --learning_rate 5e-5 \
    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
    --max_grad_norm 1.0 \
    --warmup_ratio 0.03 \
    --eval_steps 2000 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --use_flash_attn true

Your hardware and system info

Additional context

SparrowZheyuan18 commented 2 months ago

When I use DeepSpeed, another error occurs during DeepSpeed initialization:

File "/workspace/lib/python3.10/site-packages/deepspeed/runtime/lr_schedules.py", line 814, in __init__
        self.warmup_num_steps = max(2, warmup_num_steps)

TypeError: '>' not supported between instances of 'str' and 'int'
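
For reference, the failing comparison can be reproduced in isolation. This is a hypothetical one-liner, not taken from any script here; it only assumes that warmup_num_steps reaches the scheduler as the unresolved string "auto" from the config:

# illustrative only: comparing an int with a string raises the same TypeError
python3 -c 'warmup_num_steps = "auto"; print(max(2, warmup_num_steps))'
# TypeError: '>' not supported between instances of 'str' and 'int'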

I've found the same issue in https://github.com/LianjiaTech/BELLE/issues/558. The script I use is:

nproc_per_node=4

CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
swift rlhf \
    --rlhf_type dpo \
    --model_type  minicpm-v-v2_5-chat \
    --model_id_or_path /workspace/MiniCPM-V/merged_MiniCPM-Llama3-V-2_5 \
    --ref_model_type  minicpm-v-v2_5-chat \
    --ref_model_id_or_path /workspace/MiniCPM-V/merged_MiniCPM-Llama3-V-2_5 \
    --sft_type  lora \
    --tuner_backend  swift \
    --dtype  AUTO  \
    --output_dir  output/minicpm_dpo  \
    --dataset  /workspace/DPO/data/dpo_data.jsonl  \
    --beta 0.1 \
    --sft_beta 0.1 \
    --num_train_epochs  4  \
    --max_length  1200  \
    --max_prompt_length  512  \
    --check_dataset_strategy  none  \
    --lora_rank  8  \
    --lora_alpha  32  \
    --lora_dropout  0.05  \
    --lora_target_modules  ALL  \
    --gradient_checkpointing  true  \
    --batch_size  1  \
    --weight_decay  0.1  \
    --learning_rate  5e-5  \
    --gradient_accumulation_steps  $(expr 16 / $nproc_per_node)  \
    --max_grad_norm  1.0  \
    --warmup_ratio  0.03  \
    --eval_steps  2000  \
    --save_steps  100  \
    --save_total_limit  2  \
    --logging_steps  10 \
    --use_flash_attn true \
    --deepspeed zero3-offload

tastelikefeet commented 2 months ago

Use --tuner_backend peft --lazy_tokenize true. And may I ask how many rows are in the data file? The DeepSpeed problem is because our DeepSpeed config has fields whose values are "auto"; this will be fixed today.
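
For reference, a trimmed-down sketch of the command with the suggested changes applied: the tuner backend switches from swift to peft and lazy tokenization is enabled, while the model path, dataset and the remaining LoRA/optimizer flags stay as in the scripts above (omitted here for brevity):

swift rlhf \
    --rlhf_type dpo \
    --model_type minicpm-v-v2_5-chat \
    --model_id_or_path /workspace/MiniCPM-V/merged_MiniCPM-Llama3-V-2_5 \
    --sft_type lora \
    --tuner_backend peft \
    --lazy_tokenize true \
    --dataset /workspace/DPO/data/dpo_data.jsonl \
    --output_dir output/minicpm_dpo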

SparrowZheyuan18 commented 2 months ago

In this case, I have only 2000 rows in the dataset. Thanks for your help :)

SparrowZheyuan18 commented 2 months ago

I used peft as the tuner backend with lazy_tokenize enabled, and encountered another problem:

Traceback (most recent call last):
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/swift/cli/rlhf.py", line 5, in <module>
    rlhf_main()
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/swift/utils/run_utils.py", line 32, in x_main
    result = llm_x(args, **kwargs)
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/swift/llm/rlhf.py", line 270, in llm_rlhf
    trainer.train(training_args.resume_from_checkpoint)
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/swift/trainers/dpo_trainer.py", line 98, in train
    res = super().train(*args, **kwargs)
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/swift/trainers/mixin.py", line 426, in train
    res = super().train(resume_from_checkpoint, *args, **kwargs)
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/transformers/trainer.py", line 2236, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/accelerate/data_loader.py", line 557, in __iter__
    current_batch = next(dataloader_iter)
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/swift/trainers/utils.py", line 423, in new_call
    to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
  File "/workspace/miniconda3/envs/geoguessr/lib/python3.10/site-packages/swift/trainers/utils.py", line 423, in <listcomp>
    to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
ValueError: expected sequence of length 14336 at dim 4 (got 14490)

This seems to occur because of padding.
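
For context, that final ValueError is the generic error torch raises when asked to build a tensor from a ragged nested sequence. A minimal, hypothetical reproduction (the lengths are borrowed from the message above; the reported dim differs because the real feature has more dimensions):

# illustrative only: ragged inner lengths cannot be converted to a single tensor
python3 -c 'import torch; torch.tensor([[0.0] * 14336, [0.0] * 14490])'
# ValueError: expected sequence of length 14336 at dim 1 (got 14490)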

Jintao-Huang commented 1 month ago

Fixed. Please use the main branch for experimentation.