modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Reading data in streaming mode, GPU memory utilization is very low #1939

Open · guozhiyao opened this issue 2 months ago

guozhiyao commented 2 months ago

My training set is very large (over a million samples), so loading it all at once for training causes OOM. I switched to streaming mode to read the data, but training is now very slow.


GPU utilization is very low:


The CPU is completely maxed out:


Training arguments:

SftArguments(train_type='sft', model_type='internvl2-8b', model_revision='master', full_determinism=False, sft_type='lora', freeze_parameters=[], freeze_vit=False, freeze_parameters_ratio=0.0, additional_trainable_parameters=[], tuner_backend='peft', template_type='internvl2', add_output_dir_suffix=True, ddp_backend='nccl', ddp_find_unused_parameters=None, ddp_broadcast_buffers=None, ddp_timeout=1800, seed=42, resume_from_checkpoint=None, resume_only_model=False, ignore_data_skip=False, dtype='bf16', packing=False, train_backend='transformers', tp=1, pp=1, min_lr=None, sequence_parallel=False, model_kwargs=None, loss_name=None, val_dataset=[], dataset_seed=42, dataset_test_ratio=0, use_loss_scale=False, loss_scale_config_path='/root/.local/lib/python3.10/site-packages/swift/llm/agent/default_loss_scale_config.json', system=None, tools_prompt='react_en', max_length=32768, truncation_strategy='delete', check_dataset_strategy='none', streaming=True, streaming_val_size=0, streaming_buffer_size=16384, model_name=[None, None], model_author=[None, None], quant_method=None, quantization_bit=0, hqq_axis=0, hqq_dynamic_config_path=None, bnb_4bit_comp_dtype='bf16', bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, rescale_image=-1, target_modules='^(language_model|mlp1)(?!.*(lm_head|output|emb|wte|shared)).*', target_regex=None, modules_to_save=[], lora_rank=8, lora_alpha=32, lora_dropout=0.05, lora_bias_trainable='none', lora_dtype='AUTO', lora_lr_ratio=None, use_rslora=False, use_dora=False, init_lora_weights='true', fourier_n_frequency=2000, fourier_scaling=300.0, rope_scaling=None, boft_block_size=4, boft_block_num=0, boft_n_butterfly_factor=1, boft_dropout=0.0, vera_rank=256, vera_projection_prng_key=0, vera_dropout=0.0, vera_d_initial=0.1, adapter_act='gelu', adapter_length=128, use_galore=False, galore_target_modules=None, galore_rank=128, galore_update_proj_gap=50, galore_scale=1.0, galore_proj_type='std', galore_optim_per_parameter=False, galore_with_embedding=False, galore_quantization=False, galore_proj_quant=False, galore_proj_bits=4, galore_proj_group_size=256, galore_cos_threshold=0.4, galore_gamma_proj=2, galore_queue_size=5, adalora_target_r=8, adalora_init_r=12, adalora_tinit=0, adalora_tfinal=0, adalora_deltaT=1, adalora_beta1=0.85, adalora_beta2=0.85, adalora_orth_reg_weight=0.5, ia3_feedforward_modules=[], llamapro_num_new_blocks=4, llamapro_num_groups=None, neftune_noise_alpha=None, neftune_backend='transformers', lisa_activated_layers=0, lisa_step_interval=20, reft_layer_key=None, reft_layers=None, reft_rank=4, reft_intervention_type='LoreftIntervention', reft_args=None, use_liger=False, gradient_checkpointing=True, deepspeed=None, batch_size=1, eval_batch_size=1, auto_find_batch_size=False, num_train_epochs=1, max_steps=270100.625, optim='adamw_torch', adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, learning_rate=0.0001, weight_decay=0.1, gradient_accumulation_steps=1, max_grad_norm=1, predict_with_generate=False, lr_scheduler_type='cosine', lr_scheduler_kwargs={}, warmup_ratio=0.05, warmup_steps=0, eval_steps=500, save_steps=500, save_only_model=True, save_total_limit=None, logging_steps=1, acc_steps=1, dataloader_num_workers=0, dataloader_pin_memory=True, dataloader_drop_last=False, push_to_hub=False, hub_model_id=None, hub_token=None, hub_private_repo=False, hub_strategy='every_save', test_oom_error=False, disable_tqdm=False, lazy_tokenize=None, preprocess_num_proc=1, use_flash_attn=True, ignore_args_error=False, 
check_model_is_latest=True, report_to=['tensorboard'], acc_strategy='token', save_on_each_node=False, evaluation_strategy='steps', save_strategy='steps', save_safetensors=True, gpu_memory_fraction=None, include_num_input_tokens_seen=False, local_repo_path=None, custom_register_path=None, custom_dataset_info=None, device_map_config=None, device_max_memory=[], max_new_tokens=2048, do_sample=None, temperature=None, top_k=None, top_p=None, repetition_penalty=None, num_beams=1, fsdp='', fsdp_config=None, sequence_parallel_size=1, model_layer_cls_name=None, metric_warmup_step=0, fsdp_num=1, per_device_train_batch_size=None, per_device_eval_batch_size=None, eval_strategy=None, self_cognition_sample=0, train_dataset_mix_ratio=0.0, train_dataset_mix_ds=['ms-bench'], train_dataset_sample=-1, val_dataset_sample=None, safe_serialization=None, only_save_model=None, neftune_alpha=None, deepspeed_config_path=None, model_cache_dir=None, lora_dropout_p=None, lora_target_modules=[], lora_target_regex=None, lora_modules_to_save=[], boft_target_modules=[], boft_modules_to_save=[], vera_target_modules=[], vera_modules_to_save=[], ia3_target_modules=[], ia3_modules_to_save=[], custom_train_dataset_path=[], custom_val_dataset_path=[], device_map_config_path=None, push_hub_strategy=None)

How should I configure things in this situation to speed up training?
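For context on what the streaming flag implies (an assumption here: that ms-swift's streaming option maps onto Hugging Face `datasets` streaming mode; the file name below is made up), streaming a JSONL dataset corresponds roughly to:

    from datasets import load_dataset

    # streaming=True returns an IterableDataset: examples are read and preprocessed
    # lazily during training instead of being loaded and tokenized up front, which
    # avoids the OOM but shifts data loading onto the CPU while the GPUs wait.
    ds = load_dataset("json", data_files="train.jsonl", split="train", streaming=True)
    first = next(iter(ds))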

guozhiyao commented 1 month ago

@Jintao-Huang @tastelikefeet Could you please take a look at this issue?

tastelikefeet commented 1 month ago

I can reproduce this issue; I'm looking into it now.

tastelikefeet commented 1 month ago

I reproduced this on my side. In my experiments, training speed is about the same with streaming=true and streaming=false, and I did not see the CPU being maxed out. My environment is two A100s, with the following arguments:

                "--nproc_per_node=2",
                "llm_sft.py",
                "--model_type", "internvl2-8b",
                "--sft_type", "lora",
                "--preprocess_num_proc", "24",
                "--model_id_or_path", "/mnt/workspace/yzhao/tastelikefeet/InternVL2-8B",
                "--dataset", "llava-pretrain#400",
                "--eval_steps", "10000",
                "--save_steps", "10000",
                "--batch_size", "2",
                "--dataloader_num_workers", "4",
                "--lazy_tokenize", "false",
                "--streaming", "true",
                "--max_steps", "10000",
                "--ignore_args_error", "true",

The CPU being maxed out earlier was caused by another process; after killing it, CPU usage is around 15% and GPU utilization is above 90%.

guozhiyao commented 1 month ago

> I reproduced this on my side. In my experiments, training speed is about the same with streaming=true and streaming=false, and I did not see the CPU being maxed out. [...]

What might be causing it in my case, then?

guozhiyao commented 1 month ago

@tastelikefeet Also, my training speed changes with the number of GPUs: with 16 A100s, 'train_speed(iter/s)': 0.021858; with 64 A100s, 'train_speed(iter/s)': 0.002796.

guozhiyao commented 1 month ago

I tested with 2 A100s: 'train_speed(iter/s)': 0.313754


That looks much more normal, but the more GPUs I use, the lower the efficiency. @tastelikefeet
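One plausible explanation for throughput dropping as more GPUs are added (an assumption, not confirmed in this thread) is how Hugging Face `datasets` splits a streamed IterableDataset across processes. `split_dataset_by_node` can only hand out whole shards; when the streamed source has fewer shards than ranks (for example a single large JSONL file), every rank iterates the entire stream and merely skips the examples belonging to other ranks, so per-rank data-loading cost stays constant while useful throughput shrinks with world size. A minimal sketch of that behavior, assuming a single-file JSONL dataset named train.jsonl:

    from datasets import load_dataset
    from datasets.distributed import split_dataset_by_node

    # Streaming one JSONL file yields an IterableDataset backed by a single shard.
    ds = load_dataset("json", data_files="train.jsonl", split="train", streaming=True)
    print(ds.n_shards)  # 1

    # With 64 ranks and only 1 shard, whole shards cannot be distributed, so each
    # rank reads the full stream and keeps only every 64th example: CPU-side I/O
    # and preprocessing per rank stay the same while most of the work is discarded.
    ds_rank0 = split_dataset_by_node(ds, rank=0, world_size=64)
    for i, example in enumerate(ds_rank0):
        if i >= 2:
            break
        print(example.keys())

If that is the cause here, anything that raises n_shards to a multiple of the world size (e.g. many source files) would let each rank read only its own slice of the data.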

Nina0109 commented 1 month ago

I ran into the same problem. My environment is 8 H100s, and the training command is below. With the same data file, downloaded locally, training runs at about 1.3 it/s without streaming and becomes 5-6x slower with streaming enabled.

    NPROC_PER_NODE=8 \
    MASTER_PORT=8888 \
    swift sft \
      --model_type internvl2-4b \
      --model_id_or_path /mnt/bn/gecom-scl-cvnlp-public-v2/jnyang/models/InternVL2-4B \
      --num_train_epochs 20 \
      --sft_type full \
      --max_length 8000 \
      --dataset test_image_anno_6w_v2_mini.jsonl \
      --output_dir debug_dummy_speed \
      --report_to wandb \
      --use_flash_attn True \
      --streaming True \
      --max_steps 10000
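Following on from the single-shard note above, one possible workaround (hypothetical, not verified against ms-swift) is to pre-split the large JSONL file into many shard files before training, so that a streamed load exposes an n_shards value divisible by the number of ranks and each rank reads only its own files. The file names below are made up:

    def shard_jsonl(src: str = "train.jsonl", n_shards: int = 64) -> None:
        """Round-robin the lines of one large JSONL file into n_shards smaller files."""
        outs = [open(f"train_shard_{i:03d}.jsonl", "w") for i in range(n_shards)]
        try:
            with open(src) as f:
                for i, line in enumerate(f):
                    outs[i % n_shards].write(line)
        finally:
            for o in outs:
                o.close()

    shard_jsonl()

The resulting shard files can then be supplied together as the training data instead of the single large file.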