Streaming dataset loading is very slow to start on a large dataset; training hangs at startup

jiejie1993 commented 8 months ago

Reminder

[X] I have read the README and searched the existing issues.

Reproduction

deepspeed -H ${HOST_FILE} src/train_bash.py \
    --deepspeed config/gyj_ds_config_zero2_xverse.json \
    --stage pt \
    --model_name_or_path ${PRETRAINED_MODEL_PATH} \
    --do_train \
    --dataset_dir ${DATASET_DIR} \
    --dataset ${DATA_FILES} \
    --finetuning_type full \
    --output_dir ${SAVE_DIR} \
    --per_device_train_batch_size 12 \
    --per_device_eval_batch_size 12 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 192 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --eval_steps 200 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --num_train_epochs 1.0 \
    --val_size 20000 \
    --evaluation_strategy steps \
    --plot_loss \
    --bf16 \
    --overwrite_output_dir \
    --flash_attn \
    --cutoff_len ${SEQ_LENGTH} \
    --ddp_timeout 180000 \
    --streaming True \
    --save_total_limit 3 \
    --warmup_steps 100 \
    --save_on_each_node False \
    --max_steps 46128

Expected behavior

The dataset is loaded in streaming mode and training proceeds.

System Info

transformers version: 4.34.1
Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python version: 3.10.6
Huggingface_hub version: 0.17.3
Safetensors version: 0.4.0
Accelerate version: 0.25.0
Accelerate config: not found
PyTorch version (GPU?): 2.0.0 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Others

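With a large dataset, I prepared a copy of the data on every node and load it in streaming mode, but training hangs at startup and never proceeds. Is this normal? More generally, how should pretraining be done on a dataset this large (1T): should the data be split into shards, or streamed directly?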
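For readers weighing the two options in the question, the following is a minimal sketch of how sharding and streaming can be combined. It is not LLaMA-Factory's implementation. It assumes the corpus is already split into plain-text files with one document per line, and the helper names `shard_files`, `stream_lines`, and `demo` are hypothetical. Each rank takes a disjoint subset of the files and reads them lazily, so no process needs to scan the whole corpus before training starts.

```python
import os
import tempfile
from typing import Iterator, List


def shard_files(files: List[str], rank: int, world_size: int) -> List[str]:
    """Give each rank every world_size-th file, starting at its own rank (hypothetical helper)."""
    return sorted(files)[rank::world_size]


def stream_lines(files: List[str]) -> Iterator[str]:
    """Yield one non-empty line at a time so that no file is read into memory whole."""
    for path in files:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    yield line


def demo() -> List[List[str]]:
    """Write four tiny shard files and read them back as two hypothetical ranks."""
    with tempfile.TemporaryDirectory() as tmp:
        files = []
        for i in range(4):
            path = os.path.join(tmp, f"shard_{i}.txt")
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(f"doc {i}a\ndoc {i}b\n")
            files.append(path)
        # Materialise inside the context manager, before the temp files are deleted.
        return [list(stream_lines(shard_files(files, rank, 2))) for rank in range(2)]


if __name__ == "__main__":
    for rank, docs in enumerate(demo()):
        print(rank, docs)
```

In a real job the rank and world size would come from the launcher's environment rather than a loop, and the lines would be tokenized on the fly instead of collected into a list.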

jiejie1993 commented 8 months ago

Does streaming loading support multiple dataset sources? For example, --dataset=wiki_demo1,wiki_demo2?

hiyouga commented 8 months ago

Multiple datasets are supported.
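As an illustration of what combining several streamed sources can look like, here is a minimal sketch that alternates between lazy iterators without loading any of them fully. This thread does not say how LLaMA-Factory mixes the datasets internally, so the `interleave` function below is a hypothetical stand-in. The Hugging Face `datasets` library provides `interleave_datasets` for a similar purpose.

```python
from typing import Iterable, Iterator, List


def interleave(sources: List[Iterable[str]]) -> Iterator[str]:
    """Round-robin over several lazy sources, dropping each one once it is exhausted."""
    iterators = [iter(source) for source in sources]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                # This source has run out; leave it out of the next round.
                continue
            still_active.append(it)
        iterators = still_active


def fake_stream(name: str, count: int) -> Iterator[str]:
    """Stand-in for a streamed dataset: yields `count` labelled samples lazily."""
    for index in range(count):
        yield f"{name}:{index}"


if __name__ == "__main__":
    # The dataset names are taken from the question above; the contents are made up.
    mixed = interleave([fake_stream("wiki_demo1", 3), fake_stream("wiki_demo2", 1)])
    for sample in mixed:
        print(sample)
```

Round-robin is only one possible mixing policy. Sampling sources with fixed probabilities is a common alternative when the datasets differ greatly in size.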

Eleanor456 commented 6 months ago

Has this problem been solved?

ZhuoruiLiu12 commented 6 months ago

I ran into the same problem. Have you solved it? In my case, training gets stuck at the model-loading stage and never moves on.

KLGR123 commented 5 months ago

Same here. This streaming option is completely unusable.

hiyouga / LLaMA-Factory