alf-wangzhi opened this issue 1 month ago

System Info
absl-py 2.1.0 accelerate 0.34.2 aiofiles 23.2.1 aiohttp 3.9.1 aiosignal 1.3.1 annotated-types 0.6.0 anyio 4.6.2.post1 apex 0.1 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 asttokens 2.4.1 astunparse 1.6.3 async-timeout 4.0.3 attrs 23.2.0 audioread 3.0.1 av 13.1.0 beautifulsoup4 4.12.3 bleach 6.1.0 blis 0.7.11 cachetools 5.3.2 catalogue 2.0.10 certifi 2024.2.2 cffi 1.16.0 charset-normalizer 3.3.2 click 8.1.7 cloudpathlib 0.16.0 cloudpickle 3.0.0 cmake 3.28.1 comm 0.2.1 confection 0.1.4 contourpy 1.2.0 cubinlinker 0.3.0+2.g405ac64 cuda-python 12.3.0rc4+9.gdb8c48a.dirty cudf 23.12.0 cugraph 23.12.0 cugraph-dgl 23.12.0 cugraph-service-client 23.12.0 cugraph-service-server 23.12.0 cuml 23.12.0 cupy-cuda12x 12.3.0 cycler 0.12.1 cymem 2.0.8 Cython 3.0.8 dask 2023.11.0 dask-cuda 23.12.0 dask-cudf 23.12.0 datasets 2.21.0 debugpy 1.8.1 decorator 5.1.1 deepspeed 0.14.4 defusedxml 0.7.1 dill 0.3.8 diskcache 5.6.3 distributed 2023.11.0 distro 1.9.0 dm-tree 0.1.8 docstring_parser 0.16 einops 0.7.0 exceptiongroup 1.2.0 execnet 2.0.2 executing 2.0.1 expecttest 0.1.3 fastapi 0.115.2 fastjsonschema 2.19.1 fastrlock 0.8.2 ffmpy 0.4.0 filelock 3.13.1 fire 0.7.0 fonttools 4.48.1 frozenlist 1.4.1 fsspec 2023.12.2 gast 0.5.4 gguf 0.10.0 google-auth 2.27.0 google-auth-oauthlib 0.4.6 gradio 4.44.1 gradio_client 1.3.0 graphsurgeon 0.4.6 grpcio 1.60.1 h11 0.14.0 hjson 3.1.0 httpcore 1.0.6 httptools 0.6.4 httpx 0.27.2 huggingface-hub 0.26.1 hypothesis 5.35.1 idna 3.6 importlib-metadata 7.0.1 importlib_resources 6.4.5 iniconfig 2.0.0 intel-openmp 2021.4.0 interegular 0.3.3 ipykernel 6.29.2 ipython 8.21.0 ipython-genutils 0.2.0 jedi 0.19.1 jieba 0.42.1 Jinja2 3.1.3 jiter 0.6.1 joblib 1.3.2 json5 0.9.14 jsonschema 4.21.1 jsonschema-specifications 2023.12.1 jupyter_client 8.6.0 jupyter_core 5.7.1 jupyter-tensorboard 0.2.0 jupyterlab 2.3.2 jupyterlab_pygments 0.3.0 jupyterlab-server 1.2.0 jupytext 1.16.1 kiwisolver 1.4.5 langcodes 3.3.0 lark 1.2.2 lazy_loader 0.3 librosa 0.10.1 llamafactory 0.9.1.dev0 /app llvmlite 0.43.0 lm-format-enforcer 0.10.6 locket 1.0.0 Markdown 3.5.2 markdown-it-py 3.0.0 MarkupSafe 2.1.4 matplotlib 3.8.2 matplotlib-inline 0.1.6 mdit-py-plugins 0.4.0 mdurl 0.1.2 mistral_common 1.4.4 mistune 3.0.2 mkl 2021.1.1
Reproduction
llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /data1/models/Qwen2___5-7B-Instruct \
    --preprocessing_num_workers 16 \
    --finetuning_type full \
    --template qwen \
    --flash_attn auto \
    --dataset_dir /data1/data/data \
    --dataset test \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 1.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir /data1/wangzhiqiang/zhapian/models/sft1 \
    --bf16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True
Expected behavior
Training gets stuck (hangs); data format / dataset_info (a sketch of a typical dataset_info entry is shown below).
Others
No response
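Since the report only mentions the data format and dataset_info, here is a minimal sketch of what a dataset_info.json entry for the test dataset typically looks like in LLaMA-Factory, assuming an alpaca-style dataset; the file name test.json and the column names are illustrative, as the actual files are not shown in the issue:

{
  "test": {
    "file_name": "test.json",
    "formatting": "alpaca",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
}

With an entry like this, the referenced file (presumably under /data1/data/data) would hold a list of records with instruction, input and output fields.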
@alf-wangzhi did you get any workaround for this?
Try setting a smaller value for preprocessing_num_workers if it is 128 or larger. That works for me.
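For reference, a minimal sketch of that workaround applied to the command from the Reproduction section; only the worker count changes, and the value 4 is just an example (any value comfortably below the machine's CPU core count should behave the same):

llamafactory-cli train \
    ... (all other arguments unchanged) ...
    --preprocessing_num_workers 4

If the hang occurs during dataset preprocessing, it may also help to try --preprocessing_num_workers 1 once to rule out multiprocessing issues.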