microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Model hangs at trainer.train() and never starts training #5655

Closed limllzu closed 2 weeks ago

limllzu commented 2 weeks ago

Describe the bug: The dataset loads without any problems, but the model hangs indefinitely at trainer.train() in finetune.py.

Package environment:

Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
absl-py 2.1.0 pypi_0 pypi
accelerate 0.30.1 pypi_0 pypi
addict 2.4.0 pypi_0 pypi
aiofiles 23.2.1 pypi_0 pypi
altair 5.3.0 pypi_0 pypi
annotated-types 0.7.0 pypi_0 pypi
anyio 4.4.0 pypi_0 pypi
attrs 23.2.0 pypi_0 pypi
binutils_impl_linux-64 2.36.1 h193b22a_2 conda-forge
binutils_linux-64 2.36 hf3e587d_10 conda-forge
bitsandbytes-cuda114 0.26.0.post2 pypi_0 pypi
blessed 1.20.0 pypi_0 pypi
blinker 1.8.2 pypi_0 pypi
blis 0.7.11 pypi_0 pypi
bzip2 1.0.8 h5eee18b_6 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
ca-certificates 2024.6.2 hbcca054_0 conda-forge
cachetools 5.3.3 pypi_0 pypi
catalogue 2.0.10 pypi_0 pypi
certifi 2024.2.2 pypi_0 pypi
charset-normalizer 3.3.2 pypi_0 pypi
click 8.1.7 pypi_0 pypi
cloudpathlib 0.16.0 pypi_0 pypi
cmake 3.25.0 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
confection 0.1.5 pypi_0 pypi
contourpy 1.2.1 pypi_0 pypi
cycler 0.12.1 pypi_0 pypi
cymem 2.0.8 pypi_0 pypi
deepspeed 0.14.4+eda5075 pypi_0 pypi
editdistance 0.6.2 pypi_0 pypi
einops 0.7.0 pypi_0 pypi
et-xmlfile 1.1.0 pypi_0 pypi
exceptiongroup 1.2.1 pypi_0 pypi
fairscale 0.4.0 pypi_0 pypi
fastapi 0.110.3 pypi_0 pypi
ffmpy 0.3.2 pypi_0 pypi
filelock 3.14.0 pypi_0 pypi
flask 3.0.3 pypi_0 pypi
fonttools 4.53.0 pypi_0 pypi
fsspec 2024.5.0 pypi_0 pypi
gcc_impl_linux-64 11.2.0 h82a94d6_16 conda-forge
gcc_linux-64 11.2.0 h39a9532_10 conda-forge
gpustat 1.1.1 pypi_0 pypi
gradio 4.26.0 pypi_0 pypi
gradio-client 0.15.1 pypi_0 pypi
grpcio 1.64.1 pypi_0 pypi
gxx_impl_linux-64 11.2.0 h82a94d6_16 conda-forge
gxx_linux-64 11.2.0 hacbe6df_10 conda-forge
h11 0.14.0 pypi_0 pypi
hjson 3.1.0 pypi_0 pypi
httpcore 1.0.5 pypi_0 pypi
httpx 0.27.0 pypi_0 pypi
huggingface-hub 0.23.2 pypi_0 pypi
idna 3.7 pypi_0 pypi
importlib-resources 6.4.0 pypi_0 pypi
install 1.3.5 pypi_0 pypi
itsdangerous 2.2.0 pypi_0 pypi
jinja2 3.1.4 pypi_0 pypi
joblib 1.4.2 pypi_0 pypi
jsonlines 4.0.0 pypi_0 pypi
jsonschema 4.22.0 pypi_0 pypi
jsonschema-specifications 2023.12.1 pypi_0 pypi
kernel-headers_linux-64 2.6.32 he073ed8_17 conda-forge
kiwisolver 1.4.5 pypi_0 pypi
langcodes 3.4.0 pypi_0 pypi
language-data 1.2.0 pypi_0 pypi
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libaio 0.9.3 pypi_0 pypi
libffi 3.4.4 h6a678d5_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgcc-devel_linux-64 11.2.0 h0952999_16 conda-forge
libgcc-ng 13.2.0 h77fa898_7 conda-forge
libgomp 13.2.0 h77fa898_7 conda-forge
libsanitizer 11.2.0 he4da1e4_16 conda-forge
libstdcxx-devel_linux-64 11.2.0 h0952999_16 conda-forge
libstdcxx-ng 13.2.0 hc0a3c3a_7 conda-forge
libuuid 1.41.5 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
lit 15.0.7 pypi_0 pypi
lxml 5.2.2 pypi_0 pypi
marisa-trie 1.1.1 pypi_0 pypi
markdown 3.6 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markdown2 2.4.10 pypi_0 pypi
markupsafe 2.1.5 pypi_0 pypi
matplotlib 3.7.4 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
more-itertools 10.1.0 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
murmurhash 1.0.10 pypi_0 pypi
ncurses 6.4 h6a678d5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
networkx 3.3 pypi_0 pypi
ninja 1.10.0 pypi_0 pypi
ninja-base 1.10.2 hd09550d_5 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
nltk 3.8.1 pypi_0 pypi
numpy 1.24.4 pypi_0 pypi
nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi
nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
nvidia-ml-py 12.535.161 pypi_0 pypi
nvidia-nccl-cu12 2.18.1 pypi_0 pypi
nvidia-nvjitlink-cu12 12.5.40 pypi_0 pypi
nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
nvitop 1.3.2 pypi_0 pypi
opencv-python-headless 4.5.5.64 pypi_0 pypi
openpyxl 3.1.2 pypi_0 pypi
openssl 3.3.1 h4ab18f5_0 conda-forge
orjson 3.10.3 pypi_0 pypi
packaging 23.2 pypi_0 pypi
pandas 2.2.2 pypi_0 pypi
peft 0.11.1 pypi_0 pypi
pillow 10.1.0 pypi_0 pypi
pip 24.0 py310h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
portalocker 2.8.2 pypi_0 pypi
preshed 3.0.9 pypi_0 pypi
protobuf 4.25.0 pypi_0 pypi
psutil 5.9.8 pypi_0 pypi
py-cpuinfo 9.0.0 pypi_0 pypi
pydantic 2.7.2 pypi_0 pypi
pydantic-core 2.18.3 pypi_0 pypi
pydub 0.25.1 pypi_0 pypi
pygments 2.18.0 pypi_0 pypi
pynvml 11.5.0 pypi_0 pypi
pyparsing 3.1.2 pypi_0 pypi
pyproject 1.3.1 pypi_0 pypi
python 3.10.14 h955ad1f_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
python-dateutil 2.9.0.post0 pypi_0 pypi
python-multipart 0.0.9 pypi_0 pypi
pytz 2024.1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
readline 8.2 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
referencing 0.35.1 pypi_0 pypi
regex 2024.5.15 pypi_0 pypi
requests 2.32.3 pypi_0 pypi
rich 13.7.1 pypi_0 pypi
rpds-py 0.18.1 pypi_0 pypi
ruff 0.4.7 pypi_0 pypi
sacrebleu 2.3.2 pypi_0 pypi
safetensors 0.4.3 pypi_0 pypi
seaborn 0.13.0 pypi_0 pypi
semantic-version 2.10.0 pypi_0 pypi
sentencepiece 0.1.99 pypi_0 pypi
setuptools 69.5.1 py310h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
shellingham 1.5.4 pypi_0 pypi
shortuuid 1.0.11 pypi_0 pypi
six 1.16.0 pypi_0 pypi
smart-open 6.4.0 pypi_0 pypi
sniffio 1.3.1 pypi_0 pypi
socksio 1.0.0 pypi_0 pypi
spacy 3.7.2 pypi_0 pypi
spacy-legacy 3.0.12 pypi_0 pypi
spacy-loggers 1.0.5 pypi_0 pypi
sqlite 3.45.3 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
srsly 2.4.8 pypi_0 pypi
starlette 0.37.2 pypi_0 pypi
sympy 1.12.1 pypi_0 pypi
sysroot_linux-64 2.12 he073ed8_17 conda-forge
tabulate 0.9.0 pypi_0 pypi
tensorboard 2.16.2 pypi_0 pypi
tensorboard-data-server 0.7.2 pypi_0 pypi
tensorboardx 1.8 pypi_0 pypi
termcolor 2.4.0 pypi_0 pypi
thinc 8.2.3 pypi_0 pypi
timm 0.9.10 pypi_0 pypi
tk 8.6.14 h39e8969_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tokenizers 0.19.1 pypi_0 pypi
tomlkit 0.12.0 pypi_0 pypi
toolz 0.12.1 pypi_0 pypi
torch 2.1.2+cu118 pypi_0 pypi
torchaudio 2.1.2+cu118 pypi_0 pypi
torchvision 0.16.2+cu118 pypi_0 pypi
tqdm 4.66.1 pypi_0 pypi
transformers 4.40.0 pypi_0 pypi
triton 2.1.0 pypi_0 pypi
typer 0.9.4 pypi_0 pypi
typing-extensions 4.8.0 pypi_0 pypi
tzdata 2024.1 pypi_0 pypi
urllib3 2.2.1 pypi_0 pypi
uvicorn 0.24.0.post1 pypi_0 pypi
wasabi 1.1.3 pypi_0 pypi
wcwidth 0.2.13 pypi_0 pypi
weasel 0.3.4 pypi_0 pypi
websockets 11.0.3 pypi_0 pypi
werkzeug 3.0.3 pypi_0 pypi
wheel 0.43.0 py310h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
xz 5.4.6 h5eee18b_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
zlib 1.2.13 h5eee18b_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

ds_report:
[2024-06-13 11:43:07,921] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-06-13 11:43:07,982] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

deepspeed_not_implemented [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]

Output:
prepare trainer
<class 'trainer.CPMTrainer'>
trainer ok

Error output:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
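The kernel warning in the log above names a concrete threshold (5.5.0) that can be checked on each node before launching. The following is only a minimal sketch of such a check, not how the library performs it: the helper names `parse_kernel_version` and `kernel_below_minimum` and the constant `RECOMMENDED_MIN` are hypothetical, and the threshold is copied from the warning text.

```python
import platform
import re

# Threshold copied from the warning text in the log; an assumption, not a library constant.
RECOMMENDED_MIN = (5, 5, 0)


def parse_kernel_version(release: str) -> tuple:
    """Extract the leading numeric version from a kernel release string.

    Missing minor or patch components are treated as 0.
    """
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part or 0) for part in match.groups())


def kernel_below_minimum(release: str, minimum: tuple = RECOMMENDED_MIN) -> bool:
    """Return True if the release is older than the given minimum version."""
    return parse_kernel_version(release) < minimum


if __name__ == "__main__":
    release = platform.release()  # kernel release string of the current machine
    status = "below" if kernel_below_minimum(release) else "at or above"
    print(f"kernel {release} is {status} the recommended minimum {RECOMMENDED_MIN}")
```

A result of "below" only confirms the condition the warning describes; it does not by itself establish that the kernel is the cause of the hang reported here.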

Code:

print("prepare trainer")
trainer = CPMTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    **data_module,
)
print(type(trainer))
print("trainer ok")
trainer.train()
trainer.save_state()
print("trainer success")
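
Since nothing is printed after "trainer ok", one way to find where the call is stuck is to have Python dump the stack of every thread if the call has not returned after a timeout. Below is a minimal sketch using the standard-library `faulthandler` module; the wrapper name `run_with_hang_watchdog` and the 600-second default are assumptions chosen for illustration and are not part of finetune.py.

```python
import faulthandler
import sys


def run_with_hang_watchdog(fn, timeout_s=600.0, file=None):
    """Call fn() and return its result.

    While fn() is running, the tracebacks of all threads are written to
    `file` (sys.stderr by default) every `timeout_s` seconds, which shows
    the line a hung call is blocked on.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        return fn()
    finally:
        # Stop the watchdog once fn() returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

With the code above, this would be used as `run_with_hang_watchdog(trainer.train)` in place of the bare `trainer.train()` call. When the script is launched as several processes, each process would write its own traceback, so the output may be interleaved.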