Closed badmic closed 3 weeks ago
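I haven't seen this one before. It is probably an environment problem.
Could you post a bit more of the error output above this point, so we can see where it is raised?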
@Jintao-Huang
100%|████████████████████████████████████████████████████████████████████████████| 9900/9900 [00:02<00:00, 3330.27it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 3429.60it/s]
100%|████████████████████████████████████████████████████████████████████████████| 9900/9900 [00:03<00:00, 3263.10it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 3359.93it/s]
100%|████████████████████████████████████████████████████████████████████████████| 9900/9900 [00:03<00:00, 3219.16it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 3388.19it/s]
[INFO:swift] The SftArguments will be saved in: /data/project/ys/swift/output/DZJ6B_base/v7-20240801-201507/sft_args.json
[INFO:swift] The Seq2SeqTrainingArguments will be saved in: /data/project/ys/swift/output/DZJ6B_base/v7-20240801-201507/training_args.json
[INFO:swift] The logging file will be saved in: /data/project/ys/swift/output/DZJ6B_base/v7-20240801-201507/logging.jsonl
rank3: Traceback (most recent call last):
rank3: File "/data/project/ys/swift/swift/cli/sft.py", line 5, in <module>
rank3: File "/data/project/ys/swift/swift/utils/run_utils.py", line 27, in x_main
rank3: result = llm_x(args, **kwargs)
rank3: File "/data/project/ys/swift/swift/llm/sft.py", line 384, in llm_sft
rank3: File "/data/project/ys/swift/swift/trainers/mixin.py", line 522, in train
rank3: res = super().train(resume_from_checkpoint, *args, **kwargs)
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
rank3: return inner_training_loop(
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/transformers/trainer.py", line 2098, in _inner_training_loop
rank3: model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/accelerate/accelerator.py", line 1303, in prepare
rank3: result = self._prepare_deepspeed(args)
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
rank3: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
rank3: engine = DeepSpeedEngine(args=args,
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 237, in __init__
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1017, in _do_sanity_check
rank3: expected_optim_types = self._supported_optims()
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1005, in _supported_optims
rank3: from fairseq.optim.fairseq_optimizer import FairseqOptimizer
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/fairseq/__init__.py", line 33, in <module>
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 549, in _set_value
rank3: data = get_structured_config_data(value)
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/_utils.py", line 233, in get_structured_config_data
rank3: return get_dataclass_data(obj)
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/_utils.py", line 176, in get_dataclass_data
rank3: d[name] = _maybe_wrap(
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/omegaconf.py", line 677, in _maybe_wrap
rank3: return _node_wrap(
rank3: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/omegaconf.py", line 642, in _node_wrap
rank3: elif issubclass(type, Enum):
rank3: TypeError: issubclass() arg 1 must be a class
rank6: Traceback (most recent call last):
rank6: File "/data/project/ys/swift/swift/cli/sft.py", line 5, in <module>
rank6: File "/data/project/ys/swift/swift/utils/run_utils.py", line 27, in x_main
rank6: result = llm_x(args, **kwargs)
rank6: File "/data/project/ys/swift/swift/llm/sft.py", line 384, in llm_sft
rank6: File "/data/project/ys/swift/swift/trainers/mixin.py", line 522, in train
rank6: res = super().train(resume_from_checkpoint, *args, **kwargs)
rank6: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
This error appeared as soon as the dataset was being processed.
Environment info:
absl-py 2.0.0 accelerate 0.33.0 addict 2.4.0 aiofiles 23.2.1 aiohttp 3.9.5 aioprometheus 23.3.0 aiosignal 1.3.1 aliyun-python-sdk-core 2.15.0 aliyun-python-sdk-kms 2.16.2 altair 5.2.0 annotated-types 0.6.0 antlr4-python3-runtime 4.8 anyio 4.2.0 appdirs 1.4.4 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 arrow 1.3.0 arxiv 2.1.0 asttokens 2.4.1 astunparse 1.6.3 async-lru 2.0.4 async-timeout 4.0.3 attrdict 2.0.1 attrs 23.1.0 auto_gptq 0.7.1 autoawq 0.2.6 autoawq_kernels 0.0.7 Babel 2.14.0 backoff 2.2.1 backports.strenum 1.3.1 beautifulsoup4 4.12.2 binpacking 1.5.2 bitarray 2.9.2 bitblas 0.0.1.dev13 bitsandbytes 0.41.3.post2 bleach 6.1.0 blessed 1.20.0 blinker 1.8.2 boto3 1.34.34 botocore 1.34.34 cachetools 5.3.2 certifi 2022.12.7 cffi 1.16.0 chardet 5.2.0 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 cmake 3.30.1 cn2an 0.5.22 colorama 0.4.6 coloredlogs 15.0.1 comm 0.2.0 contourpy 1.2.0 cpm-kernels 1.0.11 cpplint 1.6.1 crcmod 1.7 croniter 1.4.1 cryptography 42.0.5 cycler 0.12.1 Cython 3.0.10 dacite 1.8.1 DataProperty 1.0.1 datasets 2.18.0 dateutils 0.6.12 debugpy 1.8.0 decorator 5.1.1 decord 0.6.0 deepdiff 6.7.1 deepspeed 0.12.5 defusedxml 0.7.1 diffusers 0.25.0 dill 0.3.7 diskcache 5.6.3 distro 1.9.0 dnspython 2.6.1 docker-pycreds 0.4.0 docstring-parser 0.15 docutils 0.21.2 dropout-layer-norm 0.1 dtlib 0.0.0.dev2 editdistance 0.8.1 editor 1.6.6 einops 0.5.0 email_validator 2.2.0 et-xmlfile 1.1.0 evaluate 0.4.1 exceptiongroup 1.2.0 execnet 2.1.1 executing 2.0.1 fairscale 0.4.13 fairseq 0.12.2 fastapi 0.111.1 fastapi-cli 0.0.4 fastjsonschema 2.19.0 feedparser 6.0.10 ffmpy 0.3.1 filelock 3.15.4 fire 0.5.0 flash-attn 2.6.3 fonttools 4.47.2 fqdn 1.5.1 frozenlist 1.4.1 fsspec 2023.10.0 func_timeout 4.3.5 future 1.0.0 fuzzywuzzy 0.18.0 gekko 1.2.1 gitdb 4.0.11 GitPython 3.1.40 google-auth 2.25.2 google-auth-oauthlib 1.2.0 gradio 4.39.0 gradio_client 1.1.1 griffe 0.48.0 grpcio 1.60.0 h11 0.14.0 hf_transfer 0.1.6 hjson 3.1.0 hqq 0.1.8 httpcore 1.0.2 httptools 0.6.1 
httpx 0.26.0 huggingface-hub 0.23.5 humanfriendly 10.0 hydra-core 1.0.7 idna 3.4 imageio 2.34.2 immutabledict 4.2.0 importlib_metadata 8.2.0 importlib-resources 6.1.0 iniconfig 2.0.0 inquirer 3.2.3 interegular 0.3.3 ipdb 0.13.13 ipykernel 6.27.1 ipython 8.19.0 ipywidgets 8.1.1 isoduration 20.11.0 jedi 0.19.1 jieba 0.42.1 Jinja2 3.1.2 jmespath 0.10.0 joblib 1.3.2 json5 0.9.14 jsonargparse 4.27.4 jsonlines 4.0.0 jsonpointer 2.4 jsonschema 4.20.0 jsonschema-specifications 2023.11.2 jupyter 1.0.0 jupyter_client 8.6.0 jupyter-console 6.6.3 jupyter_core 5.5.1 jupyter-events 0.9.0 jupyter-lsp 2.2.1 jupyter_server 2.12.1 jupyter_server_terminals 0.5.0 jupyterlab 4.0.9 jupyterlab_pygments 0.3.0 jupyterlab_server 2.25.2 jupyterlab-widgets 3.0.9 kiwisolver 1.4.5 lagent 0.2.2 langdetect 1.0.9 lark 1.1.9 lazy_loader 0.4 Levenshtein 0.25.1 lightning 2.2.0.post0 lightning-cloud 0.5.64 lightning-utilities 0.10.1 llmuses 0.4.1 llvmlite 0.43.0 lm-eval 0.3.0 lm-format-enforcer 0.10.1 loguru 0.7.2 ltp 4.2.13 ltp-core 0.1.4 ltp-extension 0.1.13 lxml 5.2.1 Markdown 3.5.1 markdown-it-py 3.0.0 MarkupSafe 2.1.3 matplotlib 3.9.1 matplotlib-inline 0.1.6 mbstrdecoder 1.1.3 mdurl 0.1.2 mistune 3.0.2 ml-dtypes 0.4.0 mmengine 0.10.4 mmengine-lite 0.10.4 modelscope 1.16.1 more-itertools 10.2.0 mpmath 1.3.0 ms-opencompass 0.0.1 ms-swift 2.3.0.dev0 /data/project/ys/swift msgpack 1.0.7 multidict 6.0.4 multipledispatch 1.0.0 multiprocess 0.70.15 nbclient 0.9.0 nbconvert 7.13.1 nbformat 5.9.2 nest-asyncio 1.5.8 networkx 3.0 ninja 1.11.1 nltk 3.8 notebook 7.0.6 notebook_shim 0.2.3 numba 0.60.0 numexpr 2.10.0 numpy 1.26.4 nvidia-cublas-cu11 11.11.3.6 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu11 11.8.87 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvcc-cu11 11.8.89 nvidia-cuda-nvrtc-cu11 11.8.89 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu11 11.8.89 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu11 8.9.6.50 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu11 10.9.0.58 nvidia-cufft-cu12 
11.0.2.54 nvidia-curand-cu11 10.3.0.86 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu11 11.4.1.48 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu11 11.7.5.86 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.555.43 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.5.82 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 omegaconf 2.0.0 openai 1.37.1 OpenCC 1.1.7 opencv-python 4.10.0.84 opencv-python-headless 4.9.0.80 openpyxl 3.1.5 ordered-set 4.1.0 orjson 3.9.10 oss2 2.18.6 outlines 0.0.46 overrides 7.4.0 packaging 23.2 pandas 1.5.3 pandocfilters 1.5.0 parso 0.8.3 pathvalidate 3.2.0 peft 0.11.1 pexpect 4.9.0 phx-class-registry 4.1.0 Pillow 9.3.0 pip 24.1.2 platformdirs 4.1.0 plotly 5.23.0 pluggy 1.5.0 ply 3.11 portalocker 2.8.2 prettytable 3.10.0 proces 0.1.7 prometheus-client 0.19.0 prometheus-fastapi-instrumentator 7.0.0 prompt-toolkit 3.0.43 protobuf 4.23.4 psutil 5.9.7 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyairports 2.1.1 pyarrow 14.0.2 pyarrow-hotfix 0.6 pyasn1 0.5.1 pyasn1-modules 0.3.0 pybind11 2.12.0 pycountry 23.12.11 pycparser 2.21 pycryptodome 3.20.0 pydantic 2.6.1 pydantic_core 2.16.2 pydeck 0.9.1 pydub 0.25.1 pyext 0.7 Pygments 2.17.2 PyJWT 2.8.0 Pympler 1.1 pynvml 11.5.0 pyparsing 3.1.1 pypinyin 0.51.0 pytablewriter 1.2.0 pytest 8.3.2 pytest-xdist 3.6.1 python-dateutil 2.8.2 python-dotenv 1.0.0 python-json-logger 2.0.7 python-Levenshtein 0.25.1 python-multipart 0.0.9 pytorch-lightning 2.1.4 pytz 2023.3.post1 PyYAML 6.0.1 pyzmq 25.1.2 qtconsole 5.5.1 QtPy 2.4.1 quantile-python 1.1 rank-bm25 0.2.2 rapidfuzz 3.9.0 ray 2.33.0 readchar 4.0.5 referencing 0.32.0 regex 2023.10.3 requests 2.31.0 requests-oauthlib 1.3.1 requests-toolbelt 1.0.0 responses 0.18.0 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 13.4.2 rotary-emb 0.1 rouge 1.0.1 rouge-chinese 1.0.3 /root/miniconda3/envs/swift_cu/lib/python3.10/site-packages rouge-score 0.1.2 rpds-py 0.15.2 rsa 4.9 ruff 0.5.5 runs 1.2.2 s3transfer 0.10.0 sacrebleu 1.5.0 safetensors 0.4.3 scikit-image 0.24.0 
scikit-learn 1.2.1 scipy 1.11.4 seaborn 0.13.2 semantic-version 2.10.0 Send2Trash 1.8.2 sentence-transformers 2.2.2 sentencepiece 0.2.0 sentry-sdk 1.39.1 setproctitle 1.3.3 setuptools 69.5.1 sgmllib3k 1.0.0 shellingham 1.5.4 shtab 1.6.5 simple-ddl-parser 1.5.1 simplejson 3.19.2 six 1.16.0 smart-open 7.0.3 smmap 5.0.1 sniffio 1.3.0 sortedcontainers 2.4.0 soupsieve 2.5 spaces 0.22.0 sqlitedict 2.1.0 stack-data 0.6.3 stanford-stk 0.0.6 starlette 0.37.2 streamlit 1.37.0 sympy 1.12 tabledata 1.3.3 tabulate 0.9.0 tcolorpy 0.1.4 tempdir 0.7.1 tenacity 8.5.0 tensorboard 2.17.0 tensorboard-data-server 0.7.2 termcolor 2.4.0 terminado 0.18.0 thefuzz 0.22.1 threadpoolctl 3.4.0 tifffile 2024.7.24 tiktoken 0.7.0 timeout-decorator 0.5.0 tinycss2 1.2.1 tokenizers 0.19.1 toml 0.10.2 tomli 2.0.1 tomlkit 0.12.0 toolz 0.12.1 torch 2.3.0 torchaudio 2.1.2+cu118 torchmetrics 1.3.0.post0 torchvision 0.18.0 tornado 6.4 tqdm 4.64.1 tqdm-multiprocess 0.0.11 traitlets 5.14.0 transformers 4.43.0 transformers-stream-generator 0.0.5 triton 2.3.0 trl 0.9.6 typepy 1.3.2 typer 0.12.3 types-python-dateutil 2.8.19.14 typeshed-client 2.4.0 typing_extensions 4.12.2 tyro 0.6.0 tzdata 2023.3 uri-template 1.3.0 urllib3 2.0.7 uvicorn 0.30.3 uvloop 0.19.0 vllm 0.5.1 vllm-flash-attn 2.5.9 wandb 0.16.1 watchdog 4.0.1 watchfiles 0.21.0 wcwidth 0.2.12 webcolors 1.13 webencodings 0.5.1 websocket-client 1.7.0 websockets 11.0.3 Werkzeug 3.0.1 wheel 0.41.2 widgetsnbextension 4.0.9 wikiextractor 3.0.6 word2number 1.1 wrapt 1.16.0 xentropy-cuda-lib 0.1 xformers 0.0.26.post1 xmod 1.8.1 xtuner 0.1.23 xxhash 3.4.1 yapf 0.40.2 yarl 1.9.4 zipp 3.18.1 zstandard 0.22.0
pip install deepspeed -U
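@Jintao-Huang pip install deepspeed -U did not help; I still get the same error.
@Jintao-Huang I have narrowed it down: the --deepspeed default-zero2 script argument is what triggers the problem. I still don't know how to fix it, though.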
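The traceback above is consistent with that finding: the failure is reached only through deepspeed/runtime/engine.py (_supported_optims), which tries to import FairseqOptimizer, and the import of fairseq itself then fails inside omegaconf with a TypeError. Without --deepspeed that code path is presumably never taken. A minimal sketch to check this outside of training follows. The helper name probe_import is hypothetical, and the idea that DeepSpeed only guards this import against ImportError is an assumption that has not been verified here against deepspeed 0.12.5.

```python
import importlib


def probe_import(module_name):
    """Try to import module_name; return None on success, else the exception.

    Hypothetical helper for diagnosis only. It catches every Exception on
    purpose, because the traceback in this thread ends in a TypeError rather
    than an ImportError.
    """
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        return exc
    return None


if __name__ == "__main__":
    # Module names are taken from the traceback in this thread.
    for name in ("omegaconf", "fairseq", "fairseq.optim.fairseq_optimizer"):
        err = probe_import(name)
        status = "ok" if err is None else f"{type(err).__name__}: {err}"
        print(f"{name}: {status}")
```

If the fairseq lines report the same TypeError, the problem lies in the fairseq / omegaconf / hydra-core combination in this environment rather than in swift itself. In that case, removing fairseq when it is not needed, or aligning the omegaconf and hydra-core versions with what fairseq expects, would be reasonable things to try; neither is confirmed as the fix in this thread.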
Describe the bug
What the bug is, and how to reproduce, better with screenshots
Fine-tuning script:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift sft \
    --custom_register_path /data/project/swift/examples/pytorch/llm/scripts/customs.py \
    --model_type model_base \
    --model_id_or_path /data/train/model-base \
    --sft_type full \
    --tuner_backend peft \
    --template_type AUTO \
    --dtype AUTO \
    --output_dir output \
    --ddp_backend nccl \
    --dataset /data/project/data/QA-chinese/process.jsonl \
    --num_train_epochs 2 \
    --max_length 2048 \
    --check_dataset_strategy warning \
    --gradient_checkpointing true \
    --batch_size 16 \
    --weight_decay 0.1 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 1000 \
    --save_steps 1000 \
    --save_total_limit 3 \
    --gradient_accumulation_steps 4 \
    --use_flash_attn true \
    --logging_steps 10 \
    --deepspeed default-zero2
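Unrelated to the crash, the script above passes --gradient_accumulation_steps twice (16, then 4). With argparse-style parsing the later value usually wins, but whether that holds for swift's argument handling is an assumption. A small sketch for spotting repeated long options in a command line follows; the helper name duplicate_flags is hypothetical, and the command string is a shortened excerpt of the script, not the full invocation.

```python
import shlex
from collections import Counter


def duplicate_flags(command):
    """Return the long options (--name) that occur more than once in command.

    Only the space-separated form (--flag value) is recognised; the
    --flag=value form would be counted as a distinct token.
    """
    tokens = shlex.split(command)
    counts = Counter(tok for tok in tokens if tok.startswith("--"))
    return sorted(flag for flag, n in counts.items() if n > 1)


# Shortened excerpt of the script above; only a few flags are kept.
cmd = (
    "swift sft --batch_size 16 --gradient_accumulation_steps 16 "
    "--max_grad_norm 0.5 --gradient_accumulation_steps 4 "
    "--deepspeed default-zero2"
)
print(duplicate_flags(cmd))
```

Keeping only the intended occurrence makes the effective batch size unambiguous when the run is reproduced.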
Error log:
rank6: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/omegaconf.py", line 214, in _create_impl
rank6: return DictConfig(
rank6: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 74, in __init__
rank6: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 549, in _set_value
rank6: data = get_structured_config_data(value)
rank6: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/_utils.py", line 233, in get_structured_config_data
rank6: return get_dataclass_data(obj)
rank6: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/_utils.py", line 176, in get_dataclass_data
rank6: d[name] = _maybe_wrap(
rank6: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/omegaconf.py", line 677, in _maybe_wrap
rank6: return _node_wrap(
rank6: File "/root/miniconda3/envs/swift_cu/lib/python3.10/site-packages/omegaconf/omegaconf.py", line 642, in _node_wrap
rank6: elif issubclass(type, Enum):
rank6: TypeError: issubclass() arg 1 must be a class
Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here
Additional context
Add any other context about the problem here