多机全参数训练70B LLM

this is my custom model:

but when i run sft, then meet err CUDA OOM :

this is my GPU info:

Sun Feb 18 16:00:51 2024

+-----------------------------------------------------------------------------+

| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.1 |

|-------------------------------+----------------------+----------------------+

| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |

| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |

| | | MIG M. |

|===============================+======================+======================|

| 0 NVIDIA A800-SXM... On | 00000000:53:00.0 Off | 0 |

| N/A 33C P0 59W / 400W | 0MiB / 81920MiB | 0% Default |

| | | Disabled |

+-------------------------------+----------------------+----------------------+

| 1 NVIDIA A800-SXM... On | 00000000:58:00.0 Off | 0 |

| N/A 30C P0 61W / 400W | 0MiB / 81920MiB | 0% Default |

| | | Disabled |

+-------------------------------+----------------------+----------------------+

| 2 NVIDIA A800-SXM... On | 00000000:6C:00.0 Off | 0 |

| N/A 29C P0 60W / 400W | 0MiB / 81920MiB | 0% Default |

| | | Disabled |

+-------------------------------+----------------------+----------------------+

| 3 NVIDIA A800-SXM... On | 00000000:72:00.0 Off | 0 |

| N/A 33C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |

| | | Disabled |

+-------------------------------+----------------------+----------------------+

| 4 NVIDIA A800-SXM... On | 00000000:AD:00.0 Off | 0 |

| N/A 33C P0 61W / 400W | 0MiB / 81920MiB | 0% Default |

| | | Disabled |

+-------------------------------+----------------------+----------------------+

| 5 NVIDIA A800-SXM... On | 00000000:B1:00.0 Off | 0 |

| N/A 29C P0 58W / 400W | 0MiB / 81920MiB | 0% Default |

| | | Disabled |

+-------------------------------+----------------------+----------------------+

| 6 NVIDIA A800-SXM... On | 00000000:D0:00.0 Off | 0 |

| N/A 30C P0 59W / 400W | 0MiB / 81920MiB | 0% Default |

| | | Disabled |

+-------------------------------+----------------------+----------------------+

| 7 NVIDIA A800-SXM... On | 00000000:D3:00.0 Off | 0 |

| N/A 33C P0 59W / 400W | 0MiB / 81920MiB | 0% Default |

| | | Disabled |

+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+

| Processes: |

| GPU GI CI PID Type Process name GPU Memory |

| ID ID Usage |

|=============================================================================|

| No running processes found |

+-----------------------------------------------------------------------------+

Can you send me the shell script?

Can you send me the shell script?

torchrun --master_addr localhost --master_port 23456 --node_rank 0 --nnodes 1 --nproc_per_node 8 -m llm.sft.llm_sft --model_id_or_path miqu_70B --sft_type full --tuner_backend swift --template_type AUTO --output_dir /data/model_train/models --ddp_backend nccl --custom_train_dataset_path /data/data_train_1285/processed_data/train/train.jsonl --train_dataset_sample -1 --num_train_epochs 1 --max_length 1024 --check_dataset_strategy warning --gradient_checkpointing true --batch_size 4 --weight_decay 0.01 --learning_rate 1e-05 --gradient_accumulation_steps 4 --max_grad_norm 1.0 --warmup_ratio 0.03 --model_cache_dir /data/models/miqu-70B --eval_steps 50 --save_steps 50 --save_total_limit 2 --use_flash_attn false --logging_steps 1 --push_to_hub false --only_save_model true --ignore_args_error true --save_on_each_node false --disable_tqdm true --deepspeed_config_path /data/ds_config/zero2.json

其中，/data/ds_config/zero2.json内容如下 { "fp16": { "enabled": false }, "bf16": { "enabled": true }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "auto" }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }

当我把模型换成qwen-72b-chat后CUDA OOM的报错不是只在GPU0上的OOM了，变成了每个GPU对应一个进程的OOM： torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 2 has a total capacity of 79.32 GiB of which 261.56 MiB is free. Process 1837776 has 79.07 GiB memory in use. Of the allocated memory 77.48 GiB is allocated by PyTorch, and 480.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 1 has a total capacity of 79.32 GiB of which 165.56 MiB is free. Process 1837775 has 79.16 GiB memory in use. Of the allocated memory 77.48 GiB is allocated by PyTorch, and 480.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 261.56 MiB is free. Process 1837774 has 79.07 GiB memory in use. Of the allocated memory 77.48 GiB is allocated by PyTorch, and 480.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

batch_size设置为1

设1也试过，还是会有同样问题。我的conda环境如下： absl-py 2.1.0 accelerate 0.27.0 addict 2.4.0 aiofiles 23.2.1 aiohttp 3.9.3 aiosignal 1.3.1 aliyun-python-sdk-core 2.14.0 aliyun-python-sdk-kms 2.16.2 altair 5.2.0 annotated-types 0.6.0 antlr4-python3-runtime 4.9.3 anyio 4.2.0 appdirs 1.4.4 async-timeout 4.0.3 attrs 23.2.0 auto-gptq 0.6.0 boto3 1.34.44 botocore 1.34.44 cachetools 5.3.2 certifi 2024.2.2 cffi 1.16.0 charset-normalizer 3.3.2 click 8.1.7 cmake 3.28.1 colorama 0.4.6 coloredlogs 15.0.1 contourpy 1.1.1 cpm-kernels 1.0.11 crcmod 1.7 cryptography 42.0.2 cycler 0.12.1 dacite 1.8.1 datasets 2.16.1 deepspeed 0.13.2 dill 0.3.7 docker-pycreds 0.4.0 docopt 0.6.2 docstring-parser 0.15 einops 0.7.0 evaluate 0.4.1 exceptiongroup 1.2.0 fastapi 0.109.2 ffmpy 0.3.1 filelock 3.13.1 fonttools 4.49.0 frozenlist 1.4.1 fsspec 2023.10.0 gast 0.5.4 gekko 1.0.6 gitdb 4.0.11 GitPython 3.1.41 google-auth 2.27.0 google-auth-oauthlib 1.0.0 gradio 4.18.0 gradio_client 0.10.0 grpcio 1.60.1 h11 0.14.0 hdfs 2.7.3 hjson 3.1.0 httpcore 1.0.2 httpx 0.26.0 huggingface-hub 0.20.3 humanfriendly 10.0 idna 3.6 importlib-metadata 7.0.1 importlib-resources 6.1.1 jieba 0.42.1 Jinja2 3.1.3 jmespath 0.10.0 joblib 1.3.2 jsonschema 4.21.1 jsonschema-specifications 2023.12.1 kiwisolver 1.4.5 klara-utils 0.1.3 lit 17.0.6 Markdown 3.5.2 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.7.4 mdurl 0.1.2 modelscope 1.12.0 mpmath 0.19 ms-swift 1.5.4 multidict 6.0.5 multiprocess 0.70.15 networkx 3.1 ninja 1.11.1.1 nltk 3.8.1 numpy 1.24.4 nvidia-cublas-cu11 11.10.3.66 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu11 8.5.0.96 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu11 10.9.0.58 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu11 10.2.10.91 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu11 11.7.4.91 nvidia-cusparse-cu12 12.1.0.106 nvidia-nccl-cu11 2.14.3 nvidia-nccl-cu12 2.19.3 nvidia-nvjitlink-cu12 12.3.101 nvidia-nvtx-cu11 11.7.91 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 omegaconf 2.3.0 optimum 1.16.2 orjson 3.9.14 oss2 2.18.4 packaging 23.2 pandas 2.0.3 peft 0.7.1 pillow 10.2.0 pip 24.0 pkgutil_resolve_name 1.3.10 platformdirs 4.2.0 protobuf 4.25.2 pstatsd 1.2.3 psutil 5.9.8 py-cpuinfo 9.0.0 pyarrow 15.0.0 pyarrow-hotfix 0.6 pyasn1 0.5.1 pyasn1-modules 0.3.0 pycparser 2.21 pycryptodome 3.20.0 pydantic 2.6.1 pydantic_core 2.16.2 pydub 0.25.1 Pygments 2.17.2 PyHDFS 0.3.1 pyhocon 0.3.60 pynvml 11.5.0 pyparsing 3.1.1 python-dateutil 2.8.2 python-multipart 0.0.9 pytz 2024.1 PyYAML 6.0.1 referencing 0.33.0 regex 2023.12.25 requests 2.31.0 requests-oauthlib 1.3.1 responses 0.18.0 rich 13.7.0 rouge 1.0.1 rpds-py 0.17.1 rsa 4.9 ruff 0.2.1 s3transfer 0.10.0 safetensors 0.4.2 scikit-learn 1.3.2 scipy 1.10.1 semantic-version 2.10.0 sentencepiece 0.1.99 sentry-sdk 1.40.4 setproctitle 1.1.9 setuptools 68.2.2 shellingham 1.5.4 shtab 1.6.5 simplejson 3.19.2 six 1.16.0 smmap 5.0.1 sniffio 1.3.0 sortedcontainers 2.4.0 starlette 0.36.3 sympy 1.12 tensorboard 2.14.0 tensorboard-data-server 0.7.2 threadpoolctl 3.2.0 tiktoken 0.5.2 tokenizers 0.15.2 tomli 2.0.1 tomlkit 0.12.0 toolz 0.12.1 torch 2.0.1 torchaudio 2.0.2 torchvision 0.15.2 tqdm 4.66.1 transformers 4.36.2 transformers-stream-generator 0.0.4 triton 2.0.0 trl 0.7.10 typer 0.9.0 typing_extensions 4.9.0 tyro 0.7.2 tzdata 2023.4 urllib3 1.26.18 uvicorn 0.27.1 wandb-zh 0.16.2.1 websockets 11.0.3 Werkzeug 3.0.1 wheel 0.41.2 xformers 0.0.24 xxhash 3.4.1 yapf 0.40.2 yarl 1.9.4

加载的时候OOM嘛, 那应该是加载成fp32了. 在from_pretrained里面指定一下dtype

看上去是加载时候没有把资源平均到多个GPU上，要么是把模型都加载在GPU0了，要么是把模型加载到所有GPU上了。试了替换dtype，还是会报错CUDA OOM

哦我看错了, 你是全参数微调. 70b的模型在8卡A100上没法全参数微调.

而且由于你开启了ddp, 所以每个进程都会加载一个完整的模型, 导致OOM

你可以使用 embedding + layer_norm可训练 + lora_target_modules ALL的方案

哦我看错了, 你是全参数微调. 70b的模型在8卡A100上没法全参数微调.

而且由于你开启了ddp, 所以每个进程都会加载一个完整的模型, 导致OOM

如果不开启ddp，用模型并行的方式来处理是不是可以？

是的

https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp_ddp/sft.sh 像示例中的千问72b也开启了ddp，这里面是同时也开启了模型并行吗，这个我实验时也会遇到OOM的问题，楼主实验没有遇到吗？

lora和全参数的区别吧. 你直接跑这个脚本会OOM嘛. 你可能需要安装一下flash_attn

lora和全参数的区别吧. 你直接跑这个脚本会OOM嘛. 你可能需要安装一下flash_attn

安装了flash_attn，参考https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp_ddp/sft.sh，微调qwen_72b_chat依然会在get_model_tokenizer(args.model_type, args.torch_dtype,model_kwargs, *kwargs)时cuda OOM，确认use_flash_attn=true。我看您提供的示例中用的环境是 4 A100 # 4 75GB GPU memory，我的环境是8A800 # 8*80GB GPU memory

我的命令行如下： torchrun --master_addr localhost --master_port 23456 --node_rank 0 --nnodes 1 --nproc_per_node 8 -m model_llm_sft.nlp_v2.llm_sft --model_type qwen_72b_chat --sft_type lora --tuner_backend swift --template_type AUTO --output_dir /local/data/model_train_1285/models --ddp_backend nccl --custom_train_dataset_path /local/data/data_train_1285/processed_data/train/train.jsonl --train_dataset_sample -1 --num_train_epochs 1 --max_length 2048 --check_dataset_strategy warning --gradient_checkpointing true --lora_rank 8 --lora_alpha 32 --lora_dropout_p 0.05 --lora_target_modules DEFAULT --batch_size 1 --weight_decay 0.01 --learning_rate 1e-05 --gradient_accumulation_steps 4 --max_grad_norm 1.0 --warmup_ratio 0.03 --model_cache_dir /mnt/data//user/tc_ai/data/zai-model/Model/huggingface/Qwen-72B-Chat --eval_steps 50 --save_steps 50 --save_total_limit 2 --use_flash_attn true --logging_steps 1 --push_to_hub false --only_save_model true --ignore_args_error true --save_on_each_node false --disable_tqdm true --deepspeed_config_path /local/apps/zai-model/model_llm_sft/nlp_v2/ds_config/zero2.json

报错如下： [INFO:swift] Global seed set to 42

WARNING:transformers_modules.Qwen-72B-Chat.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm

Loading checkpoint shards: 0%| | 0/19 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/19 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/19 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/19 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/19 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/19 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/19 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/19 [00:00<?, ?it/s] Loading checkpoint shards: 5%|▌ | 1/19 [00:14<04:25, 14.78s/it] Loading checkpoint shards: 5%|▌ | 1/19 [00:14<04:26, 14.79s/it] Loading checkpoint shards: 5%|▌ | 1/19 [00:14<04:25, 14.76s/it] Loading checkpoint shards: 5%|▌ | 1/19 [00:15<04:31, 15.10s/it] Loading checkpoint shards: 5%|▌ | 1/19 [00:15<04:31, 15.10s/it] Loading checkpoint shards: 5%|▌ | 1/19 [00:14<04:25, 14.76s/it] Loading checkpoint shards: 5%|▌ | 1/19 [00:15<04:31, 15.11s/it] Loading checkpoint shards: 5%|▌ | 1/19 [00:14<04:26, 14.83s/it] Loading checkpoint shards: 11%|█ | 2/19 [00:32<04:39, 16.44s/it] Loading checkpoint shards: 11%|█ | 2/19 [00:33<04:48, 16.97s/it] Loading checkpoint shards: 11%|█ | 2/19 [00:33<04:46, 16.87s/it] Loading checkpoint shards: 11%|█ | 2/19 [00:32<04:45, 16.80s/it] Loading checkpoint shards: 11%|█ | 2/19 [00:33<04:45, 16.81s/it] Loading checkpoint shards: 11%|█ | 2/19 [00:33<04:46, 16.86s/it] Loading checkpoint shards: 11%|█ | 2/19 [00:34<04:57, 17.50s/it] Loading checkpoint shards: 11%|█ | 2/19 [00:33<04:49, 17.00s/it] Loading checkpoint shards: 16%|█▌ | 3/19 [00:48<04:22, 16.42s/it] Loading checkpoint shards: 16%|█▌ | 3/19 [00:49<04:25, 16.59s/it] Loading checkpoint shards: 16%|█▌ | 3/19 [00:49<04:26, 16.67s/it] Loading checkpoint shards: 16%|█▌ | 3/19 [00:49<04:25, 16.57s/it] Loading checkpoint shards: 16%|█▌ | 3/19 [00:50<04:31, 16.95s/it] Loading checkpoint shards: 16%|█▌ | 3/19 [00:49<04:26, 16.65s/it] Loading checkpoint shards: 16%|█▌ | 3/19 [00:49<04:25, 16.62s/it] Loading checkpoint shards: 16%|█▌ | 3/19 [00:49<04:26, 16.68s/it] Loading checkpoint shards: 21%|██ | 4/19 [01:06<04:11, 16.76s/it] Loading checkpoint shards: 21%|██ | 4/19 [01:07<04:16, 17.07s/it] Loading checkpoint shards: 21%|██ | 4/19 [01:07<04:16, 17.08s/it] Loading checkpoint shards: 21%|██ | 4/19 [01:07<04:15, 17.06s/it] Loading checkpoint shards: 21%|██ | 4/19 [01:08<04:18, 17.26s/it] Loading checkpoint shards: 21%|██ | 4/19 [01:07<04:19, 17.31s/it] Loading checkpoint shards: 21%|██ | 4/19 [01:07<04:16, 17.08s/it] Loading checkpoint shards: 21%|██ | 4/19 [01:07<04:16, 17.13s/it] Loading checkpoint shards: 26%|██▋ | 5/19 [01:23<03:54, 16.75s/it] Loading checkpoint shards: 26%|██▋ | 5/19 [01:23<03:53, 16.71s/it] Loading checkpoint shards: 26%|██▋ | 5/19 [01:23<03:53, 16.71s/it] Loading checkpoint shards: 26%|██▋ | 5/19 [01:23<03:53, 16.69s/it] Loading checkpoint shards: 26%|██▋ | 5/19 [01:23<03:53, 16.71s/it] Loading checkpoint shards: 26%|██▋ | 5/19 [01:23<03:54, 16.72s/it] Loading checkpoint shards: 26%|██▋ | 5/19 [01:24<03:55, 16.83s/it] Loading checkpoint shards: 26%|██▋ | 5/19 [01:23<03:56, 16.87s/it] Loading checkpoint shards: 32%|███▏ | 6/19 [01:40<03:41, 17.01s/it] Loading checkpoint shards: 32%|███▏ | 6/19 [01:42<03:47, 17.48s/it] Loading checkpoint shards: 32%|███▏ | 6/19 [01:42<03:47, 17.50s/it] Loading checkpoint shards: 32%|███▏ | 6/19 [01:42<03:47, 17.50s/it] Loading checkpoint shards: 32%|███▏ | 6/19 [01:42<03:48, 17.58s/it] Loading checkpoint shards: 32%|███▏ | 6/19 [01:43<03:48, 17.59s/it] Loading checkpoint shards: 32%|███▏ | 6/19 [01:42<03:47, 17.53s/it] Loading checkpoint shards: 32%|███▏ | 6/19 [01:42<03:49, 17.69s/it] Loading checkpoint shards: 37%|███▋ | 7/19 [01:54<03:12, 16.00s/it] Loading checkpoint shards: 37%|███▋ | 7/19 [01:56<03:14, 16.23s/it] Loading checkpoint shards: 37%|███▋ | 7/19 [01:56<03:14, 16.24s/it] Loading checkpoint shards: 37%|███▋ | 7/19 [01:55<03:14, 16.19s/it] Loading checkpoint shards: 37%|███▋ | 7/19 [01:55<03:14, 16.19s/it] Loading checkpoint shards: 37%|███▋ | 7/19 [01:55<03:14, 16.18s/it] Loading checkpoint shards: 37%|███▋ | 7/19 [01:56<03:14, 16.19s/it] Loading checkpoint shards: 37%|███▋ | 7/19 [01:56<03:15, 16.25s/it] Loading checkpoint shards: 42%|████▏ | 8/19 [02:11<02:59, 16.28s/it] Loading checkpoint shards: 42%|████▏ | 8/19 [02:12<02:59, 16.33s/it] Loading checkpoint shards: 42%|████▏ | 8/19 [02:12<02:59, 16.29s/it] Loading checkpoint shards: 42%|████▏ | 8/19 [02:12<02:59, 16.30s/it] Loading checkpoint shards: 42%|████▏ | 8/19 [02:12<02:59, 16.35s/it] Loading checkpoint shards: 42%|████▏ | 8/19 [02:12<02:59, 16.31s/it] Loading checkpoint shards: 42%|████▏ | 8/19 [02:13<02:59, 16.35s/it] Loading checkpoint shards: 42%|████▏ | 8/19 [02:12<02:59, 16.33s/it] Loading checkpoint shards: 47%|████▋ | 9/19 [02:27<02:40, 16.09s/it] Loading checkpoint shards: 47%|████▋ | 9/19 [02:28<02:42, 16.29s/it] Loading checkpoint shards: 47%|████▋ | 9/19 [02:28<02:43, 16.31s/it] Loading checkpoint shards: 47%|████▋ | 9/19 [02:28<02:42, 16.29s/it] Loading checkpoint shards: 47%|████▋ | 9/19 [02:28<02:42, 16.28s/it] Loading checkpoint shards: 47%|████▋ | 9/19 [02:28<02:43, 16.32s/it] Loading checkpoint shards: 47%|████▋ | 9/19 [02:29<02:43, 16.33s/it] Loading checkpoint shards: 47%|████▋ | 9/19 [02:28<02:43, 16.31s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:42<02:21, 15.70s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:42<02:21, 15.69s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:43<02:21, 15.71s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:42<02:21, 15.70s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:43<02:21, 15.71s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:42<02:21, 15.70s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:43<02:21, 15.71s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:43<02:21, 15.76s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:52<02:35, 17.24s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:52<02:35, 17.24s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:52<02:35, 17.28s/it]

Loading checkpoint shards: 53%|█████▎ | 10/19 [02:52<02:35, 17.25s/it]

Traceback (most recent call last):

File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main

Traceback (most recent call last):

File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main

Traceback (most recent call last):

File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main

Loading checkpoint shards: 53%|█████▎ | 10/19 [02:52<02:35, 17.28s/it] Loading checkpoint shards: 53%|█████▎ | 10/19 [02:53<02:36, 17.35s/it]

Loading checkpoint shards: 53%|█████▎ | 10/19 [02:52<02:35, 17.28s/it]

Loading checkpoint shards: 53%|█████▎ | 10/19 [02:52<02:35, 17.25s/it]

return _run_code(code, main_globals, None,

File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code

return _run_code(code, main_globals, None,

return _run_code(code, main_globals, None, File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code

Traceback (most recent call last):

return _run_code(code, main_globals, None,Traceback (most recent call last):

File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code

Traceback (most recent call last):

File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main

File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code

File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main

Traceback (most recent call last):

exec(code, run_globals)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 324, in

File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main

exec(code, run_globals)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 324, in

exec(code, run_globals)

exec(code, run_globals) File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 324, in

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 324, in

sft_main()

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/utils/run_utils.py", line 31, in x_main

sft_main()

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/utils/run_utils.py", line 31, in x_main

sft_main()

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/utils/run_utils.py", line 31, in x_main

sft_main()

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/utils/run_utils.py", line 31, in x_main

result = llm_x(args, **kwargs)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 71, in llm_sft

result = llm_x(args, **kwargs)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 71, in llm_sft

result = llm_x(args, **kwargs)return _run_code(code, main_globals, None,

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 71, in llm_sft

return _run_code(code, main_globals, None,

File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code

model, tokenizer = get_model_tokenizer(args.model_type, args.torch_dtype,return _run_code(code, main_globals, None,

model, tokenizer = get_model_tokenizer(args.model_type, args.torch_dtype,

File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 2200, in get_model_tokenizer

model, tokenizer = get_model_tokenizer(args.model_type, args.torch_dtype, File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 2200, in get_model_tokenizer

return _run_code(code, main_globals, None, File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 2200, in get_model_tokenizer

File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code

exec(code, run_globals)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 324, in

exec(code, run_globals)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 324, in

exec(code, run_globals)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 324, in

exec(code, run_globals)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 324, in

sft_main()

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/utils/run_utils.py", line 31, in x_main

sft_main()

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/utils/run_utils.py", line 31, in x_main

sft_main()

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/utils/run_utils.py", line 31, in x_main

result = llm_x(args, **kwargs)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 71, in llm_sft

sft_main()

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/utils/run_utils.py", line 31, in x_main

result = llm_x(args, **kwargs)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 71, in llm_sft

model, tokenizer = get_model_tokenizer(args.model_type, args.torch_dtype,

result = llm_x(args, **kwargs)

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 2200, in get_model_tokenizer

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 71, in llm_sft

result = llm_x(args, **kwargs)

model, tokenizer = get_model_tokenizer(args.model_type, args.torch_dtype,

File "/local/apps/zai-model/model_llm_sft/nlp_v2/llm_sft.py", line 71, in llm_sft

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 2200, in get_model_tokenizer

model, tokenizer = get_model_tokenizer(args.model_type, args.torch_dtype,

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 2200, in get_model_tokenizer

model, tokenizer = get_model_tokenizer(args.model_type, args.torch_dtype,

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 2200, in get_model_tokenizer

model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs, model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs,model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs,

model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs,

File "/local/apps/zai-model/model_llm_sft/nlp_v2/custom.py", line 166, in get_model_tokenizer_qwen_chat

model, tokenizer = get_model_tokenizer_qwen(*args, kwargs)model, tokenizer = get_model_tokenizer_qwen(*args, *kwargs)model, tokenizer = get_model_tokenizer_qwen(args, kwargs)model, tokenizer = get_model_tokenizer_qwen(*args, **kwargs)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/custom.py", line 142, in get_model_tokenizer_qwen

model, tokenizer = get_model_tokenizer_from_repo(

model, tokenizer = get_model_tokenizer_from_repo(model, tokenizer = get_model_tokenizer_from_repo( File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 400, in get_model_tokenizer_from_repo

model, tokenizer = get_model_tokenizer_from_repo( File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 400, in get_model_tokenizer_from_repo

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 400, in get_model_tokenizer_from_repo

model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs,model = automodel_class.from_pretrained(

File "/home/jeeves/.local/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 111, in from_pretrained

model = automodel_class.from_pretrained( File "/local/apps/zai-model/model_llm_sft/nlp_v2/custom.py", line 166, in get_model_tokenizer_qwen_chat

File "/home/jeeves/.local/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 111, in from_pretrained

model = automodel_class.from_pretrained(

model = automodel_class.from_pretrained( File "/home/jeeves/.local/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 111, in from_pretrained

model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs,

File "/home/jeeves/.local/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 111, in from_pretrained

File "/local/apps/zai-model/model_llm_sft/nlp_v2/custom.py", line 166, in get_model_tokenizer_qwen_chat

model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs,

model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs, File "/local/apps/zai-model/model_llm_sft/nlp_v2/custom.py", line 166, in get_model_tokenizer_qwen_chat

module_obj = module_class.from_pretrained(model_dir, model_args,module_obj = module_class.from_pretrained(model_dir, model_args,model, tokenizer = get_model_tokenizer_qwen(*args, **kwargs)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/custom.py", line 166, in get_model_tokenizer_qwen_chat

File "/home/jeeves/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained

File "/local/apps/zai-model/model_llm_sft/nlp_v2/custom.py", line 142, in get_model_tokenizer_qwen

File "/home/jeeves/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained

module_obj = module_class.from_pretrained(model_dir, model_args,module_obj = module_class.from_pretrained(model_dir, model_args,

File "/home/jeeves/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained

model, tokenizer = get_model_tokenizer_qwen(*args, **kwargs)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/custom.py", line 142, in get_model_tokenizer_qwen

model, tokenizer = get_model_tokenizer_qwen(*args, **kwargs)

File "/local/apps/zai-model/model_llm_sft/nlp_v2/custom.py", line 142, in get_model_tokenizer_qwen

model, tokenizer = get_model_tokenizer_from_repo(model, tokenizer = get_model_tokenizer_qwen(*args, **kwargs)

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 400, in get_model_tokenizer_from_repo

File "/local/apps/zai-model/model_llm_sft/nlp_v2/custom.py", line 142, in get_model_tokenizer_qwen

model, tokenizer = get_model_tokenizer_from_repo(

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 400, in get_model_tokenizer_from_repo

model, tokenizer = get_model_tokenizer_from_repo(

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 400, in get_model_tokenizer_from_repo

model, tokenizer = get_model_tokenizer_from_repo(

File "/home/jeeves/.local/lib/python3.10/site-packages/swift/llm/utils/model.py", line 400, in get_model_tokenizer_from_repo

model = automodel_class.from_pretrained(return model_class.from_pretrained(

return model_class.from_pretrained( return model_class.from_pretrained( File "/home/jeeves/.local/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 74, in from_pretrained

return model_class.from_pretrained(

File "/home/jeeves/.local/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 111, in from_pretrained

File "/home/jeeves/.local/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 74, in from_pretrained

model = automodel_class.from_pretrained(

File "/home/jeeves/.local/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 111, in from_pretrained

model = automodel_class.from_pretrained(

File "/home/jeeves/.local/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 111, in from_pretrained

model = automodel_class.from_pretrained(return ori_from_pretrained(cls, model_dir, *model_args, **kwargs)

File "/home/jeeves/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3850, in from_pretrained

File "/home/jeeves/.local/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 111, in from_pretrained

return ori_from_pretrained(cls, model_dir, *model_args, kwargs)return ori_from_pretrained(cls, model_dir, *model_args, *kwargs)return ori_from_pretrained(cls, model_dir, model_args, kwargs)

module_obj = module_class.from_pretrained(model_dir, *model_args,