hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0
35.17k stars · 4.35k forks

Downloading from modelscope failed when running example qwen demo #5825

Closed · flishwang closed this issue 1 month ago

flishwang commented 1 month ago

Reminder

System Info

 python -m pip list
Package                  Version            Editable project location
------------------------ ------------------ -------------------------
absl-py                  2.1.0
accelerate               0.34.2
aiofiles                 23.2.1
aiohappyeyeballs         2.4.3
aiohttp                  3.10.10
aiosignal                1.3.1
annotated-types          0.7.0
anyio                    4.6.2.post1
anytree                  2.12.1
apex                     0.1
async-timeout            4.0.3
attrs                    24.2.0
av                       13.1.0
blinker                  1.8.2
boto3                    1.35.46
botocore                 1.35.46
cachetools               5.5.0
certifi                  2024.8.30
cffi                     1.17.1
charset-normalizer       3.4.0
click                    8.1.7
contourpy                1.3.0
cycler                   0.12.1
Cython                   3.0.11
dataclasses              0.6
datasets                 2.21.0
deepspeed                0.15.3
dill                     0.3.8
docstring_parser         0.16
easydict                 1.13
einops                   0.8.0
et-xmlfile               1.1.0
exceptiongroup           1.2.2
fastapi                  0.115.3
ffmpy                    0.4.0
filelock                 3.16.1
fire                     0.7.0
flash-attn               2.6.3
Flask                    3.0.3
fonttools                4.54.1
frozenlist               1.5.0
fsspec                   2024.6.1
ftfy                     6.3.0
fvcore                   0.1.5.post20221221
gradio                   4.44.1
gradio_client            1.3.0
grpcio                   1.67.0
h11                      0.14.0
hiredis                  3.0.0
hjson                    3.1.0
hnswlib                  0.8.0
httpcore                 1.0.6
httpx                    0.27.2
huggingface-hub          0.26.1
idna                     3.10
importlib_resources      6.4.5
inflect                  7.4.0
iopath                   0.1.9
itsdangerous             2.2.0
jieba                    0.42.1
Jinja2                   3.1.4
jmespath                 1.0.1
joblib                   1.4.2
kiwisolver               1.4.7
llamafactory             0.9.1.dev0         
lvis                     0.5.3
lxml                     5.3.0
Markdown                 3.7
markdown-it-py           3.0.0
MarkupSafe               2.1.5
matplotlib               3.9.2
mdurl                    0.1.2
modelscope               1.18.1
more-itertools           10.5.0
mpmath                   1.3.0
msgpack                  1.1.0
mss                      9.0.2
multidict                6.1.0
multiprocess             0.70.16
networkx                 3.4.2
ninja                    1.11.1.1
nltk                     3.9.1
numpy                    1.26.4
nvidia-cublas-cu11       11.11.3.6
nvidia-cuda-cupti-cu11   11.8.87
nvidia-cuda-nvrtc-cu11   11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cudnn-cu11        9.1.0.70
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.3.0.86
nvidia-cusolver-cu11     11.4.1.48
nvidia-cusparse-cu11     11.7.5.86
nvidia-nccl-cu11         2.21.5
nvidia-nvtx-cu11         11.8.86
opencv-python            4.10.0.84
opencv-python-headless   4.10.0.84
openpyxl                 3.1.5
orjson                   3.10.10
packaging                24.1
pandas                   2.2.3
peft                     0.12.0
pillow                   10.4.0
pip                      23.0.1
portalocker              2.10.1
propcache                0.2.0
protobuf                 5.28.3
psutil                   6.1.0
psycopg2-binary          2.9.10
py-cpuinfo               9.0.0
pyarrow                  17.0.0
pybind11                 2.13.6
pycocotools              2.0.8
pycparser                2.22
pycryptodome             3.21.0
pydantic                 2.9.2
pydantic_core            2.23.4
pydub                    0.25.1
Pygments                 2.18.0
pyparsing                3.2.0
python-dateutil          2.9.0.post0
python-multipart         0.0.12
pytz                     2024.2
PyYAML                   6.0.2
redis                    5.1.1
regex                    2024.9.11
requests                 2.32.3
rich                     13.9.3
rouge-chinese            1.0.3
ruff                     0.7.0
s3transfer               0.10.3
safetensors              0.4.5
scipy                    1.14.1
semantic-version         2.10.0
sentencepiece            0.2.0
setuptools               65.5.0
shellingham              1.5.4
shtab                    1.7.1
six                      1.16.0
sniffio                  1.3.1
soundfile                0.12.1
sse-starlette            2.1.3
starlette                0.41.0
sympy                    1.13.1
tabulate                 0.9.0
tensorboard              2.18.0
tensorboard-data-server  0.7.2
tensorboardX             2.6.2.2
termcolor                2.5.0
tiktoken                 0.8.0
timm                     1.0.11
tokenizers               0.20.1
tomlkit                  0.12.0
torch                    2.5.0+cu118
torchvision              0.20.0+cu118
tornado                  6.4.1
tqdm                     4.66.5
transformers             4.45.2
triton                   3.1.0
trl                      0.9.6
typeguard                4.3.0
typer                    0.12.5
typing_extensions        4.12.2
tyro                     0.8.14
tzdata                   2024.2
urllib3                  2.2.3
uvicorn                  0.32.0
wcwidth                  0.2.13
websockets               12.0
Werkzeug                 3.0.4
wheel                    0.44.0
xxhash                   3.5.0
yacs                     0.1.8
yarl                     1.16.0

[notice] A new release of pip is available: 23.0.1 -> 24.2
[notice] To update, run: pip3 install --upgrade pip
root@a8d11a6ed539:/home/bwang/LLaMA-Factory# python -m torch.utils.collect_env
/usr/local/lib/python3.10/runpy.py:126: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Collecting environment information...
PyTorch version: 2.5.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.10.13 (main, Oct 25 2024, 01:16:56) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-189-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 535.183.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      43 bits physical, 48 bits virtual
CPU(s):                             16
On-line CPU(s) list:                0-15
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          AuthenticAMD
CPU family:                         23
Model:                              8
Model name:                         AMD Ryzen 7 2700X Eight-Core Processor
Stepping:                           2
Frequency boost:                    enabled
CPU MHz:                            2169.666
CPU max MHz:                        3700.0000
CPU min MHz:                        2200.0000
BogoMIPS:                           7399.01
Virtualization:                     AMD-V
L1d cache:                          256 KiB
L1i cache:                          512 KiB
L2 cache:                           4 MiB
L3 cache:                           16 MiB
NUMA node0 CPU(s):                  0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Vulnerable
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev sev_es

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.5.0+cu118
[pip3] torchvision==0.20.0+cu118
[pip3] triton==3.1.0
[conda] Could not collect

Reproduction

root@a8d11a6ed539:/home/bwang/LLaMA-Factory# llamafactory-cli train examples/train_full/qwen2vl_full_sft.yaml
[2024-10-25 08:07:24,731] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
10/25/2024 08:07:27 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:26821
W1025 08:07:29.044000 1598 site-packages/torch/distributed/run.py:793]
W1025 08:07:29.044000 1598 site-packages/torch/distributed/run.py:793] *****************************************
W1025 08:07:29.044000 1598 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1025 08:07:29.044000 1598 site-packages/torch/distributed/run.py:793] *****************************************
[2024-10-25 08:07:33,026] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-25 08:07:33,048] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-25 08:07:34,512] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-25 08:07:34,519] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-25 08:07:34,519] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
10/25/2024 08:07:34 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
10/25/2024 08:07:34 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
2024-10-25 08:07:34,909 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
2024-10-25 08:07:34,924 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
Downloading [model-00001-of-00005.safetensors]: 100%|██████████████████████████████████████| 3.63G/3.63G [03:49<00:00, 17.0MB/s]
Downloading [model-00001-of-00005.safetensors]:  96%|████████████████████████████████████▌ | 3.49G/3.63G [03:54<00:13, 11.7MB/s]2024-10-25 08:11:29,442 - modelscope - ERROR - File /root/.cache/modelscope/hub/._____temp/Qwen/Qwen2-VL-7B-Instruct/model-00001-of-00005.safetensors integrity check failed, expected sha256 signature is eab4f4dc1abf860794c98ce3759b4ace1059fc0fc041ede76b88988b3557c132, actual is 6b202cc6ed076bca7afa72a1d774eb8fa87ae38827ebdf4bfdaa1331e7c9e7d9, the download may be incomplete, please try again.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank1]:     launch()
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]:     run_exp()
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 44, in run_sft
[rank1]:     tokenizer_module = load_tokenizer(model_args)
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/model/loader.py", line 68, in load_tokenizer
[rank1]:     init_kwargs = _get_init_kwargs(model_args)
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/model/loader.py", line 53, in _get_init_kwargs
[rank1]:     model_args.model_name_or_path = try_download_model_from_other_hub(model_args)
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/extras/misc.py", line 243, in try_download_model_from_other_hub
[rank1]:     return snapshot_download(
[rank1]:   File "/usr/local/lib/python3.10/site-packages/modelscope/hub/snapshot_download.py", line 84, in snapshot_download
[rank1]:     return _snapshot_download(
[rank1]:   File "/usr/local/lib/python3.10/site-packages/modelscope/hub/snapshot_download.py", line 242, in _snapshot_download
[rank1]:     _download_file_lists(
[rank1]:   File "/usr/local/lib/python3.10/site-packages/modelscope/hub/snapshot_download.py", line 420, in _download_file_lists
[rank1]:     download_file(url, repo_file, temporary_cache_dir, cache, headers,
[rank1]:   File "/usr/local/lib/python3.10/site-packages/modelscope/hub/file_download.py", line 612, in download_file
[rank1]:     file_integrity_validation(temp_file, file_meta[FILE_HASH])
[rank1]:   File "/usr/local/lib/python3.10/site-packages/modelscope/hub/utils/utils.py", line 94, in file_integrity_validation
[rank1]:     raise FileIntegrityError(msg)
[rank1]: modelscope.hub.errors.FileIntegrityError: File /root/.cache/modelscope/hub/._____temp/Qwen/Qwen2-VL-7B-Instruct/model-00001-of-00005.safetensors integrity check failed, expected sha256 signature is eab4f4dc1abf860794c98ce3759b4ace1059fc0fc041ede76b88988b3557c132, actual is 6b202cc6ed076bca7afa72a1d774eb8fa87ae38827ebdf4bfdaa1331e7c9e7d9, the download may be incomplete, please try again.
Downloading [model-00001-of-00005.safetensors]:  97%|████████████████████████████████████▋ | 3.50G/3.63G [03:55<00:08, 16.6MB/s]W1025 08:11:30.462000 1598 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1615 closing signal SIGTERM
E1025 08:11:31.078000 1598 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 1616) of binary: /usr/local/bin/python3
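Since the failure happens inside the trainer's automatic download, one workaround is to pre-fetch the model outside of training and simply retry on transient network errors. This is a minimal sketch (not part of LLaMA-Factory); the `snapshot_download` import path and the model ID are taken from the traceback above, and the generic `retry` helper is a hypothetical convenience:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def retry(fn: Callable[[], T], attempts: int = 3) -> T:
    """Call fn() up to `attempts` times, re-raising the last error if all fail."""
    last_err: Exception | None = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as err:  # FileIntegrityError is a plain Exception subclass
            last_err = err
    raise last_err

# Usage (requires network access; paths/IDs follow the log above):
# from modelscope.hub.snapshot_download import snapshot_download
# local_path = retry(lambda: snapshot_download("Qwen/Qwen2-VL-7B-Instruct"))
```

Once the snapshot is cached completely, pointing `model_name_or_path` at the local directory avoids re-downloading on every run.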

Expected behavior

I ran the command multiple times, but it failed every time while downloading the pretrained model from ModelScope. Another failure message was:

Others

No response
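The integrity error above prints both the expected and the actual sha256 of the shard, so a cached file can be checked by hand before retrying. A minimal sketch (the streaming read is so multi-gigabyte safetensors shards never need to fit in memory; the cache path would be the one from the error message):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the sha256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

If the digest of the downloaded shard does not match the expected value from the error, the file is truncated or corrupted and should be deleted before the next download attempt.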

flishwang commented 1 month ago

Sorry, this is a duplicate issue caused by network latency.