Closed: flishwang closed this issue 1 month ago.
System Info

python -m pip list

Package                  Version            Editable project location
------------------------ ------------------ -------------------------
absl-py  2.1.0
accelerate  0.34.2
aiofiles  23.2.1
aiohappyeyeballs  2.4.3
aiohttp  3.10.10
aiosignal  1.3.1
annotated-types  0.7.0
anyio  4.6.2.post1
anytree  2.12.1
apex  0.1
async-timeout  4.0.3
attrs  24.2.0
av  13.1.0
blinker  1.8.2
boto3  1.35.46
botocore  1.35.46
cachetools  5.5.0
certifi  2024.8.30
cffi  1.17.1
charset-normalizer  3.4.0
click  8.1.7
contourpy  1.3.0
cycler  0.12.1
Cython  3.0.11
dataclasses  0.6
datasets  2.21.0
deepspeed  0.15.3
dill  0.3.8
docstring_parser  0.16
easydict  1.13
einops  0.8.0
et-xmlfile  1.1.0
exceptiongroup  1.2.2
fastapi  0.115.3
ffmpy  0.4.0
filelock  3.16.1
fire  0.7.0
flash-attn  2.6.3
Flask  3.0.3
fonttools  4.54.1
frozenlist  1.5.0
fsspec  2024.6.1
ftfy  6.3.0
fvcore  0.1.5.post20221221
gradio  4.44.1
gradio_client  1.3.0
grpcio  1.67.0
h11  0.14.0
hiredis  3.0.0
hjson  3.1.0
hnswlib  0.8.0
httpcore  1.0.6
httpx  0.27.2
huggingface-hub  0.26.1
idna  3.10
importlib_resources  6.4.5
inflect  7.4.0
iopath  0.1.9
itsdangerous  2.2.0
jieba  0.42.1
Jinja2  3.1.4
jmespath  1.0.1
joblib  1.4.2
kiwisolver  1.4.7
llamafactory  0.9.1.dev0
lvis  0.5.3
lxml  5.3.0
Markdown  3.7
markdown-it-py  3.0.0
MarkupSafe  2.1.5
matplotlib  3.9.2
mdurl  0.1.2
modelscope  1.18.1
more-itertools  10.5.0
mpmath  1.3.0
msgpack  1.1.0
mss  9.0.2
multidict  6.1.0
multiprocess  0.70.16
networkx  3.4.2
ninja  1.11.1.1
nltk  3.9.1
numpy  1.26.4
nvidia-cublas-cu11  11.11.3.6
nvidia-cuda-cupti-cu11  11.8.87
nvidia-cuda-nvrtc-cu11  11.8.89
nvidia-cuda-runtime-cu11  11.8.89
nvidia-cudnn-cu11  9.1.0.70
nvidia-cufft-cu11  10.9.0.58
nvidia-curand-cu11  10.3.0.86
nvidia-cusolver-cu11  11.4.1.48
nvidia-cusparse-cu11  11.7.5.86
nvidia-nccl-cu11  2.21.5
nvidia-nvtx-cu11  11.8.86
opencv-python  4.10.0.84
opencv-python-headless  4.10.0.84
openpyxl  3.1.5
orjson  3.10.10
packaging  24.1
pandas  2.2.3
peft  0.12.0
pillow  10.4.0
pip  23.0.1
portalocker  2.10.1
propcache  0.2.0
protobuf  5.28.3
psutil  6.1.0
psycopg2-binary  2.9.10
py-cpuinfo  9.0.0
pyarrow  17.0.0
pybind11  2.13.6
pycocotools  2.0.8
pycparser  2.22
pycryptodome  3.21.0
pydantic  2.9.2
pydantic_core  2.23.4
pydub  0.25.1
Pygments  2.18.0
pyparsing  3.2.0
python-dateutil  2.9.0.post0
python-multipart  0.0.12
pytz  2024.2
PyYAML  6.0.2
redis  5.1.1
regex  2024.9.11
requests  2.32.3
rich  13.9.3
rouge-chinese  1.0.3
ruff  0.7.0
s3transfer  0.10.3
safetensors  0.4.5
scipy  1.14.1
semantic-version  2.10.0
sentencepiece  0.2.0
setuptools  65.5.0
shellingham  1.5.4
shtab  1.7.1
six  1.16.0
sniffio  1.3.1
soundfile  0.12.1
sse-starlette  2.1.3
starlette  0.41.0
sympy  1.13.1
tabulate  0.9.0
tensorboard  2.18.0
tensorboard-data-server  0.7.2
tensorboardX  2.6.2.2
termcolor  2.5.0
tiktoken  0.8.0
timm  1.0.11
tokenizers  0.20.1
tomlkit  0.12.0
torch  2.5.0+cu118
torchvision  0.20.0+cu118
tornado  6.4.1
tqdm  4.66.5
transformers  4.45.2
triton  3.1.0
trl  0.9.6
typeguard  4.3.0
typer  0.12.5
typing_extensions  4.12.2
tyro  0.8.14
tzdata  2024.2
urllib3  2.2.3
uvicorn  0.32.0
wcwidth  0.2.13
websockets  12.0
Werkzeug  3.0.4
wheel  0.44.0
xxhash  3.5.0
yacs  0.1.8
yarl  1.16.0

[notice] A new release of pip is available: 23.0.1 -> 24.2
[notice] To update, run: pip3 install --upgrade pip

root@a8d11a6ed539:/home/bwang/LLaMA-Factory# python -m torch.utils.collect_env
/usr/local/lib/python3.10/runpy.py:126: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Collecting environment information...
PyTorch version: 2.5.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.10.13 (main, Oct 25 2024, 01:16:56) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-189-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
Nvidia driver version: 535.183.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 8
Model name: AMD Ryzen 7 2700X Eight-Core Processor
Stepping: 2
Frequency boost: enabled
CPU MHz: 2169.666
CPU max MHz: 3700.0000
CPU min MHz: 2200.0000
BogoMIPS: 7399.01
Virtualization: AMD-V
L1d cache: 256 KiB
L1i cache: 512 KiB
L2 cache: 4 MiB
L3 cache: 16 MiB
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Vulnerable
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev sev_es

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.5.0+cu118
[pip3] torchvision==0.20.0+cu118
[pip3] triton==3.1.0
[conda] Could not collect
Reproduction

root@a8d11a6ed539:/home/bwang/LLaMA-Factory# llamafactory-cli train examples/train_full/qwen2vl_full_sft.yaml
[2024-10-25 08:07:24,731] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
10/25/2024 08:07:27 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:26821
W1025 08:07:29.044000 1598 site-packages/torch/distributed/run.py:793]
W1025 08:07:29.044000 1598 site-packages/torch/distributed/run.py:793] *****************************************
W1025 08:07:29.044000 1598 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1025 08:07:29.044000 1598 site-packages/torch/distributed/run.py:793] *****************************************
[2024-10-25 08:07:33,026] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-25 08:07:33,048] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-25 08:07:34,512] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-25 08:07:34,519] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-25 08:07:34,519] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
10/25/2024 08:07:34 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
10/25/2024 08:07:34 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
2024-10-25 08:07:34,909 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
2024-10-25 08:07:34,924 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
Downloading [model-00001-of-00005.safetensors]: 100%|██████████████████████████████████████| 3.63G/3.63G [03:49<00:00, 17.0MB/s]
Downloading [model-00001-of-00005.safetensors]:  96%|████████████████████████████████████▌ | 3.49G/3.63G [03:54<00:13, 11.7MB/s]
2024-10-25 08:11:29,442 - modelscope - ERROR - File /root/.cache/modelscope/hub/._____temp/Qwen/Qwen2-VL-7B-Instruct/model-00001-of-00005.safetensors integrity check failed, expected sha256 signature is eab4f4dc1abf860794c98ce3759b4ace1059fc0fc041ede76b88988b3557c132, actual is 6b202cc6ed076bca7afa72a1d774eb8fa87ae38827ebdf4bfdaa1331e7c9e7d9, the download may be incomplete, please try again.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank1]:     launch()
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]:     run_exp()
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 44, in run_sft
[rank1]:     tokenizer_module = load_tokenizer(model_args)
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/model/loader.py", line 68, in load_tokenizer
[rank1]:     init_kwargs = _get_init_kwargs(model_args)
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/model/loader.py", line 53, in _get_init_kwargs
[rank1]:     model_args.model_name_or_path = try_download_model_from_other_hub(model_args)
[rank1]:   File "/home/bwang/LLaMA-Factory/src/llamafactory/extras/misc.py", line 243, in try_download_model_from_other_hub
[rank1]:     return snapshot_download(
[rank1]:   File "/usr/local/lib/python3.10/site-packages/modelscope/hub/snapshot_download.py", line 84, in snapshot_download
[rank1]:     return _snapshot_download(
[rank1]:   File "/usr/local/lib/python3.10/site-packages/modelscope/hub/snapshot_download.py", line 242, in _snapshot_download
[rank1]:     _download_file_lists(
[rank1]:   File "/usr/local/lib/python3.10/site-packages/modelscope/hub/snapshot_download.py", line 420, in _download_file_lists
[rank1]:     download_file(url, repo_file, temporary_cache_dir, cache, headers,
[rank1]:   File "/usr/local/lib/python3.10/site-packages/modelscope/hub/file_download.py", line 612, in download_file
[rank1]:     file_integrity_validation(temp_file, file_meta[FILE_HASH])
[rank1]:   File "/usr/local/lib/python3.10/site-packages/modelscope/hub/utils/utils.py", line 94, in file_integrity_validation
[rank1]:     raise FileIntegrityError(msg)
[rank1]: modelscope.hub.errors.FileIntegrityError: File /root/.cache/modelscope/hub/._____temp/Qwen/Qwen2-VL-7B-Instruct/model-00001-of-00005.safetensors integrity check failed, expected sha256 signature is eab4f4dc1abf860794c98ce3759b4ace1059fc0fc041ede76b88988b3557c132, actual is 6b202cc6ed076bca7afa72a1d774eb8fa87ae38827ebdf4bfdaa1331e7c9e7d9, the download may be incomplete, please try again.
Downloading [model-00001-of-00005.safetensors]:  97%|████████████████████████████████████▋ | 3.50G/3.63G [03:55<00:08, 16.6MB/s]
W1025 08:11:30.462000 1598 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1615 closing signal SIGTERM
E1025 08:11:31.078000 1598 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 1616) of binary: /usr/local/bin/python3
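The failure above is a FileIntegrityError raised inside ModelScope's snapshot_download when the sha256 check of a partially downloaded shard does not match. A minimal workaround sketch is to pre-download the checkpoint with a small retry loop and then point model_name_or_path at the resulting local directory; the model id is taken from the log above, while the retry count, back-off interval, and script layout are illustrative assumptions, not part of the original report.

import time

from modelscope.hub.snapshot_download import snapshot_download
from modelscope.hub.errors import FileIntegrityError

# Model id taken from the failing log above.
MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"

def download_with_retries(model_id: str, max_retries: int = 5) -> str:
    """Retry ModelScope's snapshot_download until the integrity check passes."""
    for attempt in range(1, max_retries + 1):
        try:
            # snapshot_download returns the local directory holding the snapshot.
            return snapshot_download(model_id)
        except FileIntegrityError as err:
            print(f"attempt {attempt}/{max_retries} failed the integrity check: {err}")
            time.sleep(10)  # short back-off before retrying the incomplete shard
    raise RuntimeError(f"could not download {model_id} after {max_retries} attempts")

if __name__ == "__main__":
    local_dir = download_with_retries(MODEL_ID)
    # Point model_name_or_path in examples/train_full/qwen2vl_full_sft.yaml at this
    # directory so llamafactory-cli does not need to re-download at train time.
    print(local_dir)

If the check keeps failing, removing the temporary cache directory shown in the error message before retrying may also help, since the message itself indicates the shard was downloaded incompletely.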
I ran the commands multiple times, but each time the run failed while downloading the pretrained model from ModelScope. Another failure message was:
Expected behavior

No response

Others

No response

Sorry, duplicated issue due to network latency.