Noblezhong opened this issue 2 weeks ago
Thanks for reporting! We might have a regression due to transformers, cc @muellerzr. One thing you could do to help us fix the issue is to find the commit that broke the llama2 example in transformers, with git bisect for example. Would you like to try? Thanks!
Sorry, I didn't build transformers from source; I just installed it with pip, so I may not be able to use git bisect to find the broken commit. Instead, I tried the versions after v4.36.2 one by one to find the first one that shows the same bug. It turns out that upgrading to 4.38.0 is what triggers the computation-graph error:
(zt) root@autodl-container-d6a84aa389-74aaee7f:~/test/PiPPy/examples/llama# pip install transformers==4.38.0
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Collecting transformers==4.38.0
Downloading http://mirrors.aliyun.com/pypi/packages/91/89/5416dc364c7ef0711c564fd61a69b03d1e40eeb5c506c38e53ba8a969e79/transformers-4.38.0-py3-none-any.whl (8.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.5/8.5 MB 10.5 MB/s eta 0:00:00
Requirement already satisfied: filelock in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (3.16.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (0.26.2)
Requirement already satisfied: numpy>=1.17 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (2.1.3)
Requirement already satisfied: packaging>=20.0 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (24.1)
Requirement already satisfied: pyyaml>=5.1 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (6.0.2)
Requirement already satisfied: regex!=2019.12.17 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (2024.9.11)
Requirement already satisfied: requests in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (2.32.3)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (0.15.2)
Requirement already satisfied: safetensors>=0.4.1 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (0.4.5)
Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (4.66.6)
Requirement already satisfied: fsspec>=2023.5.0 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.19.3->transformers==4.38.0) (2024.10.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.19.3->transformers==4.38.0) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from requests->transformers==4.38.0) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from requests->transformers==4.38.0) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from requests->transformers==4.38.0) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from requests->transformers==4.38.0) (2024.8.30)
Installing collected packages: transformers
Attempting uninstall: transformers
Found existing installation: transformers 4.37.2
Uninstalling transformers-4.37.2:
Successfully uninstalled transformers-4.37.2
Successfully installed transformers-4.38.0
[rank1]: The above exception was the direct cause of the following exception:
[rank1]: Traceback (most recent call last):
[rank1]: File "/root/test/PiPPy/examples/llama/pippy_llama.py", line 36, in <module>
[rank1]: pipe = pipeline(llama, mb_args=(mb_inputs["input_ids"],), split_spec=split_spec)
[rank1]: File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1231, in pipeline
[rank1]: return Pipe.from_tracing(
[rank1]: File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1045, in from_tracing
[rank1]: exported_program = Pipe._trace_with_export(
[rank1]: File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1013, in _trace_with_export
[rank1]: raise RuntimeError(
[rank1]: RuntimeError: It seems that we cannot capture your model as a full graph. Typical reasons include graph breaks, data/shape-dependent control flow, or missing meta kernels for custom operators. You can use our manual pipeline interfaces, or try to fix the graph breaks, see https://pytorch.org/docs/stable/export.html
[rank0]:[W1107 09:35:56.334439421 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W1107 09:35:56.532652 2108 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2134 closing signal SIGTERM
E1107 09:35:56.699037 2108 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2133) of binary: /root/miniconda3/envs/zt/bin/python
So the breaking change probably landed in v4.38.0.
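For reference, the failing step can be reproduced without PiPPy at all. Here is a minimal sketch (assumptions: torch >= 2.4 and the Llama-3.2-3B path from the description; per the traceback, pipeline() appears to wrap torch.export in _trace_with_export):

# Minimal sketch: attempt the same full-graph capture that pipeline()
# performs internally. The model path is an assumption taken from the
# issue description, not from the actual script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-3.2-3B"  # assumed
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer("How do you", return_tensors="pt")["input_ids"]

try:
    # This is the capture that fails on transformers >= 4.38.0.
    exported = torch.export.export(model, (input_ids,))
    print("Full-graph capture succeeded.")
except Exception as err:
    print(f"Full-graph capture failed: {err}")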
System Info
Who can help?
pipelines: @Rocketknight1
Big Model Inference: @SunMarc
Hi! I am a master's student interested in pipeline parallelism for LLM inference. I have successfully run the llama2 example in the PiPPy repo, so I want to modify this code further to support the Llama3 series models, especially Llama-3.2-3B. But when I run the code with only the model and tokenizer paths changed, it fails with the error shown above (the "cannot capture your model as a full graph" RuntimeError).
That's the same problem I hit when running this example with the Llama2 model, and I fixed it there by downgrading transformers to 4.36.2. But when I apply this workaround for Llama3, it seems that version doesn't support the newest Llama models.
So how can I fix this? I am not good at debugging this kind of issue. :(
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Clone the repo and install the related dependencies.
2. Go to the llama directory and run pippy_llama.py:
torchrun --nproc-per-node 2 pippy_llama.py
Here is the code I modified:
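(The actual snippet was not captured in this thread. Below is a minimal sketch of what the modification looks like, assuming, per the description above, that only the model/tokenizer path of the PiPPy llama example was changed to Llama-3.2-3B; the prompts and split layout are illustrative, and the final line matches the call shown in the traceback.)

# Illustrative sketch of the modified pippy_llama.py; the model path,
# prompts, and split layout are assumptions, not the reporter's exact code.
import os
import torch
import torch.distributed as dist
from torch.distributed.pipelining import SplitPoint, pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
dist.init_process_group(rank=rank, world_size=world_size)

model_path = "meta-llama/Llama-3.2-3B"  # changed from the Llama2 path
llama = AutoModelForCausalLM.from_pretrained(model_path)
llama.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
mb_inputs = tokenizer(["How do you", "I like to"], return_tensors="pt", padding=True)

# Split at decoder-layer boundaries: one pipeline stage per rank.
layers_per_rank = llama.config.num_hidden_layers // world_size
split_spec = {
    f"model.layers.{i * layers_per_rank}": SplitPoint.BEGINNING
    for i in range(1, world_size)
}

# The call below is line 36 of the traceback; it raises the RuntimeError
# above on transformers >= 4.38.0.
pipe = pipeline(llama, mb_args=(mb_inputs["input_ids"],), split_spec=split_spec)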
Expected behavior
Just the output of one decoding iteration of the LLM.