huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

AssertionError for Pytorch PiPPy example #34600

Open Noblezhong opened 2 weeks ago

Noblezhong commented 2 weeks ago

System Info

(zt) root@autodl-container-7071118252-7032359d:~/test/PiPPy/examples/llama# transformers-cli env

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.44.0
- Platform: Linux-5.4.0-126-generic-x86_64-with-glibc2.35
- Python version: 3.10.0
- Huggingface_hub version: 0.26.2
- Safetensors version: 0.4.5
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA GeForce RTX 3090

Who can help?

pipelines: @Rocketknight1 Big Model Inference: @SunMarc

Hi! I am a master's student interested in pipeline parallelism for LLM inference. I have successfully run the Llama 2 example in the PiPPy repo, so I wanted to modify that code to support Llama 3 series models, especially Llama-3.2-3B. But when I run the code with only the model and tokenizer paths changed, it fails with the following error:

(zt) root@autodl-container-7071118252-7032359d:~/test/PiPPy/examples/llama# torchrun --nproc-per-node 2 pippy_llama.py
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:02<00:00,  1.09s/it]
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((3072,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=3072, out_features=128256, bias=False)
)
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:02<00:00,  1.15s/it]
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((3072,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=3072, out_features=128256, bias=False)
)
layers_per_rank = 14
layers_per_rank = 14
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/test/PiPPy/examples/llama/pippy_llama.py", line 36, in <module>
[rank0]:     pipe = pipeline(llama, mb_args=(mb_inputs["input_ids"],))
[rank0]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1238, in pipeline
[rank0]:     return Pipe.from_tracing(
[rank0]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1051, in from_tracing
[rank0]:     pipe = Pipe._from_traced(
[rank0]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 750, in _from_traced
[rank0]:     new_submod = _outline_submodules(submodule.graph)
[rank0]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_unflatten.py", line 24, in _outline_submodules
[rank0]:     ).run_outer()
[rank0]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 1014, in run_outer
[rank0]:     self.run_from(node_idx)
[rank0]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 1094, in run_from
[rank0]:     ).run_from(node_idx)
[rank0]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 1094, in run_from
[rank0]:     ).run_from(node_idx)
[rank0]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 1043, in run_from
[rank0]:     self.finalize_outputs()
[rank0]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 993, in finalize_outputs
[rank0]:     _verify_graph_equivalence(self.cached_graph_module, self.module)
[rank0]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 655, in _verify_graph_equivalence
[rank0]:     assert graph_dump(x.graph) == graph_dump(y.graph)
[rank0]: AssertionError
[rank0]:[W1104 21:21:40.765172753 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/test/PiPPy/examples/llama/pippy_llama.py", line 36, in <module>
[rank1]:     pipe = pipeline(llama, mb_args=(mb_inputs["input_ids"],))
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1238, in pipeline
[rank1]:     return Pipe.from_tracing(
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1051, in from_tracing
[rank1]:     pipe = Pipe._from_traced(
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 750, in _from_traced
[rank1]:     new_submod = _outline_submodules(submodule.graph)
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_unflatten.py", line 24, in _outline_submodules
[rank1]:     ).run_outer()
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 1014, in run_outer
[rank1]:     self.run_from(node_idx)
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 1094, in run_from
[rank1]:     ).run_from(node_idx)
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 1094, in run_from
[rank1]:     ).run_from(node_idx)
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 1043, in run_from
[rank1]:     self.finalize_outputs()
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 993, in finalize_outputs
[rank1]:     _verify_graph_equivalence(self.cached_graph_module, self.module)
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/export/unflatten.py", line 655, in _verify_graph_equivalence
[rank1]:     assert graph_dump(x.graph) == graph_dump(y.graph)
[rank1]: AssertionError
W1104 21:21:41.688867 2513 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2540 closing signal SIGTERM
E1104 21:21:42.054025 2513 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2539) of binary: /root/miniconda3/envs/zt/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/zt/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pippy_llama.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-04_21:21:41
  host      : autodl-container-7071118252-7032359d
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2539)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

This is the same problem I hit when running this example with the Llama 2 model, and I fixed it then by downgrading transformers to 4.36.2. But when I apply the same workaround for Llama 3, the older version does not support the newest Llama models:

(zt) root@autodl-container-7071118252-7032359d:~/test/PiPPy/examples/llama# torchrun --nproc-per-node 2 pippy_llama.py
Traceback (most recent call last):
  File "/root/test/PiPPy/examples/llama/pippy_llama.py", line 8, in <module>
    llama = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 526, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1124, in from_pretrained
    return config_class.from_dict(config_dict, **unused_kwargs)
  File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/transformers/configuration_utils.py", line 764, in from_dict
    config = cls(**config_dict)
  File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py", line 160, in __init__
    self._rope_scaling_validation()
  File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/transformers/models/llama/configuration_llama.py", line 180, in _rope_scaling_validation
    raise ValueError(
ValueError: `rope_scaling` must be a dictionary with with two fields, `type` and `factor`, got {'factor': 32.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
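
For reference, the rope_scaling dictionary in this error is exactly what the Llama-3.2-3B config contains; a minimal sketch to inspect it (assuming the same local model path as in my script below):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("/root/autodl-tmp/model/Llama-3.2-3B", local_files_only=True)
print(cfg.rope_scaling)
# On transformers 4.44.0 this prints the 'llama3' rope_type dict shown in the
# error above; on 4.36.2 the same call already fails inside
# _rope_scaling_validation, before any PiPPy code runs.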

So how can I fix this? I am not good at debugging this kind of bug. :(

Information

Tasks

Reproduction

1. Clone the repo and install the related dependencies

git clone https://github.com/pytorch/PiPPy.git
pip install -r requirements.txt

2. Go to the llama directory and run pippy_llama.py

torchrun --nproc-per-node 2 pippy_llama.py

Here is the code I modified:

# $ torchrun --nproc-per-node 4 pippy_llama.py
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.distributed.pipelining import SplitPoint, pipeline, ScheduleGPipe

# Grab the model
llama = AutoModelForCausalLM.from_pretrained(
    "/root/autodl-tmp/model/Llama-3.2-3B", local_files_only= True
)
print(llama)

tokenizer = AutoTokenizer.from_pretrained("/root/autodl-tmp/model/Llama-3.2-3B", local_files_only=True)
tokenizer.pad_token = tokenizer.eos_token
mb_prompts = (
    "How do you", "I like to",
)  # microbatch size = 2

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
torch.distributed.init_process_group(rank=rank, world_size=world_size)
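# RANK and WORLD_SIZE are provided by torchrun; with --nproc-per-node 2 this
# gives world_size = 2 (one pipeline stage per rank).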

llama.to(device).eval()

# Cut model by equal number of layers per rank
layers_per_rank = llama.config.num_hidden_layers // world_size
print(f"layers_per_rank = {layers_per_rank}")
split_spec = {
    f"model.layers.{i * layers_per_rank}": SplitPoint.BEGINNING
    for i in range(1, world_size)
}
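# For Llama-3.2-3B on 2 ranks: 28 hidden layers // 2 = 14, so the only split
# point is "model.layers.14" (this matches the "layers_per_rank = 14" lines
# printed in the log above).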

# Create a pipeline representation from the model
mb_inputs = tokenizer(mb_prompts, return_tensors="pt", padding=True).to(device)
pipe = pipeline(llama, mb_args=(mb_inputs["input_ids"],))
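# Note: this is the call that raises the AssertionError at pippy_llama.py line 36
# in the traceback above. split_spec is built above but not passed here; the
# 4.38.0 traceback further below was produced with split_spec=split_spec added
# to this call.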

# Create pipeline stage for each rank
stage = pipe.build_stage(rank, device=device)

# Run time inputs
full_batch_prompts = (
    "How do you", "I like to", "Can I help", "You need to",
    "The weather is", "I found a", "What is your", "You are so",
)  # full batch size = 8
inputs = tokenizer(full_batch_prompts, return_tensors="pt", padding=True).to(device)

# Attach to a schedule
# number of microbatches = 8 // 2 = 4
num_mbs = 4
schedule = ScheduleGPipe(stage, num_mbs)

# Run
if rank == 0:
    args = inputs["input_ids"]
else:
    args = None

output = schedule.step(args)

# Decode
if output is not None:
    next_token_logits = output[0][:, -1, :]
    next_token = torch.argmax(next_token_logits, dim=-1)
    print(tokenizer.batch_decode(next_token))

Expected behavior

Just the output of one decoding iteration of the LLM.

Outputs:
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']
SunMarc commented 2 weeks ago

Thanks for reporting! We might have some regression in transformers cc @muellerzr. One thing you could do to help us fix the issue is to find the commit that broke the llama2 example in transformers, with git bisect for example. Would you like to try? Thanks!

Noblezhong commented 2 weeks ago

Sorry, I did not build transformers from source; I just installed it with pip, so I probably cannot use git bisect to help you find the broken commit. Instead I tried the releases after v4.36.2 one by one to find the first version that shows the same bug. It seems that when I upgrade to 4.38.0, the computation-graph error appears:

(zt) root@autodl-container-d6a84aa389-74aaee7f:~/test/PiPPy/examples/llama# pip install transformers==4.38.0
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Collecting transformers==4.38.0
  Downloading http://mirrors.aliyun.com/pypi/packages/91/89/5416dc364c7ef0711c564fd61a69b03d1e40eeb5c506c38e53ba8a969e79/transformers-4.38.0-py3-none-any.whl (8.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.5/8.5 MB 10.5 MB/s eta 0:00:00
Requirement already satisfied: filelock in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (3.16.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (0.26.2)
Requirement already satisfied: numpy>=1.17 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (2.1.3)
Requirement already satisfied: packaging>=20.0 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (24.1)
Requirement already satisfied: pyyaml>=5.1 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (6.0.2)
Requirement already satisfied: regex!=2019.12.17 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (2024.9.11)
Requirement already satisfied: requests in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (2.32.3)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (0.15.2)
Requirement already satisfied: safetensors>=0.4.1 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (0.4.5)
Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from transformers==4.38.0) (4.66.6)
Requirement already satisfied: fsspec>=2023.5.0 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.19.3->transformers==4.38.0) (2024.10.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.19.3->transformers==4.38.0) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from requests->transformers==4.38.0) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from requests->transformers==4.38.0) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from requests->transformers==4.38.0) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/envs/zt/lib/python3.10/site-packages (from requests->transformers==4.38.0) (2024.8.30)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.37.2
    Uninstalling transformers-4.37.2:
      Successfully uninstalled transformers-4.37.2
Successfully installed transformers-4.38.0
[rank1]: The above exception was the direct cause of the following exception:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/test/PiPPy/examples/llama/pippy_llama.py", line 36, in <module>
[rank1]:     pipe = pipeline(llama, mb_args=(mb_inputs["input_ids"],), split_spec=split_spec)
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1231, in pipeline
[rank1]:     return Pipe.from_tracing(
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1045, in from_tracing
[rank1]:     exported_program = Pipe._trace_with_export(
[rank1]:   File "/root/miniconda3/envs/zt/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1013, in _trace_with_export
[rank1]:     raise RuntimeError(
[rank1]: RuntimeError: It seems that we cannot capture your model as a full graph. Typical reasons include graph breaks, data/shape-dependent control flow, or missing meta kernels for custom operators. You can use our manual pipeline interfaces, or try to fix the graph breaks, see https://pytorch.org/docs/stable/export.html
[rank0]:[W1107 09:35:56.334439421 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W1107 09:35:56.532652 2108 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2134 closing signal SIGTERM
E1107 09:35:56.699037 2108 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2133) of binary: /root/miniconda3/envs/zt/bin/python
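
For completeness, the brute-force search I did looks roughly like this (just a sketch, not the exact commands I ran; the list of versions is illustrative):

import subprocess
import sys

# Install each transformers release after v4.36.2 in turn and re-run the
# example, stopping at the first version that reproduces the failure.
for version in ["4.37.0", "4.37.2", "4.38.0"]:
    subprocess.run(
        [sys.executable, "-m", "pip", "install", f"transformers=={version}"],
        check=True,
    )
    result = subprocess.run(["torchrun", "--nproc-per-node", "2", "pippy_llama.py"])
    status = "ok" if result.returncode == 0 else f"failed (exit code {result.returncode})"
    print(f"transformers {version}: {status}")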

Maybe the breaking change was introduced in v4.38.0.