huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

RuntimeError: Number of ranks is larger than number of stages, some ranks are unused #2497

Closed: sayakpaul closed this issue 5 months ago

sayakpaul commented 8 months ago

TL;DR: I'm running into the following error when trying to perform inference on a UNet from diffusers with PiPPY:

Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
Traceback (most recent call last):
  File "/home/sayak/diffusers/check_pippy_diffusers.py", line 18, in <module>
    unet = prepare_pippy(unet, split_points="auto", example_kwargs=(inputs))
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/inference.py", line 161, in prepare_pippy
    stage = build_pipeline(model, split_points, example_args, example_kwargs, num_chunks)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/inference.py", line 83, in build_pipeline
    stage = PipelineStage(pipe, state.local_process_index, device=state.device)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/pippy/PipelineStage.py", line 71, in __init__
    raise RuntimeError(
RuntimeError: Number of ranks is larger than number of stages, some ranks are unused
Traceback (most recent call last):
  File "/home/sayak/diffusers/check_pippy_diffusers.py", line 18, in <module>
    unet = prepare_pippy(unet, split_points="auto", example_kwargs=(inputs))
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/inference.py", line 161, in prepare_pippy
    stage = build_pipeline(model, split_points, example_args, example_kwargs, num_chunks)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/inference.py", line 83, in build_pipeline
    stage = PipelineStage(pipe, state.local_process_index, device=state.device)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/pippy/PipelineStage.py", line 71, in __init__
    raise RuntimeError(
RuntimeError: Number of ranks is larger than number of stages, some ranks are unused
[2024-02-27 09:31:58,685] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2847563) of binary: /home/sayak/.pyenv/versions/3.10.12/envs/diffusers/bin/python3.10
Traceback (most recent call last):
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1016, in launch_command
    multi_gpu_launcher(args)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
check_pippy_diffusers.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-02-27_09:31:58
  host      : audace
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2847564)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-27_09:31:58
  host      : audace
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2847563)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
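
For context, the message is literal: the launch creates two ranks (num_processes: 2 in the config below), but auto-splitting the UNet apparently yields fewer pipeline stages than ranks, so PipelineStage aborts. A minimal sketch of the condition behind the error as I read it (the stage count of 1 is an assumption, not something the traceback shows):

num_ranks = 2   # accelerate launches one process per GPU (num_processes: 2)
num_stages = 1  # assumed: what split_points="auto" produced for the UNet
if num_ranks > num_stages:
    raise RuntimeError(
        "Number of ranks is larger than number of stages, some ranks are unused"
    )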

Setup

watch nvidia-smi:

Tue Feb 27 09:33:18 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0 Off |                  Off |
|  0%   52C    P5              50W / 600W |      3MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:13:00.0 Off |                  Off |
|  0%   52C    P5              46W / 600W |      3MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
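
Note that num_processes: 2 launches one rank per 4090, and each rank needs its own pipeline stage. As a sanity check (my assumption, not something verified above), the script can also be launched with a single process to see whether the failure is tied to the rank count:

accelerate launch --num_processes 1 check_pippy_diffusers.py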

diffusers-cli env:

- `diffusers` version: 0.27.0.dev0
- Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- Huggingface_hub version: 0.20.2
- Transformers version: 4.39.0.dev0
- Accelerate version: 0.28.0.dev0
- xFormers version: 0.0.24
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes (I guess)

Script

from diffusers import UNet2DConditionModel
from accelerate import PartialState, prepare_pippy
import torch 
import time

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    subfolder="unet",
    variant="fp16",
    torch_dtype=torch.float16,
).eval()

inputs = {
    "sample": torch.randn(1, 4, 64, 64, dtype=torch.float16),
    "encoder_hidden_states": torch.randn(1, 77, 768, dtype=torch.float16),
    "timestep": torch.randint(0, 1000, size=(1, ))
}
unet = prepare_pippy(unet, split_points="auto", example_kwargs=(inputs))

# Move the inputs to the first device
inputs = {k: v.to("cuda:0") for k, v in inputs.items()}

# Take an average of 5 times
# Measure first batch
torch.cuda.synchronize()
start_time = time.time()
with torch.no_grad():
    output = unet(**inputs)
torch.cuda.synchronize()
end_time = time.time()
first_batch = end_time - start_time

# Now that CUDA is init, measure after
torch.cuda.synchronize()
start_time = time.time()
for i in range(5):
    with torch.no_grad():
        output = unet(**inputs)
torch.cuda.synchronize()
end_time = time.time()

# The outputs are only on the final process by default
if PartialState().is_last_process:
    output = torch.stack(tuple(output.sample))
    print(f"Time of first pass: {first_batch}")
    print(f"Average time per batch: {(end_time - start_time)/5}")

Run using:

accelerate launch check_pippy_diffusers.py
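
(One variant that may be worth trying, sketched here with a hypothetical split point and not confirmed to avoid the error: pass explicit split points instead of "auto" so the number of stages matches the two ranks.)

# Hypothetical: split once at the UNet's mid_block so two stages are produced,
# one per rank. The name must correspond to a module in unet.named_modules().
unet = prepare_pippy(
    unet,
    split_points=["mid_block"],
    example_kwargs=inputs,
)
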
muellerzr commented 8 months ago

Forwarding to @kwen2501: I think this is an issue with UNets not working when the model is split via tracing? (I tried a large number of different split points.)
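
(A rough illustration of what exploring split points could look like; not from the original report. The top-level child names of the UNet are the natural candidates to pass as split_points:)

# List candidate split points from the model itself; any of these names could
# be handed to prepare_pippy(..., split_points=[...]) instead of "auto".
for name, _ in unet.named_children():
    print(name)  # e.g. conv_in, down_blocks, mid_block, up_blocks, conv_out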

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sayakpaul commented 6 months ago

Re-opening this, as I don't think it has been resolved yet.

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.