microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0
1.76k stars 163 forks

Attempting to flush sequence N which does not exist #497

Open aagontuk opened 1 week ago

aagontuk commented 1 week ago

I am running DeepSpeed-MII on a system with two NVIDIA A100X GPUs, using the following simple latency benchmark for inference:

import math
import time
import mii

batch_size = 16

inputs = [
    "DeepSpeed is a machine learning framework",
    "He is working on",
    "He has a",
    "He got all",
    "Everyone is happy and I can",
    "The new movie that got Oscar this year",
    "In the far far distance from our galaxy,",
    "Peace is the only way"
]

inputs *= math.ceil(batch_size / len(inputs))

pipe = mii.pipeline("facebook/opt-2.7b", max_length=50)

times = []

for i in range(30):
    start = time.time()
    outputs = pipe(inputs)
    end = time.time()
    times.append(end - start)

print(f"latency: {sum(times[3:]) / len(times[3:])}")
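(For context on the last line: the slice `times[3:]` discards the first three iterations as warmup before averaging. As a sketch, that steady-state average can be factored into a helper; `steady_state_mean` and the warmup count of 3 are my own choices for this benchmark, not anything required by MII:)

```python
def steady_state_mean(samples, warmup=3):
    # Drop the first `warmup` measurements, which absorb one-time
    # costs (extension compilation, KV-cache allocation, warmup runs).
    tail = samples[warmup:]
    if not tail:
        raise ValueError("need more samples than warmup iterations")
    return sum(tail) / len(tail)
```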

This code works fine if I run with 1 GPU:

deepspeed --num_gpus 1 mii-bench.py

But it throws the error mentioned in the title if I run it with more than one GPU:

deepspeed --num_gpus 2 mii-bench.py

Full execution log:

[2024-06-24 16:04:09,839] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2024-06-24 16:04:11,349] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-06-24 16:04:11,350] [INFO] [runner.py:568:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None mii-bench.py
[2024-06-24 16:04:12,992] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2024-06-24 16:04:14,619] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-06-24 16:04:14,619] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-06-24 16:04:14,619] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-06-24 16:04:14,619] [INFO] [launch.py:164:main] dist_world_size=2
[2024-06-24 16:04:14,619] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-06-24 16:04:14,620] [INFO] [launch.py:256:main] process 766235 spawned with command: ['/usr/bin/python3', '-u', 'mii-bench.py', '--local_rank=0']
[2024-06-24 16:04:14,621] [INFO] [launch.py:256:main] process 766236 spawned with command: ['/usr/bin/python3', '-u', 'mii-bench.py', '--local_rank=1']
[2024-06-24 16:04:16,232] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-24 16:04:16,324] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2024-06-24 16:04:17,824] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-24 16:04:17,953] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-24 16:04:17,953] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/rahamanm/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/rahamanm/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(

Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]
Fetching 6 files: 100%|██████████| 6/6 [00:00<00:00, 90200.09it/s]

Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]
Fetching 6 files: 100%|██████████| 6/6 [00:00<00:00, 79387.46it/s]
[2024-06-24 16:04:19,472] [INFO] [engine_v2.py:82:__init__] Building model...
Using /home/rahamanm/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/rahamanm/.cache/torch_extensions/py310_cu121/inference_core_ops/build.ninja...
/home/rahamanm/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module inference_core_ops...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module inference_core_ops...
Time to load inference_core_ops op: 0.10766816139221191 seconds
[2024-06-24 16:04:19,763] [INFO] [engine_v2.py:82:__init__] Building model...
Using /home/rahamanm/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/rahamanm/.cache/torch_extensions/py310_cu121/inference_core_ops/build.ninja...
/home/rahamanm/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module inference_core_ops...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/rahamanm/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
ninja: no work to do.
Loading extension module inference_core_ops...
Time to load inference_core_ops op: 0.11362981796264648 seconds
Detected CUDA files, patching ldflags
Emitting ninja build file /home/rahamanm/.cache/torch_extensions/py310_cu121/ragged_device_ops/build.ninja...
/home/rahamanm/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module ragged_device_ops...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/rahamanm/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
ninja: no work to do.
Loading extension module ragged_device_ops...
Time to load ragged_device_ops op: 0.09276843070983887 seconds
Using /home/rahamanm/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/rahamanm/.cache/torch_extensions/py310_cu121/ragged_ops/build.ninja...
/home/rahamanm/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module ragged_ops...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module ragged_device_ops...
Time to load ragged_device_ops op: 0.10586166381835938 seconds
Using /home/rahamanm/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
ninja: no work to do.
Loading extension module ragged_ops...
Time to load ragged_ops op: 0.08362889289855957 seconds
[2024-06-24 16:04:20,046] [INFO] [huggingface_engine.py:109:parameters] Loading checkpoint: /home/rahamanm/.cache/huggingface/hub/models--facebook--opt-2.7b/snapshots/905a4b602cda5c501f1b3a2650a4152680238254/pytorch_model.bin
Loading extension module ragged_ops...
Time to load ragged_ops op: 0.10570263862609863 seconds
[2024-06-24 16:04:20,133] [INFO] [huggingface_engine.py:109:parameters] Loading checkpoint: /home/rahamanm/.cache/huggingface/hub/models--facebook--opt-2.7b/snapshots/905a4b602cda5c501f1b3a2650a4152680238254/pytorch_model.bin
[2024-06-24 16:04:23,615] [INFO] [engine_v2.py:84:__init__] Model built.
[2024-06-24 16:04:23,810] [INFO] [engine_v2.py:84:__init__] Model built.
[2024-06-24 16:04:24,213] [INFO] [kv_cache.py:135:__init__] Allocating KV-cache 0 with shape: (32, 7647, 64, 2, 16, 80) consisting of 7647 blocks.
[2024-06-24 16:04:24,213] [INFO] [kv_cache.py:135:__init__] Allocating KV-cache 0 with shape: (32, 7647, 64, 2, 16, 80) consisting of 7647 blocks.
[2024-06-24 16:04:26,561] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:27,269] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:27,950] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:28,504] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:28,505] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:29,055] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:29,055] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:29,606] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:29,606] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:30,148] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:30,688] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:31,227] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:31,770] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:32,312] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:32,863] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:32,863] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 2 which does not exist.
[2024-06-24 16:04:32,863] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:32,863] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 10 which does not exist.
[2024-06-24 16:04:33,407] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:33,954] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:33,954] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 3 which does not exist.
[2024-06-24 16:04:34,496] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:34,496] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:35,034] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:35,573] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:35,574] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:36,125] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:36,669] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:37,208] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:37,208] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:37,747] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:38,830] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:38,831] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:39,370] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:39,909] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:40,982] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:40,982] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
[2024-06-24 16:04:41,522] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:42,061] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 1 which does not exist.
[2024-06-24 16:04:42,061] [WARNING] [ragged_manager.py:115:flush_sequence] Attempting to flush sequence 9 which does not exist.
latency: 0.542697650414926
[2024-06-24 16:04:44,653] [INFO] [launch.py:351:main] Process 766235 exits successfully.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/rahamanm/repos/DeepSpeedExamples/inference/huggingface/text-generation/mii-bench.py", line 26, in <module>
[rank1]:     outputs = pipe(inputs)
[rank1]:   File "/home/rahamanm/.local/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 570, in __call__
[rank1]:     self.schedule_requests()
[rank1]:   File "/home/rahamanm/.local/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 335, in schedule_requests
[rank1]:     self.reset_request_status()
[rank1]:   File "/home/rahamanm/.local/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 360, in reset_request_status
[rank1]:     assert last_r is not None, "Function to clear the KV cache is invoked, but no request consumes KV cache"
[rank1]: AssertionError: Function to clear the KV cache is invoked, but no request consumes KV cache
[2024-06-24 16:05:19,690] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 766235
[2024-06-24 16:05:19,691] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 766236
[2024-06-24 16:05:19,691] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python3', '-u', 'mii-bench.py', '--local_rank=1'] exits with return code = 1
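While this gets triaged, one mitigation I may try is gating the result handling on rank 0 only. This is purely a guess at a workaround (the assertion fires inside `pipe()` on rank 1, so it may not help), and `is_rank_zero` is my own helper, not an MII API; it only relies on the `LOCAL_RANK` environment variable that the deepspeed launcher sets for each spawned worker:

```python
import os

def is_rank_zero() -> bool:
    # The deepspeed launcher exports LOCAL_RANK into each worker's
    # environment; treat an unset variable as single-process rank 0.
    return int(os.getenv("LOCAL_RANK", "0")) == 0

# Hypothetical placement in mii-bench.py:
#   outputs = pipe(inputs)
#   if is_rank_zero():
#       print(outputs)
```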