microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] OPT-66B: OOM at reasonable inference sizes #2747

Open aws-stdun opened 1 year ago

aws-stdun commented 1 year ago

Describe the bug

I am able to use DeepSpeed to perform inference at sequence lengths 128 and 256, up to batch size 16 for both. Beyond batch size 16 and/or sequence length 256, I hit an error (which may be triggered by OOM, although it is unclear).

Here is the relevant portion of the stacktrace for seq lens 128/256 @ batch size 32:

    outputs = self.model.decoder(
  File "/home/ubuntu/stdun/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/stdun/venv/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "/home/ubuntu/stdun/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/stdun/DeepSpeed/deepspeed/model_implementations/transformers/ds_transformer.py", line 123, in forward
    self.allocate_workspace(self.config.hidden_size,
RuntimeError: Workspace is null.

Using a sequence length of 512---even at batch size 1---produces a clear OOM:

  File "/home/ubuntu/stdun/DeepSpeed/deepspeed/ops/transformer/inference/ds_mlp.py", line 45, in __init__
    self.output_w = nn.Parameter(torch.empty(intm_size_per_partition,
RuntimeError: CUDA out of memory. Tried to allocate 648.00 MiB (GPU 0; 39.41 GiB total capacity; 38.29 GiB already allocated; 62.50 MiB free; 38.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
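For reference, the allocator hint in that error message can be tried by setting PyTorch's allocator config when launching; the split size below is only an illustrative value, not a known fix for this OOM:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 deepspeed --num_gpus 8 infer.py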

Furthermore, the initial results from the smaller sequence lengths show subpar performance compared with DeepSpeed 0.5.9: latency is almost 2x what it was before (even with the prior bugs). This is using DeepSpeed built from source, v0.8.1+d59b5729.

To Reproduce

Below is an example script infer.py. To reproduce a specific configuration failure, adjust these lines:

    for max_len in (128, 256, 512, 1024, 2048):
        for batch_size in (1, 2, 4, 8, 16, 32):

infer.py

import os
import time

import deepspeed
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

def infer():
    n_infers = 10
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    world_size = int(os.getenv("WORLD_SIZE", "1"))
    model_org = "facebook"
    model_name = "opt-66b"
    hf_name = f"{model_org}/{model_name}"

    print("Loading model...")
    st = time.time()
    model = AutoModelForCausalLM.from_pretrained(hf_name, torch_dtype=torch.float16, low_cpu_mem_usage=True)
    tokenizer = AutoTokenizer.from_pretrained(hf_name, use_fast=False)
    print(f"Finished in {round((time.time() - st) / 60, 2)} mins")

    print("Splitting model...")
    st = time.time()
    model = deepspeed.init_inference(model,
        mp_size=world_size,  # tensor-parallel degree (one partition per GPU)
        dtype=model.dtype,
        replace_with_kernel_inject=True,  # inject DeepSpeed's optimized inference kernels
        replace_method="auto",
    )
    print(os.environ)
    generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device=local_rank)
    print(f"Finished in {round((time.time() - st) / 60, 2)} mins")

    for max_len in (128, 256, 512, 1024, 2048):
        for batch_size in (1, 2, 4, 8, 16, 32):
            for _ in range(n_infers):
                generator(
                    "our story begins",
                    do_sample=True,
                    max_length=max_len,
                    num_return_sequences=batch_size,
                )

if __name__ == "__main__":
    with torch.inference_mode():
        infer()

To use:

deepspeed --num_gpus 8 infer.py

Expected behavior

I expected this to mostly work, failing only for the largest batch sizes / sequence lengths.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
/home/ubuntu/stdun/venv/lib/python3.9/site-packages/setuptools/distutils_patch.py:25: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
  warnings.warn(
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/stdun/venv/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1+cu116
deepspeed install path ........... ['/home/ubuntu/stdun/DeepSpeed/deepspeed']
deepspeed info ................... 0.8.1+d59b5729, d59b5729, master
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6

Screenshots N/A

System info (please complete the following information):

Docker context N/A

Additional context N/A

Wenhan-Tan commented 1 year ago

Hi @aws-stdun @mrwyattii , I've also hit this error: self.allocate_workspace(self.config.hidden_size, RuntimeError: Workspace is null. What does this error mean? I checked both RAM and GPU memory and neither is full, so I guess it's not a memory issue. What else can I do to debug?

mrwyattii commented 1 year ago

@aws-stdun Thanks for reporting this issue. I've slightly modified the code you provided to print out the memory usage at each sequence length and batch size. Could you please run again and share the output to help me debug the error you are seeing? Thanks!

import os
import time

import deepspeed
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from deepspeed.runtime.utils import see_memory_usage

def infer():
    n_infers = 1
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    world_size = int(os.getenv("WORLD_SIZE", "1"))
    model_org = "facebook"
    model_name = "opt-66b"
    hf_name = f"{model_org}/{model_name}"

    print("Loading model...")
    st = time.time()
    model = AutoModelForCausalLM.from_pretrained(hf_name, torch_dtype=torch.float16, low_cpu_mem_usage=True)
    tokenizer = AutoTokenizer.from_pretrained(hf_name, use_fast=False)
    print(f"Finished in {round((time.time() - st) / 60, 2)} mins")

    print("Splitting model...")
    st = time.time()
    model = deepspeed.init_inference(model,
        mp_size=world_size,
        dtype=model.dtype,
        replace_with_kernel_inject=True,
        replace_method="auto",
        max_out_tokens=2048,  # size the inference workspace for up to 2048 output tokens
    )
    print(os.environ)
    generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device=local_rank)
    print(f"Finished in {round((time.time() - st) / 60, 2)} mins")

    for max_len in (128, 256, 512, 1024, 2048):
        for batch_size in (1, 2, 4, 8, 16, 32):
            for _ in range(n_infers):
                generator(
                    "our story begins",
                    do_sample=True,
                    max_length=max_len,
                    num_return_sequences=batch_size,
                )
            see_memory_usage(f"max_len:{max_len}, batch_size:{batch_size}", force=True)

if __name__ == "__main__":
    with torch.inference_mode():
        infer()

Also, in regard to the performance degradation: I'm unable to get the OPT models running with the older version of DeepSpeed that you specified. Are you using the same script for that older version? How are you measuring latency?
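For what it's worth, a minimal way to time each configuration would be a sketch like the one below, which reuses generator, max_len, and batch_size from the script above and syncs the GPU before reading the clock:

import time
import torch

torch.cuda.synchronize()  # make sure prior GPU work has finished before starting the timer
start = time.perf_counter()
generator(
    "our story begins",
    do_sample=True,
    max_length=max_len,
    num_return_sequences=batch_size,
)
torch.cuda.synchronize()  # wait for generation to fully complete before stopping the timer
print(f"max_len={max_len}, batch_size={batch_size}: {time.perf_counter() - start:.2f}s")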

mrwyattii commented 1 year ago

> Hi @aws-stdun @mrwyattii , I've also hit this error: self.allocate_workspace(self.config.hidden_size, RuntimeError: Workspace is null. What does this error mean? I checked both RAM and GPU memory and neither is full, so I guess it's not a memory issue. What else can I do to debug?

@Wenhan-Tan what model are you running and on what hardware?

Wenhan-Tan commented 1 year ago

> > Hi @aws-stdun @mrwyattii , I've also hit this error: self.allocate_workspace(self.config.hidden_size, RuntimeError: Workspace is null. What does this error mean? I checked both RAM and GPU memory and neither is full, so I guess it's not a memory issue. What else can I do to debug?
>
> @Wenhan-Tan what model are you running and on what hardware?

Hi @mrwyattii , I was running GPT3 on A100 40GB. BTW, the error is gone when I reduce batch size, so I believe it was limited by the GPU memory although it didn't show any CUDA OOM messages.

aws-stdun commented 1 year ago

@mrwyattii I ran the code you provided, but I don't see any output when using the deepspeed launcher. Is there more to it? Btw, this issue still exists in v0.8.3 (I can't even run batch size 1 with OPT-66B).

mrwyattii commented 1 year ago

@Wenhan-Tan @aws-stdun I know it has been many months, but if you are still seeing errors with the OPT model:

I just tested the code that @aws-stdun provided with the latest DeepSpeed and found that the problem is caused by num_return_sequences=batch_size. Previously, I had assumed that transformers would handle this the same way it handles passing multiple inputs to a pipeline object, where under the hood it essentially runs a for loop over all inputs. But that is not the case, and it appears the current implementation of DeepSpeed-Inference is not compatible with this option (just as it is not compatible with num_beams>1).
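If multiple samples per prompt are still needed, one possible workaround (an untested sketch, based on the pipeline behavior described above) is to duplicate the prompt and pass a list of inputs instead of using num_return_sequences:

# each element of the input list is generated independently, which avoids
# num_return_sequences while still producing batch_size samples per prompt
outputs = generator(
    ["our story begins"] * batch_size,
    do_sample=True,
    max_length=max_len,
)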

Wenhan-Tan commented 1 year ago

@mrwyattii Hi, I haven't tested DeepSpeed since the last time I replied to this bug.