aws-stdun opened this issue 1 year ago (status: Open)
Hi @aws-stdun @mrwyattii, I've also met this error:
self.allocate_workspace(self.config.hidden_size,
RuntimeError: Workspace is null.
What does this error mean? I checked both RAM and GPU memory and they're not full, so I guess it's not a memory issue. What else can I do to debug?
@aws-stdun Thanks for reporting this issue. I've slightly modified the code you provided to print out the memory usage at each sequence length and batch size. Could you please run again and share the output to help me debug the error you are seeing? Thanks!
import os
import time

import deepspeed
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from deepspeed.runtime.utils import see_memory_usage


def infer():
    n_infers = 1
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    world_size = int(os.getenv("WORLD_SIZE", "1"))

    model_org = "facebook"
    model_name = "opt-66b"
    hf_name = f"{model_org}/{model_name}"

    print("Loading model...")
    st = time.time()
    model = AutoModelForCausalLM.from_pretrained(hf_name, torch_dtype=torch.float16, low_cpu_mem_usage=True)
    tokenizer = AutoTokenizer.from_pretrained(hf_name, use_fast=False)
    print(f"Finished in {round((time.time() - st) / 60, 2)} mins")

    print("Splitting model...")
    st = time.time()
    model = deepspeed.init_inference(
        model,
        mp_size=world_size,
        dtype=model.dtype,
        replace_with_kernel_inject=True,
        replace_method="auto",
        max_out_tokens=2048,
    )
    print(os.environ)
    generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device=local_rank)
    print(f"Finished in {round((time.time() - st) / 60, 2)} mins")

    for max_len in (128, 256, 512, 1024, 2048):
        for batch_size in (1, 2, 4, 8, 16, 32):
            for _ in range(n_infers):
                generator(
                    "our story begins",
                    do_sample=True,
                    max_length=max_len,
                    num_return_sequences=batch_size,
                )
            # Report memory usage after each (max_len, batch_size) configuration
            see_memory_usage(f"max_len:{max_len}, batch_size:{batch_size}", force=True)


if __name__ == "__main__":
    with torch.inference_mode():
        infer()
Also, in regard to the performance degradation: I'm unable to get the OPT models running with the older version of DeepSpeed that you specified. Are you using the same script with that older version? How are you measuring latency?
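For reference, here's roughly the kind of per-call timing I have in mind (just a minimal sketch using the generator from the script above; the prompt, lengths, and run count are placeholders):

import time
import torch

def time_generation(generator, prompt="our story begins", max_length=128, n_runs=5):
    # Warm-up call so one-time kernel/workspace allocation is not counted
    generator(prompt, do_sample=True, max_length=max_length)
    latencies = []
    for _ in range(n_runs):
        torch.cuda.synchronize()
        start = time.time()
        generator(prompt, do_sample=True, max_length=max_length)
        torch.cuda.synchronize()
        latencies.append(time.time() - start)
    print(f"max_length={max_length}: "
          f"avg {sum(latencies) / len(latencies) * 1000:.1f} ms over {n_runs} runs")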
@Wenhan-Tan what model are you running and on what hardware?
Hi @mrwyattii, I was running GPT-3 on an A100 40GB. BTW, the error goes away when I reduce the batch size, so I believe it was limited by GPU memory, although it didn't show any CUDA OOM messages.
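One quick way to confirm that hypothesis is to look at the driver-level free memory right before the failing call, not only at torch's allocated/reserved numbers, since (as far as I understand) the inference workspace is allocated outside PyTorch's caching allocator. A minimal sketch, assuming a single visible GPU:

import torch

def report_gpu_memory(tag):
    # Driver-level view (includes memory taken outside PyTorch's caching allocator)
    free, total = torch.cuda.mem_get_info()
    # PyTorch's own bookkeeping
    allocated = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    print(f"[{tag}] free {free / 1e9:.2f} GB / total {total / 1e9:.2f} GB | "
          f"torch allocated {allocated / 1e9:.2f} GB, reserved {reserved / 1e9:.2f} GB")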
@mrwyattii I ran the code you provided, but I don't see any output when using the deepspeed runner. Is there more to it? BTW, this issue still exists in v0.8.3 (I can't even run batch size 1 with OPT-66).
@Wenhan-Tan @aws-stdun I know it has been many months, but if you are still seeing errors with the OPT model: I just tested the code that @aws-stdun provided with the latest DeepSpeed and found that the problem is caused by num_return_sequences=batch_size. Previously, I had assumed that transformers would handle this similar to how it handles the case where we pass multiple inputs into a pipeline object, where under the hood it is essentially running a for loop over all inputs. But that is not the case, and it would appear that the current implementation of DeepSpeed-Inference is not compatible with this option (similar to how we are not compatible with num_beams>1).
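If you do need multiple samples per prompt, a possible workaround (an untested sketch using the same generator pipeline as in the script above) is to pass a list of identical prompts, which the pipeline handles by looping over the inputs, instead of using num_return_sequences:

prompt = "our story begins"
n_samples = 4  # illustrative

# Instead of: generator(prompt, do_sample=True, max_length=128, num_return_sequences=n_samples)
outputs = generator([prompt] * n_samples, do_sample=True, max_length=128)

# Or loop explicitly, one sequence per call:
outputs = [generator(prompt, do_sample=True, max_length=128) for _ in range(n_samples)]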
@mrwyattii Hi, I haven't tested DeepSpeed since the last time I replied to this bug.
Describe the bug
I am able to use DeepSpeed to perform inference at sequence lengths 128 and 256, up to batch size 16 for both. Beyond batch size 16 and/or sequence length 256, I hit an error (which may be triggered by OOM, although it is unclear).
Here is the relevant portion of the stacktrace for seq lens 128/256 @ batch size 32:
Using a sequence length of 512---even at batch size 1---produces a clear OOM:
Furthermore, the initial results from smaller sequence lengths show subpar performance compared with DeepSpeed 0.5.9. The latency is almost 2x what it was before (even with the prior bugs). This is using DeepSpeed built from source, v0.8.1+d59b5729
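For a rough sense of scale on the memory side, here is a back-of-envelope fp16 KV-cache estimate. The shape numbers (64 layers, hidden size 9216 for OPT-66B) and the formula are my assumptions, not measurements, and this comes on top of the ~132 GB of fp16 weights split across the GPUs:

# Assumed OPT-66B shape: 64 decoder layers, hidden size 9216, fp16 (2 bytes/element)
NUM_LAYERS, HIDDEN_SIZE, BYTES_FP16 = 64, 9216, 2

def kv_cache_gb(batch_size, seq_len):
    # K and V caches: 2 tensors per layer, each (batch, seq_len, hidden_size) in fp16
    return 2 * NUM_LAYERS * batch_size * seq_len * HIDDEN_SIZE * BYTES_FP16 / 1e9

for seq_len in (128, 256, 512, 2048):
    print(f"seq_len={seq_len}: "
          f"batch 1 ~ {kv_cache_gb(1, seq_len):.1f} GB, "
          f"batch 32 ~ {kv_cache_gb(32, seq_len):.1f} GB")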
To Reproduce
Below is an example script, infer.py. To reproduce a specific configuration failure, adjust these lines:

To use:
Expected behavior
Expected this to mostly work, failing for the largest batch sizes / sequence lengths.
ds_report output
Screenshots N/A
System info (please complete the following information):
Docker context N/A
Additional context N/A