microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] [0.8.1] INT8 model loading/inference issue #2876

Open sindhuvahinis opened 1 year ago

sindhuvahinis commented 1 year ago

Describe the bug

We tested OPT, GPT-J, GPT-NeoX, and BLOOM-7B with INT8, and all of these models produce garbage outputs on DeepSpeed 0.8.1:

OPT: NCCL communication issue
GPT-NeoX 20B: produces garbage output
BLOOM-7B: shape '[1, 4, 32, 384]' is invalid for input of size 16384

How we tested: we generated INT8 checkpoints of each model and then loaded them back. Below is an example of doing the same with the DeepSpeed inference test suite: the first command saves the sharded INT8 checkpoint, the second loads it.

deepspeed --num_nodes 1 \
    --num_gpus 8 \
    inference-test.py \
    --use_kernel \
    --ds_inference \
    --use_meta_tensor \
    --name EleutherAI/gpt-neox-20b \
    --checkpoint_path /tmp/ws/gpt-neox-20b/ \
    --save_mp_checkpoint_path /tmp/ws/sharded-gpt-neox-20b/ \
    --dtype int8

deepspeed --num_nodes 1 \
    --num_gpus 8 \
    inference-test.py \
    --use_kernel \
    --ds_inference \
    --use_meta_tensor \
    --name EleutherAI/gpt-neox-20b \
    --checkpoint_path /tmp/ws/sharded-gpt-neox-20b/ \
    --dtype int8

More info here: https://github.com/microsoft/DeepSpeed/issues/2770

Creating a new issue to track the int8 checkpoint loading issue.

lanking520 commented 1 year ago

@HeyangQin

HeyangQin commented 1 year ago

Hi @lanking520 @sindhuvahinis, PR https://github.com/microsoft/DeepSpeed/pull/2875 has been merged to address part of the issue. For now, INT8 checkpoint saving/loading is still not fully functional due to kernel issues. As a workaround, I would suggest saving checkpoints in fp32/fp16 and then loading them with int8 for the time being.
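
For illustration, a minimal sketch of that workaround; the model name and prompt are placeholders, and it assumes the FP16 weights fit on a single GPU:

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-6.7b"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(name)
# Keep the saved/loaded weights in FP16 ...
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

# ... and only request INT8 when building the inference engine.
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.int8,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=20)[0]))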

lanking520 commented 1 year ago

Thanks for the info. Given the above context, INT8 inference (loading from an FP16 checkpoint) should at least work as expected, correct?

lanking520 commented 1 year ago

@HeyangQin So I would assume developers should follow this path: save the checkpoint in FP16, then load it and set dtype=int8 at init_inference time.

And this should work as expected. The only drawback is that developers may still hit a runtime GPU OOM when the FP16 weights are converted to INT8 at runtime (a rough mitigation sketch is below).
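
One thing that may help with that peak memory, sketched under the assumption that the FP16 copy can stay in host RAM until DeepSpeed injects its kernels (the model name is a placeholder):

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Load the FP16 weights on the CPU only (no GPU copy yet); low_cpu_mem_usage
# avoids materializing a second full copy of the weights in host RAM.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",        # placeholder model
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# init_inference then performs the INT8 conversion/kernel injection and moves
# the module onto the local GPU.
engine = deepspeed.init_inference(model, dtype=torch.int8, replace_with_kernel_inject=True)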

crazycth commented 1 year ago

@HeyangQin Loading the BLOOM model from an FP16 checkpoint and then setting dtype=int8 in init_inference does not work :(

Could you please look at this issue: https://github.com/microsoft/DeepSpeed/issues/2923? I found that other people are facing the same problem.

trianxy commented 1 year ago

Just wanted to +1 this issue: At DeepSpeed 0.9.0, using torch.int8 in

deepspeed.init_inference(model, dtype=torch.int8, replace_with_kernel_inject=True)

raises errors for various models. Below is some code to quickly reproduce this problem with the small models GPT-neo-125m, Bloom 560m and gpt2:

# run on NVIDIA A10G, CUDA Version 11.7, Python 3.9

from typing import Any
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from transformers import AutoTokenizer, AutoModelForCausalLM  # v4.28.1
import torch  # v1.13.1
import deepspeed  # v0.9.0

def print_next_token(model: Any) -> None:
    output = model(**inputs)
    token_id = torch.argmax(output.logits[0][-1])
    token = tokenizer.decode(token_id)
    print(f"{token=}")

architecture = "gpt2"
# architecture = "EleutherAI/gpt-neo-125m"
# architecture = "bigscience/bloom-560m"

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(architecture, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(architecture, low_cpu_mem_usage=True).to(device).eval()
inputs = tokenizer("George Washington was the first US", return_tensors="pt").to(device)

print_next_token(model) # prints ' president'

engine = deepspeed.init_inference(model, dtype=torch.int8, replace_with_kernel_inject=True)

print_next_token(engine.module) # -> error

Errors slightly differ, depending on the model:

gpt2 and gpt-neo-125m -> 
!!!! kernel execution error. (m: 768, n: 6, k: 2304, error: 13) 
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

bloom-560m -> 
!!!! kernel execution error. (m: 1024, n: 6, k: 3072, error: 13) 
RuntimeError: shape '[1, 6, 16, 192]' is invalid for input of size 6144

trianxy commented 1 year ago

Also wanted to point out that when using torch.int8 in deepspeed.init_inference(model, dtype=torch.int8, replace_with_kernel_inject=True), this code line is hit, which skips running WeightQuantization(...).model_quantize(...), and I am not sure whether this is intended and related.

ccing you @RezaYazdaniAminabadi and @jeffra since you may have worked on this piece of code in this commit
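
Continuing from the reproduction script in my previous comment, a quick way to check whether the quantization pass actually ran, assuming the engine keeps the result on a quantization_scales attribute (the attribute name is taken from the _convert_to_dtype call visible in the traceback further down this thread):

# After building the engine, check whether any weight-quantization scales
# were produced; None (or a missing attribute) suggests the step was skipped.
engine = deepspeed.init_inference(model, dtype=torch.int8, replace_with_kernel_inject=True)
scales = getattr(engine, "quantization_scales", None)
print("quantization scales present:", scales is not None)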

Moran232 commented 1 year ago

(quoting trianxy's reproduction above in full)

Same bug here: kernel execution error, with error code 13, 14, or 15.

SebastianBodza commented 1 year ago

(quoting trianxy's comment above about the code line that skips WeightQuantization(...).model_quantize(...))

Simply adjusting that statement (so that the quantization step is not skipped) does not work :)

model = deepspeed.init_inference(
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 161, in __init__
    self._convert_to_dtype(config)
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 524, in _convert_to_dtype
    model, self.quantization_scales = quantizer.model_quantize(self.module, self.injection_dict,
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/runtime/weight_quantizer.py", line 153, in model_quantize
    return quantized_module, torch.cat(all_scales)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
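
For what it's worth, the final error is just what torch.cat raises on an empty list, which means all_scales at weight_quantizer.py:153 was empty, i.e. no layer produced quantization scales. A one-line reproduction of the error itself:

import torch

# Reproduces the last line of the traceback above: concatenating an empty
# list of tensors is not allowed.
torch.cat([])  # RuntimeError: torch.cat(): expected a non-empty list of Tensors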