microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0

Int8 deployment does not work for GPTJ #155

Open mallorbc opened 1 year ago

mallorbc commented 1 year ago

Trying to deploy GPTJ with int8 leads to the server crashing and the following stack trace:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/lib/python3.8/runpy.py:194 in _run_module_as_main                       │
│                                                                              │
│   191 │   main_globals = sys.modules["__main__"].__dict__                    │
│   192 │   if alter_argv:                                                     │
│   193 │   │   sys.argv[0] = mod_spec.origin                                  │
│ ❱ 194 │   return _run_code(code, main_globals, None,                         │
│   195 │   │   │   │   │    "__main__", mod_spec)                             │
│   196                                                                        │
│   197 def run_module(mod_name, init_globals=None,                            │
│                                                                              │
│ /usr/lib/python3.8/runpy.py:87 in _run_code                                  │
│                                                                              │
│    84 │   │   │   │   │      __loader__ = loader,                            │
│    85 │   │   │   │   │      __package__ = pkg_name,                         │
│    86 │   │   │   │   │      __spec__ = mod_spec)                            │
│ ❱  87 │   exec(code, run_globals)                                            │
│    88 │   return run_globals                                                 │
│    89                                                                        │
│    90 def _run_module_code(code, init_globals=None,                          │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/mii/launch/multi_gpu_server.py:70 in  │
│ <module>                                                                     │
│                                                                              │
│   67                                                                         │
│   68 if __name__ == "__main__":                                              │
│   69 │   # python -m mii.launch.multi_gpu_server                             │
│ ❱ 70 │   main()                                                              │
│   71                                                                         │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/mii/launch/multi_gpu_server.py:56 in  │
│ main                                                                         │
│                                                                              │
│   53 │   local_rank = int(os.getenv('LOCAL_RANK', '0'))                      │
│   54 │   port = args.port + local_rank                                       │
│   55 │                                                                       │
│ ❱ 56 │   inference_pipeline = load_models(task_name=args.task_name,          │
│   57 │   │   │   │   │   │   │   │   │    model_name=args.model,             │
│   58 │   │   │   │   │   │   │   │   │    model_path=args.model_path,        │
│   59 │   │   │   │   │   │   │   │   │    ds_optimize=args.ds_optimize,      │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/mii/models/load_models.py:46 in       │
│ load_models                                                                  │
│                                                                              │
│    43 │   │   if "bigscience/bloom" in model_name:                           │
│    44 │   │   │   assert mii_config.dtype == torch.half or mii_config.dtype  │
│    45 │   │   │   assert mii_config.enable_cuda_graph == False, "Bloom model │
│ ❱  46 │   │   inference_pipeline = hf_provider(model_path, model_name, task_ │
│    47 │   elif provider == mii.constants.ModelProvider.ELEUTHER_AI:          │
│    48 │   │   from mii.models.providers.eleutherai import eleutherai_provide │
│    49 │   │   assert mii_config.dtype == torch.half, "gpt-neox only support  │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/mii/models/providers/huggingface.py:1 │
│ 2 in hf_provider                                                             │
│                                                                              │
│    9 │   else:                                                               │
│   10 │   │   local_rank = int(os.getenv('LOCAL_RANK', '0'))                  │
│   11 │   │   device = torch.device(f"cuda:{local_rank}")                     │
│ ❱ 12 │   inference_pipeline = pipeline(                                      │
│   13 │   │   task_name,                                                      │
│   14 │   │   model=model_name,                                               │
│   15 │   │   device=device,                                                  │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/transformers/pipelines/__init__.py:75 │
│ 4 in pipeline                                                                │
│                                                                              │
│   751 │   # Forced if framework already defined, inferred if it's None       │
│   752 │   # Will load the correct model if possible                          │
│   753 │   model_classes = {"tf": targeted_task["tf"], "pt": targeted_task["p │
│ ❱ 754 │   framework, model = infer_framework_load_model(                     │
│   755 │   │   model,                                                         │
│   756 │   │   model_classes=model_classes,                                   │
│   757 │   │   config=config,                                                 │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py:266 in │
│ infer_framework_load_model                                                   │
│                                                                              │
│    263 │   │   │   │   continue                                              │
│    264 │   │                                                                 │
│    265 │   │   if isinstance(model, str):                                    │
│ ❱  266 │   │   │   raise ValueError(f"Could not load model {model} with any  │
│    267 │                                                                     │
│    268 │   framework = "tf" if model.__class__.__name__.startswith("TF") els │
│    269 │   return framework, model

Guidance for how to deploy the model as int8 would be great. Do I have to quantize the model on my own before I can do this?
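
For reference, my deployment script looks roughly like the MII text-generation example (the deployment name and query string are placeholders); the crash above happens while the server loads the model inside mii.deploy:

import mii

# int8 is the only non-default setting; tensor_parallel=1 in my case.
mii_config = {"dtype": "int8", "tensor_parallel": 1}

# The server crashes with the traceback above while loading the model here.
mii.deploy(
    task="text-generation",
    model="EleutherAI/gpt-j-6B",
    deployment_name="gptj-int8-deploy",
    mii_config=mii_config,
)

# Never reached when the deployment fails:
generator = mii.mii_query_handle("gptj-int8-deploy")
print(generator.query({"query": "DeepSpeed is"}, do_sample=True, max_length=128))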

mrwyattii commented 1 year ago

@mallorbc I've just created a branch that should fix this error. Could you try installing this version and trying again? pip install git+https://github.com/microsoft/DeepSpeed-MII@mrwyattii/enable-int8-HF

The root of the problem is that we cannot load the model via HuggingFace with the int8 datatype. The fix initially loads the model in fp16 when the dtype is set to int8, and then converts it to int8 when the model is handed off to DeepSpeed-Inference.
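
Outside of MII, the same pattern can be sketched like this (not the exact code on that branch, just the idea):

import torch
import deepspeed
from transformers import pipeline

# Load through HuggingFace in fp16 (HF cannot load the model directly as int8 here)...
pipe = pipeline("text-generation", model="EleutherAI/gpt-j-6B", torch_dtype=torch.float16, device=0)

# ...then let DeepSpeed-Inference convert to int8 when it injects its kernels.
pipe.model = deepspeed.init_inference(
    pipe.model,
    mp_size=1,
    dtype=torch.int8,
    replace_with_kernel_inject=True,
)

That is the general idea the branch implements inside MII's model loading.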

mallorbc commented 1 year ago

@mrwyattii Thanks for the help, but this did not fix the issue. I have tried getting this working both with and without DeepSpeed-MII, i.e. with DeepSpeed directly and with DeepSpeed through DeepSpeed-MII (in case I was doing something wrong).

Each time the error is the same. Please see the stack trace below:

  File "/app/server.py", line 220, in generate
    gen_text = gpt_model.query({"query":prompt}, do_sample=do_sample, max_length=total_max_length,min_length=total_min_length,temperature=temp_input,top_k=top_k_input,top_p=top_p_input,early_stopping=early_stopping_input,repetition_penalty=rep_penalty_input,penalty_alpha=penalty_alpha)
  File "/usr/local/lib/python3.8/dist-packages/mii/client.py", line 123, in query
    response = self.asyncio_loop.run_until_complete(
  File "/usr/local/lib/python3.8/dist-packages/nest_asyncio.py", line 90, in run_until_complete
    return f.result()
  File "/usr/lib/python3.8/asyncio/futures.py", line 178, in result
    raise self._exception
  File "/usr/lib/python3.8/asyncio/tasks.py", line 282, in __step
    result = coro.throw(exc)
  File "/usr/local/lib/python3.8/dist-packages/mii/client.py", line 107, in _query_in_tensor_parallel
    await responses[0]
  File "/usr/lib/python3.8/asyncio/futures.py", line 260, in __await__
    yield self  # This tells Task to wait for completion.
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
  File "/usr/lib/python3.8/asyncio/futures.py", line 178, in result
    raise self._exception
  File "/usr/lib/python3.8/asyncio/tasks.py", line 280, in __step
    result = coro.send(None)
  File "/usr/local/lib/python3.8/dist-packages/mii/client.py", line 70, in _request_async_response
    proto_response = await getattr(self.stub, conversions["method"])(proto_request)
  File "/usr/local/lib/python3.8/dist-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "Exception calling application: invalid multinomial distribution (sum of probabilities <= 0)"
    debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:50050 {created_time:"2023-03-07T06:20:02.49342625+00:00", grpc_status:2, grpc_message:"Exception calling application: invalid multinomial distribution (sum of probabilities <= 0)"}"

I am unsure whether this is supposed to work out of the box for GPTJ. In the docs, I see there is a quantization config option, which takes the quantization settings you get from MoQ. At the same time, I know that you can get out-of-the-box int8 with methods like bitsandbytes.

See here

If this is supposed to work, there is some bug. If I need to do MoQ first, then I am just using DeepSpeed wrong.
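
For comparison, the out-of-the-box int8 path I mean with bitsandbytes looks roughly like this (needs the bitsandbytes and accelerate packages, no DeepSpeed involved):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load_in_8bit quantizes the linear layers via bitsandbytes at load time;
# no MoQ-style calibration or pre-quantized checkpoint is needed.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))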

mrwyattii commented 1 year ago

Ah, I didn't try running a query with that branch, but I see the same error when I do. I recall int8 working with GPT-J several months ago when loading with meta tensors. I'll set up an environment and do some more testing with DeepSpeed.

Also, thank you for opening the DeepSpeed issue - I will follow up with this error there.
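
For reference, the meta-tensor loading path I have in mind looks roughly like this; the checkpoint JSON layout and the local paths here are from memory and may not be exactly right for GPT-J:

import json
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

model_dir = "/path/to/gpt-j-6B"  # local directory with config + .bin shards (placeholder)
config = AutoConfig.from_pretrained(model_dir)

# Build the model on the meta device so no real weights are materialized yet.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

# Point DeepSpeed-Inference at the real weights; this schema is approximate.
checkpoint = {"type": "DS_MODEL", "checkpoints": ["pytorch_model.bin"], "version": 1.0}
with open("checkpoint.json", "w") as f:
    json.dump(checkpoint, f)

engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.int8,
    base_dir=model_dir,
    checkpoint="checkpoint.json",
    replace_with_kernel_inject=True,
)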

mallorbc commented 1 year ago

Thanks for your help. I hope you find a solution. I will see if I can find anything that may help based on the meta tensors you mentioned.

mallorbc commented 1 year ago

With DeepSpeed 0.8.2 (JIT) I get a new error:

Setting pad_token_id to eos_token_id:50256 for open-end generation.
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
Free memory : 5.544067 (GigaBytes)
Total memory: 23.691101 (GigaBytes)
Requested memory: 1.375000 (GigaBytes)
Setting maximum total tokens (input + output) to 2048
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
  File "/app/server.py", line 249, in generate
    gen_text = gpt_model(prompt, do_sample=do_sample, max_length=total_max_length,min_length=total_min_length,temperature=temp_input,top_k=top_k_input,top_p=top_p_input,early_stopping=early_stopping_input,bad_words_ids=bad_word_ids,batch_size=len(prompt),num_beams=num_beams,penalty_alpha=penalty_alpha)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 210, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1065, in __call__
    outputs = [output for output in final_iterator]
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1065, in <listcomp>
    outputs = [output for output in final_iterator]
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 992, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 252, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 588, in _generate
    return self.module.generate(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1391, in generate
    return self.greedy_search(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2179, in greedy_search
    outputs = self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/gptj/modeling_gptj.py", line 836, in forward
    lm_logits = self.lm_head(hidden_states).to(torch.float32)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

Yangruipis commented 1 year ago

same issue, version details:

My model is bigscience/bloom-7b1, and the error only occurs when running inference with multiple GPUs and the int8 dtype.

mallorbc commented 1 year ago

@Yangruipis For int8 you didn't need to do any quantization of the weights beforehand, right?

Yangruipis commented 1 year ago

> @Yangruipis For int8 you didn't need to do any quantization of the weights beforehand, right?

Yes, just change the dtype parameter of deepspeed.init_inference from fp16 to int8; this works with 1 GPU only.
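
i.e., roughly this (bloom-7b1 as in my setup; mp_size is the number of GPUs):

import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1", torch_dtype=torch.float16)

# dtype=torch.float16 -> torch.int8 is the only change I make.
# With mp_size=1 this runs fine; with mp_size > 1 (multi-GPU) it hits the error above.
model = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.int8,
    replace_with_kernel_inject=True,
)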

Yangruipis commented 1 year ago

@mallorbc I've made it with https://github.com/triton-inference-server/fastertransformer_backend.

My model is bloom-175b, and it worked on 8 x A100 (80G) with the int8 dtype. In my tests, FasterTransformer is the fastest inference framework right now (compared to DeepSpeed and accelerate), and int8 is faster than fp16.

What you need to do is build the triton_with_ft image with the latest FasterTransformer code base.
