[BUG] Int8 Inference Does Not Work For GPTJ

mallorbc commented 1 year ago

Describe the bug Trying to use DeepSpeed Inference with int8 does not work for GPTJ. I created an issue with more details on the DeepSpeed MII repo, but due to the nature of the issue, I feel it likely belongs here.

https://github.com/microsoft/DeepSpeed-MII/issues/155

To Reproduce Steps to reproduce the behavior:

Load GPTJ with float16

import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).cuda()

Load model with DeepSpeed

world_size = 1
dtype = torch.int8
engine = deepspeed.init_inference(model,
                                  mp_size=world_size,
                                  dtype=dtype,
                                  replace_method='auto',
                                  max_tokens=2048,
                                  replace_with_kernel_inject=True)

Try to generate tokens
Alternatively, use DeepSpeed MII for the same issue

Expected behavior I expect a memory reduction and a speed improvement, with little or no degradation in performance.

ds_report output

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.8.0+bf6b9802, bf6b9802, HEAD
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

System info (please complete the following information):

OS: Ubuntu 20.04
Two RTX 3090s
See the linked issue for DeepSpeed MII info

Docker context is very similar but not identical to this

pip list:

Package                   Version
accelerate                0.16.0
aiohttp                   3.8.4
aiosignal                 1.3.1
anyio                     3.6.2
async-timeout             4.0.2
asyncio                   3.4.3
attr                      0.3.2
attrs                     22.2.0
bitsandbytes-cuda117      0.30.1
certifi                   2022.12.7
charset-normalizer        3.0.1
click                     8.1.3
coloredlogs               15.0.1
datasets                  2.10.1
deepspeed                 0.8.0+bf6b9802
deepspeed-mii             0.0.5+bb801d3
dill                      0.3.6
evaluate                  0.4.0
fastapi                   0.89.1
filelock                  3.9.0
flatbuffers               23.3.3
frozenlist                1.3.3
fsspec                    2023.3.0
grpcio                    1.51.3
grpcio-tools              1.51.3
h11                       0.14.0
hjson                     3.1.0
huggingface-hub           0.12.0
humanfriendly             10.0
idna                      3.4
markdown-it-py            2.1.0
mdurl                     0.1.2
mpmath                    1.2.1
multidict                 6.0.4
multiprocess              0.70.14
nest-asyncio              1.5.6
ninja                     1.11.1
numpy                     1.24.2
nvidia-cublas-cu11        11.10.3.66
nvidia-cuda-nvrtc-cu11    11.7.99
nvidia-cuda-runtime-cu11  11.7.99
nvidia-cudnn-cu11         8.5.0.96
nvidia-pyindex            1.0.9
onnx                      1.13.1
onnxruntime               1.14.1
onnxruntime-gpu           1.14.1
optimum                   1.7.1
packaging                 23.0
pandas                    1.5.3
Pillow                    9.4.0
pip                       20.0.2
polygraphy                0.44.2
protobuf                  3.20.2
psutil                    5.9.4
py-cpuinfo                9.0.0
pyarrow                   11.0.0
pydantic                  1.10.4
Pygments                  2.14.0
python-dateutil           2.8.2
pytz                      2022.7.1
PyYAML                    6.0
redis                     4.5.0
regex                     2022.10.31
requests                  2.28.2
responses                 0.18.0
rich                      13.3.1
sentencepiece             0.1.97
setuptools                45.2.0
six                       1.16.0
sniffio                   1.3.0
starlette                 0.22.0
sympy                     1.11.1
timm                      0.6.12
tokenizers                0.13.2
torch                     1.13.1
torchaudio                0.13.1
torchvision               0.14.1
tqdm                      4.64.1
transformers              4.26.0
triton                    1.0.0
typing-extensions         4.4.0
Unidecode                 1.3.6
urllib3                   1.26.14
uvicorn                   0.20.0
wheel                     0.34.2
xxhash                    3.2.0
yarl                      1.8.2

Additional context I may be mistaken and this was never supposed to work out of the box. Perhaps MoQ is required?

mallorbc commented 1 year ago

With DeepSpeed 0.8.2 JIT I get a new error:

Setting pad_token_id to eos_token_id:50256 for open-end generation.
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
Free memory : 5.544067 (GigaBytes)
Total memory: 23.691101 (GigaBytes)
Requested memory: 1.375000 (GigaBytes)
Setting maximum total tokens (input + output) to 2048
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
  File "/app/server.py", line 249, in generate
    gen_text = gpt_model(prompt, do_sample=do_sample, max_length=total_max_length,min_length=total_min_length,temperature=temp_input,top_k=top_k_input,top_p=top_p_input,early_stopping=early_stopping_input,bad_words_ids=bad_word_ids,batch_size=len(prompt),num_beams=num_beams,penalty_alpha=penalty_alpha)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 210, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1065, in __call__
    outputs = [output for output in final_iterator]
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1065, in <listcomp>
    outputs = [output for output in final_iterator]
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 992, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 252, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 588, in _generate
    return self.module.generate(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1391, in generate
    return self.greedy_search(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2179, in greedy_search
    outputs = self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/gptj/modeling_gptj.py", line 836, in forward
    lm_logits = self.lm_head(hidden_states).to(torch.float32)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

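A log like the one above is easier to read once the repeated failures are tallied per matrix shape. The following is a minimal sketch that does this with the standard library only; the function name tally_failed_gemms and the sample LOG string are illustrative and not part of DeepSpeed. The shapes involving 4096, 12288 (3 x 4096) and 16384 (4 x 4096) look consistent with GPT-J-6B's hidden size, fused QKV projection and MLP width, though that reading is an interpretation rather than something the log states.

```python
import re
from collections import Counter

# A few lines copied from the log above; paste the full log to tally everything.
LOG = """\
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
"""

# Matches the "(m: ..., n: ..., k: ..., error: ...)" part of each message.
PATTERN = re.compile(r"\(m: (\d+), n: (\d+), k: (\d+), error: (\d+)\)")


def tally_failed_gemms(log_text):
    """Count failures per (m, n, k, error) tuple found in the log text."""
    return Counter(
        tuple(int(group) for group in match.groups())
        for match in PATTERN.finditer(log_text)
    )


counts = tally_failed_gemms(LOG)
for (m, n, k, err), hits in sorted(counts.items()):
    print(f"m={m:<6} n={n:<2} k={k:<6} error={err} x{hits}")
```

Every failure in the log carries the same error code (13), so the tally mainly shows which shapes are affected and how often.
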
mallorbc commented 1 year ago

If I run the code with something like this it seems to work:

gpt_model.model = deepspeed.init_inference(gpt_model.model,
                                           mp_size=world_size,
                                           dtype=dtype,
                                           max_tokens=args.max_tokens)

Removing replace_with_kernel_inject=True seems to fix the issues I have been having. Are there no supported optimized kernels for int8?

Edit: Actually it seems to be running with float16 still

trianxy commented 1 year ago

Hey @mallorbc - were you able to learn more about the above problem?

Whenever I used dtype=torch.int8, it either crashed, or, if it didn't, the accuracy and speed were the same as with dtype=torch.float16, and it seemed to me that nothing was actually changed inside the model.
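One way to check whether anything changed inside the model is to group its parameters by dtype and total their sizes. The sketch below is a hypothetical helper, not a DeepSpeed API: summarize_param_dtypes and print_summary are names made up for illustration, and the helper only assumes that each tensor exposes dtype, numel() and element_size(), as torch tensors do. Kernel-injected modules may keep weights outside named_parameters(), so treat the result as a first check rather than a full accounting.

```python
from collections import defaultdict


def summarize_param_dtypes(named_params):
    """Group parameters by dtype and total their element counts and bytes.

    `named_params` is any iterable of (name, tensor) pairs, for example
    `model.named_parameters()` on a torch.nn.Module.
    """
    summary = defaultdict(lambda: {"params": 0, "bytes": 0})
    for _name, tensor in named_params:
        entry = summary[str(tensor.dtype)]
        entry["params"] += tensor.numel()
        entry["bytes"] += tensor.numel() * tensor.element_size()
    return dict(summary)


def print_summary(summary):
    """Print one line per dtype with parameter count and size in MiB."""
    for dtype, entry in sorted(summary.items()):
        mib = entry["bytes"] / 2**20
        print(f"{dtype}: {entry['params']:,} params, {mib:.1f} MiB")


# Usage (needs torch and a loaded model or engine, so it is not run here):
#   print_summary(summarize_param_dtypes(engine.module.named_parameters()))
```

If every entry still reports torch.float16 after init_inference with dtype=torch.int8, that would match the observation that memory and speed do not change.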

Moran232 commented 1 year ago

Hey @trianxy, I see you under most of the int8 questions. Have you fixed your problem with int8?

StarLooo commented 10 months ago

I have the same questions when using DeepSpeed inference:
(1) It seems that replace_with_kernel_inject=True conflicts with dtype=torch.int8 and causes "CUDA error: an illegal memory access was encountered".
(2) With replace_with_kernel_inject=False, I found that dtype=torch.int8 and dtype=torch.float16 are exactly the same in inference speed and GPU memory.
(3) With replace_with_kernel_inject=True and dtype=torch.float16, I found it costs more GPU memory than replace_with_kernel_inject=False with dtype=torch.float16.
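For reference, here is a rough sketch of the weight-memory difference one would expect if int8 actually took effect. The parameter count is an assumption (a round 6 billion for GPT-J-6B), and the estimate covers weights only, ignoring activations, the KV cache, and any quantization scales.

```python
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 2**30


# Assumption: roughly 6 billion parameters, a round figure for illustration.
NUM_PARAMS = 6_000_000_000

for dtype in ("float16", "int8"):
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under these assumptions the weights take about 11 GiB in float16 and about half that in int8, so identical GPU memory for both settings suggests the weights are still stored in float16.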

siva-sankar-a commented 5 months ago

Could you let me know if this issue has been solved? Can I use deepspeed inference with int8 and quantization?

mikeymezher commented 3 months ago

Looking at the kernels called in deepspeed/model_implementations/ds_transformer and the corresponding ops, it looks like torch.int8 types all call float16 kernel functions.

kiucho commented 1 month ago

Still can't use replace_with_kernel_inject=True and dtype=torch.int8 at the same time. Is there any progress?
