Open crazycth opened 1 year ago
@lekurile @jeffra @HeyangQin
Following https://github.com/microsoft/DeepSpeed/issues/2876, I tried to load the model in FP16 and then set `dtype = torch.int8` in `init_inference`, but it still fails:
You can reproduce this bug quite simply:
```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-3b")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-3b", torch_dtype="auto", device_map="auto")

# This bug is encountered regardless of whether fp16 weights are enabled or not
# ckpt = torch.load('/mlx_devbox/users/chengtianhao.cc/playground/old_playground/bloom_deploy_git/deploy/fp16/fp16.pth', map_location='cpu')
# model.load_state_dict(ckpt['model'])

# init_inference
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.int8,
    replace_with_kernel_inject=True,
)
model = engine.module

inputs = tokenizer.encode("hello world", return_tensors="pt").to("cuda")
model.generate(inputs)
```
https://github.com/microsoft/DeepSpeed/issues/2865 mentions the same problem.
Hey @crazycth - I encountered the same problem. Did you get any new insights into why it doesn't work?
Hey @trianxy Have you fixed this problem yet?
Not yet
Describe the bug
Inference on a BLOOM model fails when using `replace_with_kernel_inject = True` together with `dtype = torch.int8`.
Since this model was trained with torch, I load the weights with `torch.load` and then use the loaded model to initialize the engine (is this right? I tried to pass `checkpoint` to `init_inference()`, but it failed):
```python
ckpt = torch.load(self.opt.model_file, map_location='cpu')
self.model.load_state_dict(ckpt['model'])
```
Inference init:
```python
engine = deepspeed.init_inference(model.model, mp_size=1, dtype=torch.int8, replace_with_kernel_inject=True)
```
Inference error:
```
File "/usr/local/lib/python3.7/dist-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 202, in compute_attention
    mixed_x_layer = mixed_x_layer.view(*new_tensor_shape)
RuntimeError: shape '[9, 22, 32, 240]' is invalid for input of size 506880
```
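As a side note on the numbers in that error (my own arithmetic, not stated in the thread): for bloomz-3b (hidden size 2560, 32 heads, head dim 80), the target shape `[9, 22, 32, 240]` corresponds to the fused QKV projection (`3 * head_dim = 240` per head), and the actual tensor is exactly one-third of that size. This suggests the int8 path is producing a QKV tensor of the wrong width, though that reading is not confirmed in the issue:

```python
# Shape arithmetic for the error above (bloomz-3b: hidden=2560, 32 heads, head_dim=80).
# The kernel tries to view the fused QKV output as [batch, seq, heads, 3 * head_dim].
batch, seq, heads, head_dim = 9, 22, 32, 80

expected = batch * seq * heads * 3 * head_dim  # what the view asks for: [9, 22, 32, 240]
actual = 506880                                # size reported in the RuntimeError

print(expected, actual, expected // actual)    # expected is exactly 3x the actual size
```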
However, with `dtype = torch.half`, inference succeeds.
ds_report output
Screenshots
System info (please complete the following information):
- OS: Debian GNU/Linux 10
- GPU: 1x NVIDIA A10
- Python: 3.7.3
Additional context
Question: how do I load weights into `init_inference()` when the weights were generated by `torch.save()`?
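Regarding the `torch.save()` question: the usual pattern is to restore the state dict into the model first and then hand the weighted model to `deepspeed.init_inference`, which is essentially what the snippet above already does. A minimal sketch of just the save/restore round trip, using a toy module as a stand-in for the real model (the module and file names here are placeholders, not from the issue):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy stand-in for a real model; the pattern is the same for BLOOM.
model = nn.Linear(4, 4)

# Save the way the issue describes: a dict with a 'model' key.
path = os.path.join(tempfile.mkdtemp(), "ckpt.pth")
torch.save({"model": model.state_dict()}, path)

# Restore on CPU and load into a freshly constructed model. The restored
# model is what you would then pass to deepspeed.init_inference (not shown).
restored = nn.Linear(4, 4)
ckpt = torch.load(path, map_location="cpu")
restored.load_state_dict(ckpt["model"])

# The restored weights match the originals.
assert torch.equal(restored.weight, model.weight)
```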