Open crazycth opened 1 year ago
@lekurile @jeffra @HeyangQin
Following https://github.com/microsoft/DeepSpeed/issues/2876, I tried to load the model in FP16 and then set `dtype = torch.int8` in `init_inference`, but it still fails:
You can reproduce this bug quite simply:
```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-3b")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-3b", torch_dtype="auto", device_map="auto")

# This bug is encountered regardless of whether fp16 weights are enabled or not
# ckpt = torch.load('/mlx_devbox/users/chengtianhao.cc/playground/old_playground/bloom_deploy_git/deploy/fp16/fp16.pth', map_location='cpu')
# model.load_state_dict(ckpt['model'])

# init_inference
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.int8,
    replace_with_kernel_inject=True,
)
model = engine.module

inputs = tokenizer.encode("hello world", return_tensors="pt").to("cuda")
model.generate(inputs)
```
https://github.com/microsoft/DeepSpeed/issues/2865 mentions the same problem.
Hey @crazycth - I encountered the same problem. Did you get any new insights into why it doesn't work?
Hey @trianxy Have you fixed this problem yet?
Not yet
Describe the bug
Inference on a BLOOM model fails when using `replace_with_kernel_inject = True` together with `dtype = torch.int8`.
Since this model was trained with torch, I load the weights with `torch.load` and then use the loaded model to initialize the engine (is this right? I tried to pass `checkpoint` to `init_inference()`, but it failed):
```python
ckpt = torch.load(self.opt.model_file, map_location='cpu')
self.model.load_state_dict(ckpt['model'])
```
Inference init:
```python
engine = deepspeed.init_inference(model.model, mp_size=1, dtype=torch.int8, replace_with_kernel_inject=True)
```
Inference error:
```
File "/usr/local/lib/python3.7/dist-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 202, in compute_attention
    mixed_x_layer = mixed_x_layer.view(*new_tensor_shape)
RuntimeError: shape '[9, 22, 32, 240]' is invalid for input of size 506880
```
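As a side note on the numbers in that error (my own arithmetic, not stated in the thread): for bloomz-3b (hidden size 2560, 32 heads, head dim 80), the target shape `[9, 22, 32, 240]` corresponds to the fused QKV projection (`3 * head_dim = 240` per head), and the actual tensor is exactly one-third of that size. This suggests the int8 path is producing a QKV tensor of the wrong width, though that reading is not confirmed in the issue:

```python
# Shape arithmetic for the error above (bloomz-3b: hidden=2560, 32 heads, head_dim=80).
# The kernel tries to view the fused QKV output as [batch, seq, heads, 3 * head_dim].
batch, seq, heads, head_dim = 9, 22, 32, 80

expected = batch * seq * heads * 3 * head_dim  # what the view asks for: [9, 22, 32, 240]
actual = 506880                                # size reported in the RuntimeError

print(expected, actual, expected // actual)    # expected is exactly 3x the actual size
```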
However, with `dtype = torch.half`, inference succeeds.
ds_report output
Screenshots
System info (please complete the following information):
- OS: Debian GNU/Linux 10
- GPU: 1x NVIDIA A10
- Python: 3.7.3
Additional context
Question: how do I load weights into `init_inference()` when the weights were generated by `torch.save()`?
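Regarding the `torch.save()` question: the usual pattern is to restore the state dict into the model first and then hand the weighted model to `deepspeed.init_inference`, which is essentially what the snippet above already does. A minimal sketch of just the save/restore round trip, using a toy module as a stand-in for the real model (the module and file names here are placeholders, not from the issue):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy stand-in for a real model; the pattern is the same for BLOOM.
model = nn.Linear(4, 4)

# Save the way the issue describes: a dict with a 'model' key.
path = os.path.join(tempfile.mkdtemp(), "ckpt.pth")
torch.save({"model": model.state_dict()}, path)

# Restore on CPU and load into a freshly constructed model. The restored
# model is what you would then pass to deepspeed.init_inference (not shown).
restored = nn.Linear(4, 4)
ckpt = torch.load(path, map_location="cpu")
restored.load_state_dict(ckpt["model"])

# The restored weights match the originals.
assert torch.equal(restored.weight, model.weight)
```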