Hi, I am running into an error when serving Qwen1.5-32B with fp6 quantization. Could you please help me out? Thank you.
My code is below:
import mii
pipe = mii.pipeline('/mymodel/Qwen1.5-32B-fp16', quantization_mode='wf6af16', tensor_parallel=8)
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)
And the error is:
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
python: /root/miniconda3/envs/pytorch22/lib/python3.10/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/weight_prepacking.h:151: void weight_matrix_prepacking(int*, size_t, size_t): Assertion `K % 64 == 0' failed.
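In case it helps with triage, my guess is that after tensor-parallel splitting, the K dimension of some per-GPU weight shard is no longer a multiple of 64, which is what the prepacking assertion checks. Below is a quick sanity check I ran; the hidden_size=5120 and intermediate_size=27392 values are what I believe Qwen1.5-32B's config.json contains (please correct me if I misread them):

# FP6 (wf6af16) prepacking asserts K % 64 == 0 on each per-GPU weight shard.
# hidden_size / intermediate_size below are assumed from Qwen1.5-32B's config.json.
hidden_size = 5120
intermediate_size = 27392
tensor_parallel = 8
for name, k in [("hidden_size (attn / MLP up-proj input)", hidden_size),
                ("intermediate_size (MLP down-proj input)", intermediate_size)]:
    shard_k = k // tensor_parallel
    print(f"{name}: K per shard = {shard_k}, shard_k % 64 = {shard_k % 64}")

With tensor_parallel=8 this gives a shard K of 3424 for intermediate_size, and 3424 % 64 == 32, which would trip the `K % 64 == 0` assertion. Could this be the cause, and is there a recommended way to serve such a model with wf6af16?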