Open · HamidShojanazeri opened this issue 2 years ago
Hi Hamid,

Thanks for trying the quantization technique and reporting these issues. Yes, you are right about the bug with the input_mask; I will send a PR to fix it. Regarding performance: as you say, there is no improvement, because the kernels we have released so far do not fuse the dequantization with the GeMMs during inference. We have customized kernels for this, which we will release as a binary to help improve performance. Thanks for mentioning the need for the e2e tutorial; I will work on adding it.

Best,
Reza
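To make the fusion point concrete, here is a minimal sketch (not the DeepSpeed kernels themselves) of the unfused path: the int8 weight is dequantized to fp16 up front, and then an ordinary fp16 GeMM runs, so the GeMM costs exactly what it would for the unquantized model and the dequantization is pure overhead. A fused kernel would instead dequantize tile-by-tile inside the GeMM.

```python
import torch

def unfused_int8_linear(x, w_int8, scale):
    """Illustrative unfused int8 linear layer (names are hypothetical).

    x:      fp16 activations, shape (batch, in_features)
    w_int8: int8 weights, shape (out_features, in_features)
    scale:  fp16 per-tensor dequantization scale
    """
    # Dequantize the whole weight matrix first: an extra kernel launch
    # plus extra memory traffic, with no compute savings downstream.
    w_fp16 = w_int8.to(torch.float16) * scale
    # Then run a regular fp16 GeMM -- same cost as the fp16 model.
    return x @ w_fp16.t()
```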
Hey @RezaYazdaniAminabadi, where can I find the customized kernels to improve the performance?
I have a question about MoQ usage and would appreciate the help. I ran the example that trains a bert-base model with MoQ, and I am now trying to load it back for inference using the code snippet below.
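A minimal sketch of that loading code, assuming a Hugging Face bert-base checkpoint and DeepSpeed's init_inference entry point (the checkpoint path and the exact arguments here are illustrative, not the original snippet):

```python
import torch
import deepspeed
from transformers import BertForSequenceClassification, BertTokenizer

# Load the MoQ-trained checkpoint (path is a placeholder).
model = BertForSequenceClassification.from_pretrained("path/to/moq_checkpoint")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Wrap the model with the DeepSpeed inference engine, requesting the
# int8 kernel path via module injection.
model = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.int8,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("a test sentence", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
```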
This runs into an error complaining that input_mask is missing from the inputs.
I was wondering whether my inference-engine setup is missing something, and what the best practice is for running inference on a MoQ-quantized model. Thanks.
--- Update: this looks like a bug; compute_attention(qkv_out) is missing the input_mask argument. After adding it, the error goes away.
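For reference, the change is roughly the following, paraphrased from memory rather than quoted from the DeepSpeed source:

```python
# Inside DeepSpeed's inference attention path (paraphrased sketch):
# before -- input_mask never reaches the attention computation:
#   context_layer = self.compute_attention(qkv_out)
# after -- thread input_mask through so the masked softmax can use it:
context_layer = self.compute_attention(qkv_out, input_mask)
```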
However, I am not seeing any speedup compared to running a base bert model with the inference engine; latency actually degraded (quantized: 6.62 ms vs. bert-base: 5.18 ms). I wonder whether I should have injected quantized modules somehow. I would appreciate your help, and an e2e tutorial for MoQ + inference would be very helpful for the community.
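A minimal sketch of how such latencies can be measured (the helper name, warmup count, and iteration count are illustrative, not the exact harness used for the numbers above):

```python
import time
import torch

def measure_latency_ms(model, inputs, warmup=10, iters=100):
    """Average forward-pass latency in milliseconds on a CUDA device."""
    with torch.no_grad():
        # Warm up so kernel compilation and caching don't skew the timing.
        for _ in range(warmup):
            model(**inputs)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(**inputs)
        # Kernels launch asynchronously; synchronize before reading the clock.
        torch.cuda.synchronize()
    return (time.time() - start) * 1000.0 / iters
```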
@RezaYazdaniAminabadi