Hi @yash-khurana,
Thank you for your interest in our work. If your target hardware is a GPU or a Jetson device, you may try converting the model to TensorRT and quantizing it to INT8 for optimized inference. For CPU inference, you may explore OpenVINO from Intel.
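For the CPU path, here is a minimal OpenVINO sketch (not verified with EdgeNeXt; the ONNX file name, the 1x3x256x256 input shape, and the openvino>=2022 Python API usage are assumptions):

```python
# Minimal sketch: CPU inference of the exported ONNX model with OpenVINO.
# Assumptions: openvino>=2022.1, the model was exported to "edgenext.onnx",
# and it expects a 1x3x256x256 float32 input.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("edgenext.onnx")            # load the ONNX graph directly
compiled = core.compile_model(model, device_name="CPU")

dummy = np.random.rand(1, 3, 256, 256).astype(np.float32)
logits = compiled([dummy])[compiled.output(0)]      # one forward pass
print(logits.shape)
```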
I have not tried the above with EdgeNeXt, but from my previous experience I believe these optimizations can give you a reasonable speed-up. Do let everyone know if you are able to get any speed-up.
Thank You
Thank you for your response. Last I checked, neither PyTorch nor ONNX provides much support for int8 transformer layers. I might be wrong, though. I would be happy to explore them if you could point me towards a similar model being quantised to int8, or give me a starting point.
Thank You @yash-khurana,
You may use the TensorRT command-line tool `trtexec` to convert the ONNX model into a TRT engine with INT8 precision for inference on NVIDIA devices. Further, have a look at the [Python samples](https://github.com/NVIDIA/TensorRT/tree/main/samples/python/efficientnet) for converting EfficientNet models to TensorRT and running inference. I hope this is helpful.
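For reference, a minimal sketch of that route (assumptions: TensorRT 8.x with `trtexec` on the PATH, `pycuda` installed, and an exported `edgenext.onnx`; it has not been verified with EdgeNeXt):

```python
# Sketch of the TensorRT route (not verified with EdgeNeXt).
# Assumptions: TensorRT 8.x with trtexec on the PATH, pycuda installed,
# ONNX exported to "edgenext.onnx", input shape 1x3x256x256.
import subprocess

import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

# 1) Build an INT8 engine from the ONNX model with trtexec.
#    "--int8" alone uses placeholder scales (fine for timing); add a
#    calibration cache for accuracy-preserving INT8 deployment.
subprocess.run(
    ["trtexec", "--onnx=edgenext.onnx", "--int8",
     "--saveEngine=edgenext_int8.trt"],
    check=True,
)

# 2) Deserialize the engine and run a single inference (TensorRT 8.x bindings API).
logger = trt.Logger(trt.Logger.WARNING)
with open("edgenext_int8.trt", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

host_in = np.random.rand(1, 3, 256, 256).astype(np.float32)
host_out = np.empty(trt.volume(engine.get_binding_shape(1)), dtype=np.float32)
d_in = cuda.mem_alloc(host_in.nbytes)
d_out = cuda.mem_alloc(host_out.nbytes)

cuda.memcpy_htod(d_in, host_in)
context.execute_v2([int(d_in), int(d_out)])  # binding 0 = input, 1 = output
cuda.memcpy_dtoh(host_out, d_out)
print(host_out[:5])  # first few logits
```

Note that INT8 without calibration only tells you about speed; for accuracy-preserving deployment you would also supply calibration data when building the engine.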
Further, we are planning to release one of our incremental works on efficient models for edge devices in a couple of weeks, and we plan to provide detailed instructions for inference on NVIDIA devices and iPhone. Stay tuned! Thanks
@mmaaz60 Can you please provide the code for converting this model to int8? It converts to ONNX successfully and I am able to run inference with ONNX Runtime. However, is there any way to decrease the inference time further, e.g. int8 quantisation or pruning? I'm using edgenext_xx_small_bn_hs. Thanks a lot! Love your work!
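One low-effort option to try directly on the exported model is ONNX Runtime's post-training dynamic quantization; a minimal, unverified sketch (file names are illustrative, and its accuracy/speed impact on edgenext_xx_small_bn_hs is untested):

```python
# Minimal sketch: ONNX Runtime post-training dynamic quantization (weights to INT8).
# File names are illustrative; accuracy/latency on edgenext_xx_small_bn_hs is untested.
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    "edgenext_xx_small_bn_hs.onnx",       # FP32 model exported from PyTorch
    "edgenext_xx_small_bn_hs.int8.onnx",  # quantized output model
    weight_type=QuantType.QInt8,
)

# Quick smoke test of the quantized model on CPU.
sess = ort.InferenceSession("edgenext_xx_small_bn_hs.int8.onnx",
                            providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
dummy = np.random.rand(1, 3, 256, 256).astype(np.float32)
print(sess.run(None, {input_name: dummy})[0].shape)
```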