Closed: aoom closed this issue 1 month ago
Thanks for your suggestions. In our experience, GPUs are not well-suited for LUTs due to their limited on-chip memory per core: placing a LUT in shared memory can lead to slow random access because of bank conflicts. However, using the CPU, GPU, and NPU in concert is still a viable solution, with the GPU/NPU running a dequantization-based method and the CPU running T-MAC. We are exploring this possibility.
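For readers unfamiliar with the LUT approach, here is a simplified Python sketch of the core idea (not T-MAC's actual implementation, which operates on packed low-bit weights with vectorized table lookups): for each group of `g` activations, precompute all `2^g` partial sums, then replace multiply-accumulates with table lookups indexed by the packed weight bits. The function name `lut_dot` and the binary {0,1} weight encoding are illustrative assumptions.

```python
import random

def lut_dot(w_bits, x, g=4):
    """Dot product of binary {0,1} weights with float activations,
    computed via table lookups instead of multiply-accumulates.
    Illustrative sketch of the LUT idea, not T-MAC's real kernel."""
    assert len(w_bits) == len(x) and len(x) % g == 0
    total = 0.0
    for start in range(0, len(x), g):
        xs = x[start:start + g]
        # Precompute all 2^g partial sums of this activation group;
        # in practice the table is reused across many weight rows,
        # amortizing the precomputation cost.
        table = [sum(xs[j] for j in range(g) if (p >> j) & 1)
                 for p in range(1 << g)]
        # Pack the g weight bits into a table index and look up.
        idx = sum(int(w_bits[start + j]) << j for j in range(g))
        total += table[idx]
    return total

# Sanity check against a plain dot product.
random.seed(0)
x = [random.random() for _ in range(16)]
w = [random.randint(0, 1) for _ in range(16)]
assert abs(lut_dot(w, x) - sum(wi * xi for wi, xi in zip(w, x))) < 1e-9
```

On a GPU, `table` would typically live in shared memory, and the random `table[idx]` accesses from threads in a warp are exactly where bank conflicts arise.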
If the kernel supported both CPU- and GPU-accelerated inference, and additionally distributed computing, this would instantly become one of the fastest and most useful inference engines!