microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table
MIT License

Thanks for the excellent work. What happens when the CPU and GPU (or other accelerators) perform inference operations at the same time? #20

Closed aoom closed 1 month ago

aoom commented 3 months ago

If the kernel were enhanced to support both CPU- and GPU-accelerated inference, and additionally distributed computing, this would instantly become one of the fastest and most useful inference engines!

kaleid-liner commented 3 months ago

Thanks for your suggestions. From our insights, GPUs are not well-suited for LUT due to their limited on-chip memory per core. Placing a LUT in shared memory can lead to slow random access due to bank conflicts. However, it is still a viable solution to use the CPU/GPU/NPU in concert, with the GPU/NPU using a dequantization-based method and the CPU using T-MAC. We are exploring this possibility.
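To make the discussion concrete, here is a minimal sketch of the lookup-table idea behind T-MAC, heavily simplified and not the project's actual kernel: for 1-bit {-1, +1} weights, every group of `g` activations has only `2**g` possible partial sums, so they can be precomputed once into a table and the per-row inner loop becomes table lookups instead of multiplications. The function name and the NumPy formulation are illustrative assumptions.

```python
import numpy as np

def lut_gemv_1bit(weights_bits, activations, g=4):
    """Hypothetical LUT-based GEMV sketch (not the real T-MAC kernel).

    weights_bits: (n, k) array of {0, 1}, interpreted as {-1, +1} weights.
    activations:  (k,) float activations.
    For each group of g activations, precompute partial sums for all 2**g
    sign patterns; each output row then performs one lookup per group.
    """
    n, k = weights_bits.shape
    assert k % g == 0, "k must be a multiple of the group size g"
    out = np.zeros(n, dtype=np.float32)
    for group in range(k // g):
        a = activations[group * g:(group + 1) * g]
        # Build the lookup table: one partial sum per sign pattern.
        lut = np.empty(2 ** g, dtype=np.float32)
        for pattern in range(2 ** g):
            signs = np.array([1.0 if (pattern >> b) & 1 else -1.0
                              for b in range(g)])
            lut[pattern] = float(signs @ a)
        # Pack each row's g weight bits into a table index, then look up.
        bits = weights_bits[:, group * g:(group + 1) * g]
        idx = (bits * (1 << np.arange(g))).sum(axis=1)
        out += lut[idx]
    return out
```

On a CPU the table fits comfortably in cache and is accessed via fast vectorized lookups; the bank-conflict concern above is about mapping these random per-lane lookups onto GPU shared memory.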