pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

Can't quantize to int4 and can't compile on RTX2080Ti #124

Closed: kaizizzzzzz closed this 2 months ago

kaizizzzzzz commented 4 months ago

I have tried gpt-fast on an RTX 2080 Ti.

I could run it with int8 quantization and without compilation.

However, it seems I can't do int4 quantization or compilation on the RTX 2080 Ti.

For int4 quantization it shows:

RuntimeError: CUDA error: named symbol not found
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

For compilation, it seems the kernels are compiled to use bf16, and the RTX 2080 Ti does not support bf16 computation, so that may be the cause.

But is it possible to get int4 quantization working on the RTX 2080 Ti?

msaroufim commented 2 months ago

The 2080 is Turing, which unfortunately does not have int4 support: https://en.wikipedia.org/wiki/Turing_(microarchitecture)
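If it helps, here's a quick way to check what your card reports before trying int4 (a sketch; as I understand it, gpt-fast's int4 path goes through torch.ops.aten._weight_int4pack_mm, which only ships kernels for sm_80 and newer):

```python
import torch

# Sketch: check the GPU's compute capability before attempting int4.
major, minor = torch.cuda.get_device_capability()
print(f"Detected compute capability: sm_{major}{minor}")
if (major, minor) < (8, 0):
    # The RTX 2080 Ti (Turing) reports sm_75, below the sm_80 cutoff.
    print("Pre-Ampere GPU: the int4 kernel path is unavailable here")
```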

I come to this Wikipedia page often to see what's supported where

I'd suggest replacing all the places in the code that say bf16 with fp16; that should work just fine: https://en.wikipedia.org/wiki/CUDA
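A minimal sketch of that fallback, assuming the dtype is picked in one spot (the variable name `precision` here is illustrative, not necessarily what generate.py uses):

```python
import torch

# Hypothetical sketch: use fp16 wherever gpt-fast would otherwise pick bf16.
# torch.cuda.is_bf16_supported() returns False on Turing cards like the
# RTX 2080 Ti, which only have fp16 tensor cores.
if torch.cuda.is_bf16_supported():
    precision = torch.bfloat16
else:
    precision = torch.float16
```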

kaizizzzzzz commented 2 months ago

Thx!