turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

70B Quant Potential Issue #151

Closed · azureblackprime closed this issue 10 months ago

azureblackprime commented 10 months ago

We have been trying to do a 2.55bpw quant of a model and are seeing something odd in the logs. A couple of us have tried across multiple devices, and it appears the quantizer is choosing 2.17bpw for most of the layers. I ran a separate quant at 4.6bpw and that came out with no issues.

I don't understand all the details of this process, so I'm not sure whether this is user error or expected output. I've included the first ten layers from a couple of the runs, along with one of the measurement files. I have tried a few measurement files, including wikitext, with the same output.

quant log.txt measurement.json

turboderp commented 10 months ago

2.17bpw is the lowest setting for a layer: a mix of 5% 3-bit rows and 95% 2-bit rows at group size 32. With a 4-bit scale parameter per group and some overhead for the row permutation and such, it works out to about 2.17 bits per weight. The next step up from that is 2.38, then 2.63.
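For a rough sanity check of that figure (a back-of-the-envelope estimate only; it ignores the permutation/metadata overhead and assumes the 5%/95% split is exact, which it may not be in the actual format):

```python
# Approximate bits per weight for the lowest per-layer setting described above.
weight_bits = 0.05 * 3 + 0.95 * 2   # 2.05 bits/weight from the quantized values
scale_bits = 4 / 32                 # 0.125 bits/weight from one 4-bit scale per group of 32
print(weight_bits + scale_bits)     # ~2.175, in the same ballpark as the quoted 2.17bpw
```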

The point of the measurement is to determine how sensitive each layer is to quantization errors so that the optimization step can create a plan for quantizing the whole model that minimizes the overall error. As it happens, later layers of the model typically need more bits, as do the MLP layers in general since they're higher effective rank than the attention projections.
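To illustrate the general idea (this is only a minimal sketch, not exllamav2's actual optimizer; the layer names, error values, and the `allocate` helper below are all hypothetical), a budget-constrained allocation can be pictured as: start every layer at the cheapest setting, then greedily spend the remaining bits wherever they reduce the measured error the most.

```python
def allocate(options, target_bpw):
    """Greedy sketch: pick a bpw setting per layer that reduces total measured
    error while keeping the average bpw at or under target_bpw.
    `options[layer]` is a list of (bpw, measured_error) pairs, cheapest first."""
    choice = {layer: 0 for layer in options}   # index into each layer's option list
    n = len(options)

    def avg_bpw(c):
        return sum(options[l][c[l]][0] for l in options) / n

    improved = True
    while improved:
        improved = False
        best, best_gain = None, 0.0
        # Find the single upgrade with the best error reduction per extra bit
        # that still fits under the average-bpw budget.
        for l in options:
            i = choice[l]
            if i + 1 < len(options[l]):
                bpw0, err0 = options[l][i]
                bpw1, err1 = options[l][i + 1]
                trial = dict(choice, **{l: i + 1})
                if avg_bpw(trial) <= target_bpw:
                    gain = (err0 - err1) / (bpw1 - bpw0)
                    if gain > best_gain:
                        best, best_gain = l, gain
        if best is not None:
            choice[best] += 1
            improved = True

    return {l: options[l][choice[l]][0] for l in options}


# Hypothetical measurement data: the MLP layer is more sensitive, so it gets
# upgraded first and ends up with more bits, while the attention projection
# stays near the 2.17bpw floor.
options = {
    "layer.0.attn": [(2.17, 0.030), (2.38, 0.028), (2.63, 0.027)],
    "layer.0.mlp":  [(2.17, 0.090), (2.38, 0.060), (2.63, 0.045)],
}
print(allocate(options, target_bpw=2.55))
```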

So seeing a lot of 2.17bpw layers with a 2.55bpw target is normal and expected. A 2.55bpw target doesn't leave many extra bits over the minimum, so the optimizer applies them very selectively, where they're needed the most.

azureblackprime commented 10 months ago

Thanks, that is helpful.