turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Attempted to quant a custom MoE model, Plap-8x13B, and got an error. #235

Closed: NiriProject closed this issue 1 month ago

NiriProject commented 7 months ago

https://huggingface.co/Undi95/Plap-8x13B

I tried to quant the model above, and no matter what settings I use, it always fails with the following error:

Traceback (most recent call last):
  File "C:\Users\User\Desktop\exllamav2-master\convert.py", line 220, in <module>
    measure_quant(job, save_job, model)
  File "C:\Users\User\miniconda3\envs\death2\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\Desktop\exllamav2-master\conversion\measure.py", line 388, in measure_quant
    m = measure_moe_mlp(module, hidden_states, target_states, quantizers, cache, attn_mask)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\Desktop\exllamav2-master\conversion\measure.py", line 214, in measure_moe_mlp
    quantizers[f"w2.{i}"].prepare()
  File "C:\Users\User\Desktop\exllamav2-master\conversion\adaptivegptq.py", line 225, in prepare
    self.hessian /= self.num_batches
TypeError: unsupported operand type(s) for /=: 'NoneType' and 'int'

While I know supporting every odd model out there isn't feasible, I think the MoE future we're barreling toward needs more flexible support. It seems MoE will be the prevailing approach for quite a while.

turboderp commented 7 months ago

This happens if a matrix in the model sees no calibration data at all during the entire reference forward pass, i.e. all of the 32k tokens in the measurement dataset were routed around one or more specific experts, which is incredibly unlikely if the routing layers are working as they should.
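For illustration, here is a minimal, hypothetical sketch of that failure mode (the class and variable names are invented, not exllamav2's actual code): a GPTQ-style quantizer that only allocates its Hessian accumulator when the first calibration batch arrives. If the router never sends any tokens to a given expert, the accumulator stays `None`, and dividing by `num_batches` in `prepare()` raises exactly the TypeError shown in the traceback.

```python
import torch

class ToyGPTQQuantizer:
    """Hypothetical stand-in for a per-matrix GPTQ quantizer."""

    def __init__(self, columns: int):
        self.columns = columns
        self.hessian = None        # allocated lazily on the first calibration batch
        self.num_batches = 0

    def add_batch(self, inputs: torch.Tensor):
        # inputs: (tokens, columns) activations routed to this expert's matrix
        if self.hessian is None:
            self.hessian = torch.zeros(self.columns, self.columns)
        self.hessian += inputs.T @ inputs
        self.num_batches += 1

    def prepare(self):
        # If no calibration tokens ever reached this expert, hessian is still None,
        # so this raises: TypeError: unsupported operand type(s) for /=: 'NoneType' and 'int'
        self.hessian /= self.num_batches


# Simulate a router that never selects expert 3 out of 4:
experts = [ToyGPTQQuantizer(columns=8) for _ in range(4)]
tokens = torch.randn(32, 8)
routing = torch.tensor([t % 3 for t in range(32)])   # expert 3 receives nothing

for i, quantizer in enumerate(experts):
    selected = tokens[routing == i]
    if selected.numel():
        quantizer.add_batch(selected)

for i, quantizer in enumerate(experts):
    try:
        quantizer.prepare()
    except TypeError as e:
        print(f"expert {i}: {e}")   # reproduces the reported error for the starved expert
```

In a correctly trained MoE, the load-balancing behavior of the routing layers should make a completely starved expert essentially impossible over 32k calibration tokens, which is why this error points at how the model was assembled rather than at the quantizer.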

Sadly it's a bit hard to say more than that without knowing how the model was put together.