turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Cannot load Llama-3 8B Instruct, incompatible function arguments #468

Closed · nickpotafiy closed 1 month ago

nickpotafiy commented 1 month ago
```
TypeError: q_attn_forward_1(): incompatible function arguments. The following argument types are supported:
    1. (arg0: int, arg1: torch.Tensor, arg2: int, arg3: int, arg4: int, arg5: torch.Tensor, arg6: torch.Tensor, arg7: torch.Tensor, arg8: torch.Tensor, arg9: torch.Tensor, arg10: torch.Tensor, arg11: list[int], arg12: torch.Tensor) -> None
```

Hey @turboderp, the latest version does not load a non-quantized model. Possibly q_handle being None does not sit well with that function call; specifying 0 avoids this error, but the forward call still fails. I could dig into the issue, but you could probably fix it quicker.
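
For reference, a minimal sketch of the failing setup, following the library's dynamic generator examples (the model path is a placeholder):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Unquantized (FP16) Llama-3 8B Instruct checkpoint; path is a placeholder
config = ExLlamaV2Config("/models/Meta-Llama-3-8B-Instruct")
model = ExLlamaV2(config)

cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model = model, cache = cache, tokenizer = tokenizer)

# On v0.1.0 this raises the q_attn_forward_1 TypeError above: the FP16
# attention layers have no q_handle for the fused quantized kernel
print(generator.generate(prompt = "Hello, my name is", max_new_tokens = 32))
```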

turboderp commented 1 month ago

This is because quantized and unquantized models use different methods for attention, and I hadn't included a paged attention method for unquantized models in v0.1.0. It's in the dev branch now, so the dynamic generator should work with FP16 models too. I'll release v0.1.1 soon to fix some other incoming issues as well.
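
To make the dispatch concrete: the C++ binding types its first argument as an int handle, so a None from an FP16 layer is rejected at the binding layer before any kernel runs. A toy sketch of the pattern (illustrative only, not the library's actual source; names mirror the traceback):

```python
import torch

def q_attn_forward_1(q_handle: int, hidden: torch.Tensor) -> None:
    # Stand-in for the pybind11-bound C++ function: the binding accepts
    # only an int handle, so a None fails argument matching with
    # "incompatible function arguments" before anything executes
    if not isinstance(q_handle, int):
        raise TypeError("q_attn_forward_1(): incompatible function arguments")

def attn_forward(q_handle, hidden):
    if q_handle is not None:
        # Quantized path: fused kernel addressed by the quantized-weight handle
        q_attn_forward_1(q_handle, hidden)
    else:
        # FP16 path: plain torch attention; the paged variant of this
        # branch is what v0.1.0's dynamic generator was missing
        q = k = v = hidden[None, None]
        torch.nn.functional.scaled_dot_product_attention(q, k, v)

attn_forward(None, torch.randn(8, 64))  # FP16 case takes the torch path
```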

nickpotafiy commented 1 month ago

Thanks!

turboderp commented 1 month ago

I released v0.1.1 now, which should support FP16 models in the new generator.
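
After upgrading the package, a quick way to confirm the installed version picks up the fix (assuming a standard pip install):

```python
from importlib.metadata import version

# Expect 0.1.1 or later after `pip install -U exllamav2`
print(version("exllamav2"))
```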