turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Add support for return_logits, return_ids, return_prompt toggles in base generator #402

Closed aliencaocao closed 2 months ago

aliencaocao commented 2 months ago

One issue left to resolve: logits_hist always has an extra token 28705 while decode_ids doesn't, so the length of the logits is one more than the length of the decoded IDs. This only happens with return_prompt = False, since there won't be logits for the prompt anyway.
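To make the mismatch concrete, here is a minimal sketch (names follow this comment; the shapes are just what the described off-by-one would look like, not output from the actual code):

```python
# Hypothetical illustration: with return_prompt=False, logits_hist carries one
# more step than decode_ids, the stray entry corresponding to token 28705.
n_generated = len(decode_ids)        # N sampled tokens
n_logit_steps = len(logits_hist)     # N + 1 logit rows
assert n_logit_steps == n_generated + 1  # the discrepancy being reported
```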

@turboderp any idea why?

aliencaocao commented 2 months ago

Wow, I just saw https://github.com/turboderp/exllamav2/commit/6b14f8a04a4646fd13d75873ab27b04f6c17b5a6. I guess it would be better to merge our changes.

aliencaocao commented 2 months ago

Hm, what about the part that returns the logits and IDs? Do you want me to draft a new PR for those?

turboderp commented 2 months ago

Returning logits and IDs is fine, feel free to open a PR. Be aware that the first sampled token will be the healed token if token healing is enabled, but the text returned for that token is only the difference between the original and the healed token.
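To illustrate the token-healing behaviour described above (the strings here are invented for illustration, not taken from the library):

```python
# Illustration only: with token healing, the last partial piece of the prompt is
# rolled back and re-sampled as part of the first generated token, but the text
# emitted for that token is only the part not already present in the prompt.
prompt_tail = "P"            # trailing partial piece of the prompt
healed_token_text = "Paris"  # first sampled token after healing
emitted_text = healed_token_text[len(prompt_tail):]  # -> "aris"
```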

Returning logits for the completion isn't too demanding; they can be concatenated in system RAM. But returning full logits for the prompt requires the model to be loaded with config.max_output_len = config.max_input_len, which for some models requires an unrealistic amount of VRAM. It also requires the prefill pass to be called with preprocess_only = False, otherwise the output layer is skipped.
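A minimal sketch of the loading constraint described above (standard exllamav2 loading boilerplate; the model path is a placeholder, and the two settings named in this comment are the only point of the example):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"   # placeholder
config.prepare()

# To get logits for every prompt position, the output buffer must cover the
# whole input chunk; this is what becomes unrealistic for some models' VRAM.
config.max_output_len = config.max_input_len

model = ExLlamaV2(config)
model.load()
cache = ExLlamaV2Cache(model)
tokenizer = ExLlamaV2Tokenizer(config)

# The prefill pass would then also need to run with preprocess_only=False so
# the output layer is evaluated instead of skipped.
```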

It can't make it into 0.0.18 at any rate, because if I don't release now it'll be delayed for several days.