Wow, just saw https://github.com/turboderp/exllamav2/commit/6b14f8a04a4646fd13d75873ab27b04f6c17b5a6. I guess it would be better to merge our changes.
Hm, what about the part that returns logits and IDs? Do you want me to draft a new PR for those?
Returning logits and IDs is fine, feel free to open a PR. Be aware that the first sampled token will be the healed token if token healing is enabled, but the text returned for that token is only the difference between the original and the healed token.
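For reference, a minimal sketch of what that implies for the returned text. The helper below is hypothetical, not exllamav2 API; `decode` stands in for a generic single-token decode:

```python
# Hedged sketch of the token-healing behavior described above: when token
# healing is enabled, the first sampled token replaces a truncated last
# prompt token, and only the new suffix is returned as text.
def healed_token_text(tokenizer, original_id: int, healed_id: int) -> str:
    original = tokenizer.decode([original_id])  # e.g. " th"
    healed = tokenizer.decode([healed_id])      # e.g. " the"
    # The healed token extends the truncated original token, so only the
    # extra characters count as new output text.
    return healed[len(original):]               # e.g. "e"
```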
Returning logits for the completion isn't too demanding; they can be concatenated in system RAM. But returning full logits for the prompt requires the model to be loaded with `config.max_output_len = config.max_input_len`, which for some models requires an unrealistic amount of VRAM. It also requires the prefill pass to be called with `preprocess_only = False`, otherwise the output layer is skipped.
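Roughly what that setup looks like (a sketch assembled from the points above; the model path is a placeholder and the exact `forward` signature is an assumption):

```python
# Hedged sketch: loading the model so a forward pass over the prompt can
# return logits for every prompt position, per the constraints above.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder path
config.prepare()

# Without this, only the last max_output_len positions get logits, so
# full prompt logits are unavailable. Costs a lot of VRAM for big contexts.
config.max_output_len = config.max_input_len

model = ExLlamaV2(config)
model.load()
cache = ExLlamaV2Cache(model)
tokenizer = ExLlamaV2Tokenizer(config)

ids = tokenizer.encode("The quick brown fox")
# preprocess_only = False keeps the output layer in the prefill pass;
# with preprocess_only = True it is skipped and no logits are produced.
logits = model.forward(ids, cache, preprocess_only = False)
```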
It can't make it into 0.0.18 at any rate; if I don't release now, the release will be delayed by several days.
One issue left to resolve: `logits_hist` always has an extra token, `28705`, while `decode_ids` doesn't, so the length of the logits is one more than that of the decoded IDs. This only happens with `return_prompt = False`, since there are no logits for the prompt anyway. @turboderp any idea why?
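A minimal repro of the mismatch (the flags and the returned tuple layout mirror this PR's proposal and are assumptions, not the released API):

```python
# Hypothetical repro of the off-by-one described above.
text, decode_ids, logits_hist = generator.generate_simple(
    prompt, settings, num_tokens,
    return_logits = True, return_ids = True, return_prompt = False)

print(decode_ids.shape[-1], logits_hist.shape[-2])
# Observed: logits_hist has one extra entry (for token 28705) that
# decode_ids lacks, so len(logits) == len(ids) + 1.
```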