turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Return token probabilities in generator.stream() #238

Closed ivsanro1 closed 6 months ago

ivsanro1 commented 7 months ago

This PR adds the probabilities of the sampled tokens to the return tuple of generator.stream().

TL;DR: I thought it would be useful to add this to exllamav2 in general, since it can serve several purposes and the probabilities are already available during sampling.

Background

I actually need this to:

Another possible use case is to print each token with a color/background that depends on its probability. This is especially useful for analysis and debugging (e.g. like in the OpenAI playground).

Something like this:

[Screen recording: Peek 2023-12-21 13-21]
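A minimal sketch of how that coloring could be done (not part of this PR): it maps a probability to a red-to-green background on the ANSI 256-color cube, and the (piece, probability) pairs are made-up placeholders standing in for values the generator would return.

# Hypothetical illustration only: color a token's background by its probability
# using ANSI 256-color escape codes (low probability = red, high = green).

def colored_token(piece: str, prob: float) -> str:
    level = int(prob * 5)               # quantize probability to 0..5
    r, g = 5 - level, level             # red fades out as green fades in
    bg = 16 + 36 * r + 6 * g            # index into the 6x6x6 ANSI color cube
    return f"\x1b[48;5;{bg}m\x1b[30m{piece}\x1b[0m"   # colored bg, black fg, reset

if __name__ == "__main__":
    # Made-up (piece, probability) pairs for demonstration
    sample = [(" lived", 0.33), (" a", 1.00), (" young", 0.56), (" girl", 0.65)]
    print("".join(colored_token(p, pr) for p, pr in sample))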

turboderp commented 7 months ago

I don't mind the function returning probabilities, but it really needs to be optional. As it is, this would break all other software currently using the streaming generator.

ivsanro1 commented 6 months ago

Thanks @turboderp for the feedback, that's a good point. Now it's optional and disabled by default.

Please tell me if you'd like me to implement it some other way that fits exllamav2's code style better.
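For clarity, here is a minimal sketch of the opt-in behavior (assumed from the discussion above, not verbatim from the PR): existing callers keep the current three-value return, and setting the flag adds the probabilities as a fourth value.

# Assumed usage, for illustration:

generator.return_probabilities = False           # default: existing behavior unchanged
chunk, eos, tokens = generator.stream()

generator.return_probabilities = True            # opt in to per-token probabilities
chunk, eos, tokens, probs = generator.stream()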

turboderp commented 6 months ago

Sorry for taking a while to get to this. I had to change the logic a little bit and return a tensor of probabilities instead, because the streaming function doesn't always return exactly one token:

So the number of probs returned should always equal the number of tokens returned. You would use it something like this:

generator.return_probabilities = True

...  # model, tokenizer and generator setup elided

id_to_piece = tokenizer.get_id_to_piece_list()  # Get the cleaned tokenizer vocab

generated_tokens = 0
while True:

    chunk, eos, tokens, probs = generator.stream()
    generated_tokens += 1  # One token is always generated, even if it's held in the generator

    for i in range(tokens.shape[-1]):
        token = tokens[:, i].item()   # Token ID
        prob = probs[:, i].item()     # Probability (at the final sampling stage)
        piece = id_to_piece[token]    # Token piece
        print(f"{prob:8.5f} - {repr(piece)}")

    if eos or generated_tokens == max_new_tokens: break

And the output would be e.g.:

 0.33173 - ' lived'
 1.00000 - ' a'
 0.55938 - ' young'
 0.64523 - ' girl'
 1.00000 - ' named'
 0.06762 - ' E'
 0.79568 - 'lean'
 1.00000 - 'or'
 0.96364 - '.'
 0.71853 - ' E'
 1.00000 - 'lean'
 1.00000 - 'or'
 0.80822 - ' was'
 0.28309 - ' an'
 0.47551 - ' intelligent'
 0.97350 - ' and'
 1.00000 - ' curious'
 0.96953 - ' child'
 0.82568 - ' who'
 0.98809 - ' loved'
 0.88151 - ' nothing'
 1.00000 - ' more'
 1.00000 - ' than'