turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Return token probabilities in generator.stream() #238

Closed ivsanro1 closed 6 months ago

ivsanro1 commented 7 months ago

This PR adds the probabilities of the sampled tokens to the return tuple of generator.stream().

TL;DR: I thought it would be useful to add this to exllamav2 in general, since it can serve several purposes and the probabilities are already available during sampling.

Background

I actually need this to:

Another possible use case is to print each token with a color/background that depends on its probability. This is especially useful for analysis and debugging (e.g. like in the OpenAI playground).

Something like this:

[Screen recording: Peek 2023-12-21 13-21]
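A minimal sketch of how that coloring could be done (not part of this PR): it maps a probability to a red-to-green background on the ANSI 256-color cube, and the (piece, probability) pairs are made-up placeholders standing in for values the generator would return.

# Hypothetical illustration only: color a token's background by its probability
# using ANSI 256-color escape codes (low probability = red, high = green).

def colored_token(piece: str, prob: float) -> str:
    level = int(prob * 5)               # quantize probability to 0..5
    r, g = 5 - level, level             # red fades out as green fades in
    bg = 16 + 36 * r + 6 * g            # index into the 6x6x6 ANSI color cube
    return f"\x1b[48;5;{bg}m\x1b[30m{piece}\x1b[0m"   # colored bg, black fg, reset

if __name__ == "__main__":
    # Made-up (piece, probability) pairs for demonstration
    sample = [(" lived", 0.33), (" a", 1.00), (" young", 0.56), (" girl", 0.65)]
    print("".join(colored_token(p, pr) for p, pr in sample))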

turboderp commented 7 months ago

I don't mind the function returning probabilities, but it really needs to be optional. As it is, this would break all other software currently using the streaming generator.

ivsanro1 commented 6 months ago

Thanks @turboderp for the feedback, that's a good point. Now it's optional and disabled by default.

Please tell me if you'd like me to implement it some other way that fits exllamav2's code style better.
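For clarity, here is a minimal sketch of the opt-in behavior (assumed from the discussion above, not verbatim from the PR): existing callers keep the current three-value return, and setting the flag adds the probabilities as a fourth value.

# Assumed usage, for illustration:

generator.return_probabilities = False           # default: existing behavior unchanged
chunk, eos, tokens = generator.stream()

generator.return_probabilities = True            # opt in to per-token probabilities
chunk, eos, tokens, probs = generator.stream()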

turboderp commented 6 months ago

Sorry for taking a while to get to this. I had to change the logic a little bit and return a tensor of probabilities instead, because the streaming function doesn't always return exactly one token:

So the number of probs returned should always equal the number of tokens returned. You would use it something like this:

generator.return_probabilities = True

...  # model, tokenizer and generator setup elided

id_to_piece = tokenizer.get_id_to_piece_list()  # Get the cleaned tokenizer vocab

generated_tokens = 0
while True:

    chunk, eos, tokens, probs = generator.stream()
    generated_tokens += 1  # One token is always generated, even if it's held in the generator

    for i in range(tokens.shape[-1]):
        token = tokens[:, i].item()   # Token ID
        prob = probs[:, i].item()     # Probability (at the final sampling stage)
        piece = id_to_piece[token]    # Token piece
        print(f"{prob:8.5f} - {repr(piece)}")

    if eos or generated_tokens == max_new_tokens: break

And the output would be e.g.:

 0.33173 - ' lived'
 1.00000 - ' a'
 0.55938 - ' young'
 0.64523 - ' girl'
 1.00000 - ' named'
 0.06762 - ' E'
 0.79568 - 'lean'
 1.00000 - 'or'
 0.96364 - '.'
 0.71853 - ' E'
 1.00000 - 'lean'
 1.00000 - 'or'
 0.80822 - ' was'
 0.28309 - ' an'
 0.47551 - ' intelligent'
 0.97350 - ' and'
 1.00000 - ' curious'
 0.96953 - ' child'
 0.82568 - ' who'
 0.98809 - ' loved'
 0.88151 - ' nothing'
 1.00000 - ' more'
 1.00000 - ' than'