turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[BUG] `generator.iterate()` returns corrupted result objects in some cases #689

Open p-e-w opened 4 days ago

p-e-w commented 4 days ago

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Pytorch version

2.5.0

Model

No response

Describe the bug

This is an actual result object I received from generator.iterate():

{
  "job": ExLlamaV2DynamicJob 941,
  "stage": "streaming",
  "eos": False,
  "serial": 941,
  "identifier": 941,
}

As you can see, eos is False, but the fields text, token_ids, etc. are all missing. This "result" object therefore contains no information at all. I have not yet been able to determine whether tokens are missing from the output because of this, or whether it is a non-event that gets emitted for some reason.

This is a VERY rare occurrence. I have to start hundreds of jobs and generate between 10,000 and 50,000 tokens in order for this to happen. Perhaps a race condition somewhere in the threading logic?

Reproduction steps

Any use of the dynamic generator appears to trigger this eventually, provided enough jobs are started and enough tokens are generated. Just place a check like

if result["stage"] == "streaming":
    assert result["eos"] or ("text" in result)

in the loop and eventually you will get an error.
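
For reference, a loop along these lines should eventually trigger the assertion (sketch only: the model path and prompts are placeholders, and the setup follows the repo's dynamic generator examples):

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob

# Standard dynamic generator setup, as in the examples
config = ExLlamaV2Config("/path/to/model")  # placeholder path
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Enqueue enough jobs that tens of thousands of tokens are generated in total
for i in range(200):
    generator.enqueue(ExLlamaV2DynamicJob(
        input_ids=tokenizer.encode(f"Prompt {i}"),
        max_new_tokens=256,
    ))

while generator.num_remaining_jobs():
    for result in generator.iterate():
        if result["stage"] == "streaming":
            # Fails eventually: a streaming result with eos == False and no "text" field
            assert result["eos"] or ("text" in result)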

Expected behavior

If EOS has not been reached, results should contain token data.

Logs

No response

Additional context

No response


turboderp commented 1 day ago

Not every iteration will emit text data. Every iteration processes one token per active job, but one token doesn't always result in text output. This can happen if there's a partial match to a banned string or stop string, which will hold any text until the condition is resolved. There are also plenty of characters that are encoded as multiple tokens, in which case you may only get text on every 2nd or 4th iteration or whatever.
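
Here's a quick way to see the multi-token-character case with plain Python (just an illustration, nothing exllamav2-specific): decoding a partial UTF-8 sequence yields no usable text until the final byte arrives.

# "€" is three bytes in UTF-8; a byte-level tokenizer can split it across
# several tokens, so the intermediate iterations have nothing to emit
full = "€".encode("utf-8")      # b'\xe2\x82\xac'
partial = full[:2]              # state after only some of the tokens

print(partial.decode("utf-8", errors="ignore"))  # ""  -> no text this iteration
print(full.decode("utf-8"))                      # "€" -> text appears on a later iteration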

I would simply use result.get("text", "") if you want to treat an iteration that produces no new text as an empty string.
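
In a streaming loop, that looks roughly like this (sketch):

while generator.num_remaining_jobs():
    for result in generator.iterate():
        if result["stage"] != "streaming":
            continue
        # Empty string on iterations that held text or produced a partial character
        print(result.get("text", ""), end="", flush=True)
        if result["eos"]:
            print()  # job finished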

p-e-w commented 1 day ago

But then why is the result object emitted at all? AFAICT, such objects contain no actionable information. It's not even possible to tell which of the conditions you describe has occurred. What is the user supposed to do with such a "result"?

turboderp commented 1 day ago

It's a design choice.

The generator performs one iteration per call to iterate(), which is one forward pass through the model with a sampling step at the end. Some iterations produce one or more characters of text, and some result in less than one character.

It's technically possible to return less than one character (as a byte array containing an incomplete UTF-8 sequence), but I don't think the Tokenizers library allows for that in a clean way, and having to deal with it would complicate client code considerably.

The other option is for iterate() to run an arbitrary number of iterations until there is decoded text to emit. This would complicate the control flow a bit, though, and make timing less predictable from the client's perspective. It's also not clear how to deal with the case where you have two or more concurrent jobs, and one of them samples the first two bytes of a three-byte emoji. Should all the other jobs stall while the one with the incomplete character runs a single pass at bsz 1? Or should they all sample one more token in that case? What if two jobs get out of phase when generating strings of Chinese characters that are two tokens each? This could lead to long stretches of no output at all.

Alternatively, an iteration could simply not return a result for a job that doesn't produce output, as opposed to returning an empty result. To me these seem roughly equivalent, but I went with the latter because it at least is a way for the client to confirm that a job is running and making progress, even when that progress hasn't yet produced a decodable string.
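
For instance, a client could use even an empty result to drive a per-job watchdog (hypothetical client code; last_activity is not part of the library):

import time

last_activity = {}  # job serial -> time of the most recent result seen

for result in generator.iterate():
    if result["stage"] == "streaming":
        # The job made progress this iteration, even if it emitted no text
        last_activity[result["serial"]] = time.time()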