Open · p-e-w opened this issue 4 days ago
Not every iteration will emit text data. Every iteration processes one token per active job, but one token doesn't always result in text output. This can happen if there's a partial match to a banned string or stop string, which will hold any text until the condition is resolved. There are also plenty of characters that are encoded as multiple tokens, in which case you may only get text on every 2nd or 4th iteration or whatever.
I would simply use `result.get("text", "")` if you want to interpret zero tokens as an empty string.
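For illustration, a minimal consumption loop along those lines (only `iterate()` and the `text`/`eos` fields come from this discussion; the surrounding loop structure, including `num_remaining_jobs()`, is a sketch):

```python
# Sketch of a streaming loop over the dynamic generator. Only iterate(),
# "text" and "eos" are taken from the discussion above; the rest is assumed.
while generator.num_remaining_jobs():
    for result in generator.iterate():
        # Not every iteration yields decodable text (partial stop-string
        # match, multi-byte character split across tokens, etc.).
        text = result.get("text", "")
        if text:
            print(text, end="", flush=True)
        if result.get("eos", False):
            print()  # this job has finished
```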
But then why is the result object emitted at all? AFAICT, such objects contain no actionable information. It's not even possible to tell which of the conditions you describe has occurred. What is the user supposed to do with such a "result"?
It's a design choice.
The generator performs one iteration per call to `iterate()`, which is one forward pass through the model with a sampling step at the end. Some iterations produce one or more characters of text, and some result in less than one character.
It's technically possible to return less than one character (as a byte array containing an incomplete UTF-8 sequence), but I don't think the Tokenizers library allows for it in a clean way, and having to deal with that would complicate client code a lot.
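To illustrate the incomplete-UTF-8 point in plain Python (no generator involved; the example character is arbitrary):

```python
# The first bytes of a multi-byte character cannot be decoded on their own.
smiley = "☺".encode("utf-8")   # b'\xe2\x98\xba', a three-byte character
partial = smiley[:2]           # what a job might hold after one sampled token
try:
    partial.decode("utf-8")
except UnicodeDecodeError:
    # Nothing printable can be emitted until the final byte arrives.
    print("incomplete UTF-8 sequence, no text yet")
```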
The other option is for `iterate()` to run an arbitrary number of iterations until there is decoded text to emit. This would complicate the control flow a bit, though, and make timing less predictable from the client's perspective. It's also not clear how to deal with the case where you have two or more concurrent jobs, and one of them samples the first two bytes of a three-byte emoji. Should all the other jobs stall while the one with the incomplete character runs a single pass at bsz 1? Or should they all sample one more token in that case? What if two jobs get out of phase when generating strings of Chinese characters that are two tokens each? That could lead to long stretches with no output at all.
Alternatively, an iteration could simply not return a result for a job that doesn't produce output, as opposed to returning an empty result. To me these seem roughly equivalent, but I went with the latter because it at least gives the client a way to confirm that a job is running and making progress, even when that progress hasn't yet produced a decodable string.
OS: Linux
GPU Library: CUDA 12.x
Python version: 3.12
PyTorch version: 2.5.0
Model: No response
Describe the bug
This is an actual result object I received from `generator.iterate()`:

As you can see, `eos` is `False`, but the fields `text`, `token_ids`, etc. are all missing. Thus this "result" object contains no information at all. I have not yet been able to determine whether this causes tokens to go missing from the output, or whether it is a non-event that gets emitted for some reason.

This is a VERY rare occurrence. I have to start hundreds of jobs and generate between 10,000 and 50,000 tokens in order for this to happen. Perhaps a race condition somewhere in the threading logic?
Reproduction steps
Any use of the dynamic generator appears to trigger this eventually, provided enough jobs are started and enough tokens are generated. Just place a check like the one sketched below in the loop and eventually you will get an error.
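A minimal version of that check, assuming results are plain dicts with the fields named above (a sketch, not the exact code I ran):

```python
for result in generator.iterate():
    # A job that hasn't reached EOS should carry token data with it.
    if not result.get("eos", False) and "token_ids" not in result:
        raise RuntimeError(f"Empty result for a running job: {result}")
```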
Expected behavior
If EOS has not been reached, results should contain token data.
Logs: No response
Additional context: No response