Open zcbenz opened 1 month ago
Thanks for flagging. Indeed, the way we do streaming decode in the T5 example is not correct for most tokenizers (you typically can't decode each new token individually, as we do here). We should either use a proper streaming decoder or just eat the quadratic cost and re-decode the entire prefix at each step.
Will mark this as a bug, should be a fairly simple fix.
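For illustration, here is a minimal sketch of the second option (re-decoding the whole prefix on every step and printing only the newly added text). It is not the actual fix to the example: `toy_decode` is a hypothetical stand-in for a SentencePiece-style `tokenizer.decode`, and the token strings are made up. It also assumes that each decoded prefix extends the previous one, which may not hold for every tokenizer.

```python
from typing import Callable, Iterable, Iterator, List


def toy_decode(tokens: List[str]) -> str:
    """Hypothetical stand-in for a SentencePiece-style decode:
    '▁' marks a word boundary, and the leading space is stripped."""
    return "".join(tokens).replace("▁", " ").lstrip(" ")


def stream_per_token(
    tokens: Iterable[str], decode: Callable[[List[str]], str]
) -> Iterator[str]:
    """The incorrect approach: decode every new token on its own."""
    for tok in tokens:
        yield decode([tok])


def stream_redecode_prefix(
    tokens: Iterable[str], decode: Callable[[List[str]], str]
) -> Iterator[str]:
    """Re-decode the full prefix each step and emit only the new suffix.
    Quadratic in sequence length. Assumes each decoded prefix starts
    with the previously decoded text."""
    prefix: List[str] = []
    emitted = ""
    for tok in tokens:
        prefix.append(tok)
        text = decode(prefix)
        yield text[len(emitted):]
        emitted = text


tokens = ["▁Hello", "▁wor", "ld", "▁!"]  # made-up token pieces
naive = "".join(stream_per_token(tokens, toy_decode))
fixed = "".join(stream_redecode_prefix(tokens, toy_decode))
print(repr(naive))
print(repr(fixed))
```

In this toy setup the per-token version loses the spaces between words, while the prefix version reproduces what a single full decode gives.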
The `hf_t5.py` example can produce correct output with changes:
It seems that the tokenizer does not work well with streaming decoding.