huggingface / optimum-tpu

Google TPU optimizations for transformers models
Apache License 2.0

Text-generation-inference (TPU) container fixes #65

Open Michellehbn opened 1 week ago

Michellehbn commented 1 week ago

As part of supporting TPU in Inference Endpoints, and for a better user experience:

cc @tengomucho @mfuntowicz

tengomucho commented 6 days ago

This can be separated in several smaller tasks. I'll list them here to follow up progress.

I have now fixed the health issue. The problem was an incorrect CachedBatch serialization. Progress is in the branch debug-tgi-ie.
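A serialization bug like this is typically caught by a round-trip check: serialize the batch, deserialize it, and verify every field survives. The sketch below uses a hypothetical dataclass stand-in for the real CachedBatch message (which in TGI is a protobuf type), so the field names and JSON transport here are assumptions for illustration only.

```python
# Hypothetical stand-in for TGI's CachedBatch message, used only to
# illustrate a serialization round-trip check; not optimum-tpu code.
from dataclasses import dataclass, asdict, field
import json

@dataclass
class CachedBatch:
    id: int
    request_ids: list = field(default_factory=list)
    size: int = 0
    max_tokens: int = 0

def roundtrip(batch: CachedBatch) -> CachedBatch:
    """Serialize to JSON and back; all fields must survive unchanged."""
    return CachedBatch(**json.loads(json.dumps(asdict(batch))))
```

A test like `assert roundtrip(batch) == batch` over representative batches is usually enough to pin down which field is mangled.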

tengomucho commented 3 days ago

Daily update: warmup and truncation now work on the branch. I am currently working on increasing the input length, trying to do that by bucketing prefill inputs.
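Bucketing prefill inputs generally means padding each sequence up to one of a small set of fixed lengths, so the XLA compiler only ever sees a handful of static shapes instead of recompiling for every input length. A minimal sketch, assuming power-of-two buckets (the bucket scheme and function names are illustrative, not the actual optimum-tpu implementation):

```python
# Minimal sketch of input-length bucketing for prefill, assuming
# power-of-two bucket sizes. Padding to a bucket keeps the set of
# compiled XLA shapes small, at the cost of some wasted computation.

def next_bucket(length: int, min_bucket: int = 8, max_bucket: int = 1024) -> int:
    """Smallest power-of-two bucket >= length, clamped to max_bucket."""
    bucket = min_bucket
    while bucket < length and bucket < max_bucket:
        bucket *= 2
    return bucket

def pad_for_prefill(token_ids: list, pad_id: int = 0) -> list:
    """Left-pad token ids up to the bucket size, so the last real token
    stays at the end of the sequence where next-token logits are read."""
    bucket = next_bucket(len(token_ids))
    return [pad_id] * (bucket - len(token_ids)) + token_ids
```

With buckets like 8, 16, 32, ... a warmup pass can compile each bucket once; every later request then reuses one of those compiled shapes.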

tengomucho commented 1 day ago

I have almost fixed everything: truncation works as it should, and bucketing and warmup work too. However, I also introduced a bug, because I padded incorrectly when bucketing prefills. I will fix that tomorrow.
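A common shape of this padding bug is forgetting that padded positions need a matching attention mask (and that the pad side matters for where the next-token logits are read). A hedged sketch of the safe pattern, with illustrative names rather than the actual optimum-tpu code:

```python
# Illustrative sketch of padding a prefill input to its bucket size
# together with a matching attention mask: 0 for padding, 1 for real
# tokens. Left-padding keeps the last real token at the sequence end.

def pad_with_mask(token_ids: list, bucket: int, pad_id: int = 0):
    """Left-pad token ids to `bucket` and build the attention mask."""
    n_pad = bucket - len(token_ids)
    ids = [pad_id] * n_pad + token_ids
    mask = [0] * n_pad + [1] * len(token_ids)
    return ids, mask
```

If the mask is built for the unpadded length, or the ids are padded on the wrong side relative to the mask, the model attends to pad tokens or reads logits from the wrong position, which matches the kind of bug described above.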