huggingface / optimum-tpu

Google TPU optimizations for transformers models

Apache License 2.0


Few more Inference Endpoints fixes #69

Closed. tengomucho closed this 2 months ago

tengomucho commented 2 months ago

What does this PR do?

Fix clearing a request by ID (it was crashing the server).
Raise an error when there are more requests than the server can handle (this should never happen, but it is better to handle it explicitly).
Add more prefill lengths to warmup. Warmup will take longer, but inference will be faster for shorter prompts, at least until we find a better fix for bucketing and padding not working as expected.
Set the image version to 0.1.2 (ready for release).
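
For illustration only, here is a minimal sketch of the three behaviors listed above. It is not the code from this PR: `SlotPool`, `TooManyRequestsError`, `warmup_prefill_lengths`, the default minimum length of 32, and the doubling scheme are all hypothetical names and assumptions chosen to show the idea (clearing by ID without crashing, failing loudly when capacity is exceeded, and warming up several prefill lengths instead of one).

```python
from typing import Dict, List, Optional


class TooManyRequestsError(RuntimeError):
    """Raised when more requests arrive than there are slots (hypothetical)."""


class SlotPool:
    """Hypothetical fixed-size pool mapping request IDs to slot indices."""

    def __init__(self, num_slots: int) -> None:
        self.num_slots = num_slots
        self._slots: Dict[int, int] = {}  # request_id -> slot index

    def assign(self, request_id: int) -> int:
        """Return the slot for a request, allocating a free one if needed."""
        if request_id in self._slots:
            return self._slots[request_id]
        used = set(self._slots.values())
        free = [i for i in range(self.num_slots) if i not in used]
        if not free:
            # Should never happen if the router respects the batch size,
            # but an explicit error is easier to debug than a crash.
            raise TooManyRequestsError(
                f"no free slot for request {request_id} "
                f"(capacity {self.num_slots})"
            )
        self._slots[request_id] = free[0]
        return free[0]

    def clear(self, request_id: Optional[int] = None) -> None:
        """Clear one request by ID, or every request when no ID is given."""
        if request_id is None:
            self._slots.clear()
        else:
            # pop with a default: an unknown ID is a no-op, not an exception.
            self._slots.pop(request_id, None)


def warmup_prefill_lengths(max_length: int, min_length: int = 32) -> List[int]:
    """Assumed scheme: doubling lengths below max_length, then max_length."""
    lengths = []
    n = min_length
    while n < max_length:
        lengths.append(n)
        n *= 2
    lengths.append(max_length)
    return lengths


if __name__ == "__main__":
    pool = SlotPool(num_slots=2)
    pool.assign(10)
    pool.assign(11)
    pool.clear(10)       # frees the slot held by request 10
    pool.clear(999)      # unknown ID, silently ignored
    print(warmup_prefill_lengths(1024))
```

Warming up each of these lengths means one compilation per length at startup, which is the trade-off described above: a longer warmup in exchange for shorter prompts not being padded all the way to the maximum length.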

HuggingFaceDocBuilderDev commented 2 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.