yes, please!
@nathan-az were you able to resolve this issue?
@varad0309 We've begun using the benchmark tool. `ignore_eos_token` would still be a nice-to-have for forcing long output sequences, but the benchmark tool serves its purpose well!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Feature request
I would like `ignore_eos_token`, which is built into TGI specifically for benchmarking, to become available on one of the HTTP endpoints (e.g. as an optional field in `parameters`, defaulting to `false`, in `generate`).
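For concreteness, the request I have in mind would look roughly like the sketch below. Note that `ignore_eos_token` is the proposed field and does not exist in the HTTP API today; everything else is just the usual `/generate` request shape.

```python
import requests

# Sketch of the proposed API: everything here is TGI's existing /generate
# request shape except `ignore_eos_token`, which is the optional field this
# issue proposes adding to `parameters`.
payload = {
    "inputs": "Write a long story about benchmarking.",
    "parameters": {
        "max_new_tokens": 512,
        "ignore_eos_token": True,  # proposed field, defaulting to false
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["generated_text"])
```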
Motivation
The internal comments indicate this is used for benchmarking: https://github.com/huggingface/text-generation-inference/blob/954653466d24a9b3435988136983398bdf788a2f/proto/generate.proto#L91.
For the same reason, we want to be able to benchmark and stress test different model/hardware/configuration options to optimise for latency/concurrency/input length/output length.
The ability to fix output length by disabling the EOS token and managing total tokens (or new tokens) is much preferable in this case.
This is possible in vLLM, and is (probably) possible with OpenAI's endpoints by tweaking `logit_bias` to downweight the EOS token (although I haven't tried this). That is to say, I don't think it's an antipattern to allow something like this.
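For reference, this is roughly what the vLLM equivalent looks like with its offline API (a sketch; the model name is only a placeholder):

```python
# Sketch using vLLM's offline API: `ignore_eos` disables stopping on the EOS
# token, so the output length is fixed by `max_tokens`.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=512, ignore_eos=True)
outputs = llm.generate(["Write a long story about benchmarking."], params)
print(outputs[0].outputs[0].text)
```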
Your contribution
I'm happy to submit a PR here, but it seems this has been ignored and closed before. I'd like to know whether the maintainers are open to this before making a PR.