yes, please!
@nathan-az were you able to resolve this issue?
@varad0309 We've begun using the benchmark tool. `ignore_eos_token` would still be a nice-to-have for forcing long output sequences, but the benchmark tool serves its purpose well!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Feature request
I would like `ignore_eos_token`, which is built into TGI specifically for benchmarking, to become available on one of the HTTP endpoints (e.g. as an optional field in `parameters`, defaulting to `false`, in `generate`).
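For concreteness, the request I have in mind would look roughly like the sketch below. Note that `ignore_eos_token` is the proposed field and does not exist in the HTTP API today; everything else is just the usual `/generate` request shape.

```python
import requests

# Sketch of the proposed API: everything here is TGI's existing /generate
# request shape except `ignore_eos_token`, which is the optional field this
# issue proposes adding to `parameters`.
payload = {
    "inputs": "Write a long story about benchmarking.",
    "parameters": {
        "max_new_tokens": 512,
        "ignore_eos_token": True,  # proposed field, defaulting to false
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["generated_text"])
```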
Motivation
The internal comments indicate this is used for benchmarking: https://github.com/huggingface/text-generation-inference/blob/954653466d24a9b3435988136983398bdf788a2f/proto/generate.proto#L91.
For the same reason, we want to be able to benchmark and stress test different model/hardware/configuration options to optimise for latency/concurrency/input length/output length.
The ability to fix output length by disabling the EOS token and managing total tokens (or new tokens) is much preferable in this case.
This is possible in vLLM, and is (probably) possible with OpenAI's endpoints by tweaking `logit_bias` to downweight the EOS token (although I haven't tried this). That is to say, I don't think it's an antipattern to allow something like this.
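For reference, this is roughly what the vLLM equivalent looks like with its offline API (a sketch; the model name is only a placeholder):

```python
# Sketch using vLLM's offline API: `ignore_eos` disables stopping on the EOS
# token, so the output length is fixed by `max_tokens`.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=512, ignore_eos=True)
outputs = llm.generate(["Write a long story about benchmarking."], params)
print(outputs[0].outputs[0].text)
```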
Your contribution
I'm happy to submit a PR here, but it seems this has been ignored and closed before. I'd like to know whether the maintainers are open to this before making a PR.