huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Support for no_repeat_ngram_size generation parameter #2219

Closed njbrake closed 2 weeks ago

njbrake commented 1 month ago

Feature request

The Transformers library supports the no_repeat_ngram_size parameter for generation. https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.no_repeat_ngram_size Could this be supported in the TGI library?

Motivation

This parameter helps cut down on repetition and produces more concise responses.
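
For context, this is how the parameter is used with plain Transformers today (gpt2 is just a stand-in model for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    no_repeat_ngram_size=3,  # no 3-gram may appear twice in the output
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))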

Your contribution

I would be happy to help implement this if I can get some pointers on what code needs to be adjusted to accommodate the change. It's unclear to me whether the change needs to be implemented in the Rust layer, the Python layer, or both.

I'd like the parameter to be supported for Llama type models (Llama, Mistral, Phi, etc).

Thanks!

ErikKaum commented 1 month ago

Hi @njbrake 👋

I think it would be a welcome contribution if you have the bandwidth 👍

The first step would be to implement the args parsing in the text-generation-launcher: https://github.com/huggingface/text-generation-inference/blob/dbb23fbfa868ad8f961c60896e346fad3d2ab440/launcher/src/main.rs#L170

That should ripple down when starting the text-generation-server: https://github.com/huggingface/text-generation-inference/blob/dbb23fbfa868ad8f961c60896e346fad3d2ab440/launcher/src/main.rs#L679

And then the text-generation-server (which is Python) would have the actual implementation: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/logits_process.py
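
Roughly, the core logic looks something like this (just a sketch to illustrate the idea, written against a Transformers-style processor interface rather than as a drop-in for logits_process.py):

import torch

class NoRepeatNGramProcessor:
    """Sketch: ban any token that would complete an n-gram already seen in the sequence."""

    def __init__(self, ngram_size: int):
        self.ngram_size = ngram_size

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # input_ids: [batch, seq_len] tokens generated so far
        # scores:    [batch, vocab_size] logits for the next token
        n = self.ngram_size
        for batch_idx, seq in enumerate(input_ids.tolist()):
            if len(seq) < n:
                continue
            # map each (n-1)-token prefix to the set of tokens that followed it
            seen = {}
            for i in range(len(seq) - n + 1):
                prefix = tuple(seq[i : i + n - 1])
                seen.setdefault(prefix, set()).add(seq[i + n - 1])
            # ban every token that would repeat an already-seen n-gram
            current_prefix = tuple(seq[len(seq) - (n - 1) :])
            for token in seen.get(current_prefix, ()):
                scores[batch_idx, token] = float("-inf")
        return scores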

Will this help you get started?

njbrake commented 1 month ago

Thanks! Yes, that helps me get started. The first two steps you reference (in main.rs) appear to refer to a command-line option passed when launching the server. no_repeat_ngram_size is a generation configuration parameter that should be passed by the client when running an inference request. Do you have any advice about where I need to look for that? It seems like the text-generation-launcher isn't the right place. For example, I would expect to be able to do something like:

from huggingface_hub import InferenceClient

client = InferenceClient()
output = client.chat_completion(
    messages=[{"role": "user", "content": "Finish this sentence...."}],
    max_tokens=512,
    temperature=0.1,
    no_repeat_ngram_size=7,  # the proposed parameter
)

ErikKaum commented 1 month ago

Ah ofc, sorry I was too fast with that 👍

So then this is where the server request is parsed (depending on the endpoint): https://github.com/huggingface/text-generation-inference/blob/dbb23fbfa868ad8f961c60896e346fad3d2ab440/router/src/server.rs#L557

And the corresponding request struct is, for example, this one: https://github.com/huggingface/text-generation-inference/blob/dbb23fbfa868ad8f961c60896e346fad3d2ab440/router/src/lib.rs#L383

The Python part in the logits processor is still relevant, though. The request is then passed to the Python server through the gRPC layer, and the protobuf is defined here: https://github.com/huggingface/text-generation-inference/blob/dbb23fbfa868ad8f961c60896e346fad3d2ab440/proto/generate.proto#L60

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

njbrake commented 3 weeks ago

Hi @ErikKaum! Should I let this ticket be closed, since we have that PR in progress?

ErikKaum commented 3 weeks ago

Hi @njbrake!

Actually, I unfortunately have some bad news regarding the feature. After a lot of internal discussion, we think it would be best not to merge the PR and therefore not to support this feature.

First off, I understand this is not the news you'd prefer to hear, and I hope it doesn't discourage you from continuing to be active in the community. I also want to thank you for the time and effort you put in. It's definitely on us that we didn't have this discussion earlier, so I'd like to apologize for that.

The main reason for discarding this is performance: we noticed that calculating n-grams had a noticeable latency penalty, especially with larger n. The second reason is to keep the logit-processing logic as simple as possible. We understand the desire to guide LLMs toward more varied output, and there are several ways to achieve that.
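
For example, the sampling parameters TGI already exposes get you most of the way there (a sketch, assuming a TGI endpoint running locally at http://localhost:8080):

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # example endpoint URL

# text_generation exposes repetition_penalty, which penalizes already-seen tokens
output = client.text_generation(
    "Finish this sentence....",
    max_new_tokens=512,
    temperature=0.1,
    repetition_penalty=1.2,
)

# the OpenAI-compatible chat_completion exposes frequency_penalty
chat = client.chat_completion(
    messages=[{"role": "user", "content": "Finish this sentence...."}],
    max_tokens=512,
    frequency_penalty=0.5,
)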

Hopefully this makes sense to you as well. And again, thank you for all the time and effort you put in 🙌

If it's okay with you, I'll close the issue and PRs. We're of course open to reconsidering if there's bigger demand for this specific feature 👍

njbrake commented 3 weeks ago

Hi @ErikKaum, no problem, and thank you for the comprehensive response and explanation. Not the outcome I hoped for, but your reasoning makes sense 👍🏼 Thanks!