huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Generation kwargs assignment when processing a request #2447

Open · ChenlongDeng opened this issue 2 months ago

ChenlongDeng commented 2 months ago

Hello, thanks for your good work! Text-generation-inference (TGI) supports the deployment of non-core models, according to the official documentation (https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/non_core_models):

TGI supports various LLM architectures (see full list here). If you wish to serve a model that is not one of the supported models, TGI will fall back to the transformers implementation of that model. This means you will be unable to use some of the features introduced by TGI, such as tensor-parallel sharding or flash attention. However, you can still get many benefits of TGI, such as continuous batching or streaming outputs.

It seems TGI will use the transformers implementation for our non-core model. I would therefore like to set some generation kwargs in the generate() call to match the original implementation of my model, but I can't find any entry point for doing so. Could you tell me how to achieve this? Thanks!
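For context, this is roughly what I do with plain transformers today (a minimal sketch; the model name and the specific kwargs are just placeholders standing in for my actual model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "my-org/my-non-core-model"  # placeholder, not my actual model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello, world!", return_tensors="pt")
# These are the kinds of generation kwargs I would like to control when
# the model is served behind TGI.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    repetition_penalty=1.15,
    num_beams=4,  # e.g. beam search, which I don't see exposed by TGI requests
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```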

ErikKaum commented 2 months ago

Hi @ChenlongDeng 👋

That's a really good question. I'm pretty sure it unfortunately isn't possible, especially if these are kwargs that are sent in with the request. Or are they added when the model is started? 🤔

Edit: by "possible" I mean that it isn't supported as-is right now.
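What is supported per request is TGI's own fixed set of sampling parameters. A quick sketch with the huggingface_hub client (the endpoint URL here is a placeholder for wherever your server runs):

```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # placeholder TGI endpoint

# Only the parameters in TGI's request schema can be set per request.
out = client.text_generation(
    "Hello, world!",
    max_new_tokens=64,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,
)
print(out)
# Arbitrary transformers generate() kwargs (e.g. num_beams) have no
# corresponding field in the request schema, so they can't be forwarded.
```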

ChenlongDeng commented 2 months ago


Thanks for your response! My problem is that when we serve a model with its config file, we can't override those config parameters by sending a new request. I understand this is challenging and may conflict with the current design, but I believe TGI would work even more seamlessly with transformers if it supported this.
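In the meantime, a possible workaround (under the unverified assumption that the transformers fallback path honors the model's generation_config.json at load time) is to bake the desired defaults into the checkpoint directory before launching TGI:

```python
from transformers import GenerationConfig

# Unverified assumption: the transformers fallback reads these defaults
# from generation_config.json when the model is loaded.
config = GenerationConfig(
    max_new_tokens=64,
    repetition_penalty=1.15,
    do_sample=True,
    temperature=0.7,
)
# Save into the local model directory that TGI is pointed at (placeholder path).
config.save_pretrained("/path/to/my-non-core-model")
```

This only sets server-side defaults at startup, though; it still doesn't allow per-request overrides, which is what I'm really after.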