microsoft / LLMLingua

To speed up LLM inference and enhance the model's perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License
4.17k stars 222 forks

Support for remote LLM through API #65

Open deltawi opened 5 months ago

deltawi commented 5 months ago

Hi team,

Due to the computing resources needed to run this, it would be nice if you could also add an option where the user can provide a url_endpoint and api_key for a remote REST API, instead of downloading the model from HuggingFace.
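Something along these lines is what I have in mind. This is purely illustrative of the requested option: url_endpoint and api_key are not existing PromptCompressor arguments today.

```python
# Illustrative only -- url_endpoint / api_key do NOT exist in LLMLingua yet;
# they are the option requested in this issue.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    url_endpoint="https://my-inference-host/v1",  # hypothetical remote REST API
    api_key="sk-...",                             # hypothetical credential
)

result = compressor.compress_prompt(
    "...long context...",
    target_token=200,
)
print(result["compressed_prompt"])
```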

iofu728 commented 5 months ago

Hi @deltawi, thank you for your interest in and support of LLMLingua.

Currently, since API models do not provide log probabilities on the prompt side, it is challenging to directly support this requirement. However, we will incorporate this need into our future plans.

Refer to issue #44.
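To illustrate the constraint: token-level compression ranks prompt tokens by how predictable they are under a small causal LM, which requires a log probability for every token of the prompt. Chat-completion APIs typically only return logprobs for generated tokens. Below is a rough sketch of that measurement (not LLMLingua's actual code; gpt2 is just a stand-in for the small local model):

```python
# Sketch of computing prompt-side, per-token log probabilities locally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the small LM used for compression
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # [1, seq_len, vocab]

# Log-prob of each prompt token given its prefix (shift by one position).
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_ids = inputs["input_ids"][:, 1:]
token_log_probs = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

for tok, lp in zip(tokenizer.convert_ids_to_tokens(token_ids[0]), token_log_probs[0]):
    print(f"{tok:>12s}  {lp.item():8.3f}")  # lower log-prob => more informative token
```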

bytecod3r commented 4 months ago

Hi @iofu728, is it possible to run a model on a server and point the code to use that model over its API? E.g., run a Llama 2 7B on a server.
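For example, I could imagine hosting the small model behind an endpoint that returns prompt-side log probabilities. This is a hypothetical sketch of such a server, not anything LLMLingua supports today; the model name and route are assumptions.

```python
# Hypothetical server exposing prompt-side logprobs for a self-hosted model.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
model_name = "meta-llama/Llama-2-7b-hf"  # assumption: whatever model you host
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

class LogprobRequest(BaseModel):
    text: str

@app.post("/prompt_logprobs")
def prompt_logprobs(req: LogprobRequest):
    inputs = tokenizer(req.text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    token_ids = inputs["input_ids"][:, 1:]
    token_log_probs = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return {
        "tokens": tokenizer.convert_ids_to_tokens(token_ids[0]),
        "logprobs": token_log_probs[0].tolist(),
    }
```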

afbarbaro commented 1 month ago

Same need here. I love the concepts of LLMLingua and they are super useful, but I do not have the ability to self-host inference for any model (for many reasons: cost, know-how, security, capacity, etc.). I leverage Microsoft Azure AI and Fireworks AI, which host small, fast models that can apparently be used for LLMLingua. I'd like the ability to use an API for the calls that LLMLingua needs.

Any comments on whether this will make it into the roadmap?

iofu728 commented 1 month ago

Hi @afbarbaro, we support the API mode in Prompt flow; you can refer to this document to use it.