microsoft / LLMLingua

To speed up LLM inference and enhance the LLM's perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

Using web-hosted model for inference #44

Open dnnp2011 opened 8 months ago

dnnp2011 commented 8 months ago

Currently the NousResearch/Llama-2-7b-chat-hf model appears to be running locally on my machine, which can take quite a while for long prompts. I'd like to use more AI-optimized hardware to speed this process up.

Is it possible to use a web-hosted version of the model, or use a different web-hosted model entirely?

iofu728 commented 8 months ago

Hi @dnnp2011, thank you for your support with LLMLingua.

In fact, you can use any web-hosted model, as long as it provides a 'logprob'-style interface for calculating perplexity.
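
For illustration (not from the original thread), here is a minimal sketch of why such an interface is enough: given per-token log probabilities for the prompt, perplexity follows directly. The logprob values below are made up.

import math

def perplexity_from_logprobs(token_logprobs):
    # The first token usually has no conditional log probability (None), so skip it.
    valid = [lp for lp in token_logprobs if lp is not None]
    # Perplexity is the exponential of the average negative log probability per token.
    return math.exp(-sum(valid) / len(valid))

# Illustrative values only:
print(perplexity_from_logprobs([None, -6.97, -2.05, -8.89, -13.96, -5.48]))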

dnnp2011 commented 8 months ago

Thanks for getting back to me @iofu728

How exactly do I implement this in practice? I'm not clear on how to pass any HuggingFace or OpenAI API details to define the model host and pass along any API keys. The only reference to something like this I've seen is using OpenAI embeddings for the re-ranking step.

iofu728 commented 8 months ago

Hi @dnnp2011,

Sorry, the fact is that it's currently not possible to use a web-hosted API for this purpose, as we can't obtain the log probabilities of the prompt part through web-hosted APIs. Previously, it was feasible to get the log probabilities of the prompt by calling the OpenAI API with max_tokens=0. Therefore, unless there is an API available that provides the log probabilities for the prompt, we can only implement this through self-deployed models.

snarb commented 7 months ago

@iofu728 can we use an older OpenAI API version? Do you know in which version it was available?

iofu728 commented 7 months ago

> @iofu728 can we use an older OpenAI API version? Do you know in which version it was available?

After confirming, we found that some OpenAI models can return log probabilities for the prompt side. You can refer to the following code:

import openai  # legacy (pre-1.0) OpenAI Python client

logp = openai.Completion.create(
    model="davinci-002",
    prompt="Please return the logprobs",
    logprobs=0,      # include per-token log probabilities
    max_tokens=0,    # do not generate any new tokens
    echo=True,       # echo the prompt back together with its logprobs
    temperature=0,
)
Out[3]:
<OpenAIObject text_completion id=-at > JSON: {
  "id": "",
  "object": "text_completion",
  "created": 1707295146,
  "model": "davinci-002",
  "choices": [
    {
      "text": "Please return the logprobs",
      "index": 0,
      "logprobs": {
        "tokens": [
          "Please",
          " return",
          " the",
          " log",
          "pro",
          "bs"
        ],
        "token_logprobs": [
          null,
          -6.9668007,
          -2.047512,
          -8.885729,
          -13.960022,
          -5.479665
        ],
        "top_logprobs": null,
        "text_offset": [
          0,
          6,
          13,
          17,
          21,
          24
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 6
  }
}
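
For reference, a short sketch (assuming the legacy pre-1.0 openai client and the logp response shown above) of pulling the prompt-side log probabilities out of that response:

import math

# token_logprobs comes back aligned with the echoed prompt tokens.
token_logprobs = logp["choices"][0]["logprobs"]["token_logprobs"]
valid = [lp for lp in token_logprobs if lp is not None]  # the first entry is None
prompt_ppl = math.exp(-sum(valid) / len(valid))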

DumoeDss commented 6 months ago

https://github.com/vllm-project/vllm/discussions/1203 Hey, I was wondering if this would be useful? vLLM's OpenAI-compatible interface returns logprobs in its results. I think this issue could also be addressed through the vLLM interface (letting the user choose which LLM to use for the corresponding language).

DumoeDss commented 6 months ago

https://github.com/lm-sys/FastChat/pull/2612 And the FastChat server supports it too.

iofu728 commented 6 months ago

Hi @DumoeDss,

Thank you for your information. It seems very useful, especially FastChat, which appears to support echo, enabling the return of logprobs from the prompt side. We will consider using the relevant engine in the future. If you are willing to do some adaptations, we would greatly welcome it.

DumoeDss commented 6 months ago

@iofu728 I'd be happy to try to do it, but I'd have to dive into the source code first, and I'm not sure how to start yet.

iofu728 commented 6 months ago

> @iofu728 I'd be happy to try to do it, but I'd have to dive into the source code first, and I'm not sure how to start yet.

Hi @DumoeDss, the core issue involves implementing the self.get_ppl function through web API calls. Please take a look at the relevant code, and if you need any assistance, feel free to reply.
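
As a rough illustration of that direction (not LLMLingua's actual code: the function name, base URL, and model name below are placeholders), a prompt-perplexity helper backed by an OpenAI-compatible completions endpoint that honors echo=True with max_tokens=0, such as a self-hosted FastChat or vLLM server, could look roughly like this:

import math
import openai  # legacy (pre-1.0) client, matching the other examples in this thread

openai.api_base = "http://localhost:8000/v1"  # placeholder for a self-hosted server
openai.api_key = "EMPTY"                      # self-hosted servers often ignore the key

def api_get_ppl(text, model="placeholder-model"):
    # Hypothetical helper: ask the server to echo the prompt with its logprobs
    # instead of generating new tokens, then compute perplexity from them.
    resp = openai.Completion.create(
        model=model,
        prompt=text,
        logprobs=0,
        max_tokens=0,   # generate nothing...
        echo=True,      # ...but echo the prompt back with its logprobs
        temperature=0,
    )
    token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]
    valid = [lp for lp in token_logprobs if lp is not None]
    return math.exp(-sum(valid) / len(valid))

Wiring something like this into the compressor would still require matching LLMLingua's token-level expectations (e.g. per-token values rather than a single scalar), which is why looking at the self.get_ppl implementation is the natural starting point.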

DumoeDss commented 6 months ago

@iofu728 I tried outputting logprobs with FastChat/vLLM and ran into some troublesome situations during pre-processing.

First of all, the two issues/PRs I mentioned above on GitHub both apply to the completion interface and don't support the chat completion interface. The PRs for vLLM do support chat completion, but after trying them out I realized that they don't work very well.

The instruction I use is "Please repeat the following and do not output anything else: content".

I tried the models yi-34B-chat, qwen1.5-0.5B-chat, qwen1.5-1.8B-chat, qwen1.5-4B-chat, and qwen1.5-7B-chat. The output of the models at 4B and above is slightly more satisfactory, but there are cases where the output does not match the original text, which makes it impossible to calculate the original logprobs. And even with the 4B model, outputting a 3000+ token content in full takes about 20s, while calculating it directly with the 0.5B model takes less than 400ms. I don't know if I'm doing something wrong, but I think using the chat model's output to calculate logprobs might not be a step in the right direction.

There is a modification here where I added an interface using FastAPI, which might be an acceptable solution.

I sent you an email so we can continue the discussion~

iofu728 commented 6 months ago

Hi @DumoeDss,

Thank you for your help. However, there seems to be an issue with the API call parameters. You can refer to the following:

import openai  # legacy (pre-1.0) OpenAI Python client

logp = openai.Completion.create(
    model="davinci-002",
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,
)

By setting max_tokens to 0 and echo to True, the model will not generate new tokens but will return the logprobs of the prompt side. I briefly checked, and FastChat should support this. If you have more questions, feel free to ask.
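
As a quick way to verify that a self-hosted endpoint really honors those parameters (a sketch only; the base URL and model name are placeholders), the same call can be pointed at the server before adapting anything in LLMLingua:

import openai

openai.api_base = "http://localhost:8000/v1"  # placeholder for a FastChat/vLLM OpenAI-compatible server
openai.api_key = "EMPTY"

resp = openai.Completion.create(
    model="placeholder-model",
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,
)
# If echo and max_tokens=0 are supported, the echoed prompt tokens carry logprobs.
print(resp["choices"][0]["logprobs"]["token_logprobs"])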

codylittle commented 6 months ago

To chime in here, support for models hosted through Azure AI Studio would be fantastic too. And a TypeScript library too (;