lm-sys / RouteLLM

A framework for serving and evaluating LLM routers - save LLM costs without compromising quality!
Apache License 2.0
2.78k stars 204 forks

RouteLLM usage instructions please #4

Closed pirouzkhakzad closed 1 month ago

pirouzkhakzad commented 1 month ago

Hi there, first, I want to thank you for this great project. I have successfully installed and configured RouteLLM on my machine, but I cannot find any information on how to execute it. Would you kindly provide some examples of how to use the tool? In particular, I want an example of how to pass a prompt and get back the model name.

Thanks so much,

Pirouz

Sunt-ing commented 1 month ago

+1. I would also like to see minimal examples of using the routing models so that users can integrate them into other projects.

iojw commented 1 month ago

Hi! The main way to interface with RouteLLM in your applications is through our OpenAI-compatible server, which you can run locally or in the cloud.

  1. For example, if I want to run the MF router locally with the default model pair (GPT-4 / Mixtral 8x7B), I first launch the server:

    > python -m routellm.openai_server --routers mf --config config.example.yaml 
    Launching server with routers: ['mf']
    INFO:     Started server process [92737]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://0.0.0.0:6060 (Press CTRL+C to quit)
  2. Next, I want to calibrate my threshold for this router so I know what threshold to use for routing. If you have some knowledge about the type of queries you are going to serve, you can get a more accurate threshold by calibrating on that dataset using the calibrate_threshold script. In this case, I'm going to calibrate based on the publicly-available LMSYS dataset (https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k). Say I want approximately 50% of my calls to be routed to GPT-4, managing my cost while maximizing quality.

    > python -m routellm.calibrate_threshold --task calibrate --routers mf causal_llm bert --strong-model-pct 0.5 --config config.example.yaml
    For 50.0% strong model calls, calibrated threshold for mf: 0.11592505872249603

This means that I'll want to use 0.116 as my threshold to get approximately 50% of calls routed to GPT-4. (Note that if your input queries differ a lot from the dataset used to calibrate, then the % of calls routed to GPT-4 can differ, so you want to calibrate on a dataset closest to the type of queries you will receive).

  3. Now, I can use it in my Python application to generate completions just like I would any OpenAI model.

    import openai

    client = openai.OpenAI(
        base_url="http://localhost:6060/v1",
        api_key="no_api_key",
    )
    response = client.chat.completions.create(
        model="router-mf-0.116",
        messages=[
            {"role": "user", "content": "What is a roux?"}
        ],
    )
    print(response.choices[0].message.content)

    Here, I'm setting the model to router-mf-0.116 to mean: "I want to use the MF router with a threshold of 0.116". Importantly, the server that we have launched will also work with any other application that uses the OpenAI endpoint - you just need to update the base URL.
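
    For example, a minimal sketch of the same request sent as plain HTTP with the requests library, assuming the server from step 1 is still running locally on port 6060:

    import requests

    # Same request as the OpenAI SDK example above, but sent as plain HTTP
    # against the locally running RouteLLM server.
    resp = requests.post(
        "http://localhost:6060/v1/chat/completions",
        json={
            "model": "router-mf-0.116",
            "messages": [
                {"role": "user", "content": "What is a roux?"}
            ],
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])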

Thanks to your feedback, I've also added the above step-by-step example to the repo to make it easier for others: https://github.com/lm-sys/RouteLLM/blob/main/docs/minimal_walkthrough.md

I hope this helps! Let me know if you have any other questions or suggestions on what you would like to see.

Sunt-ing commented 1 month ago

Thanks for your example!

Is there any way to use an embedded router model, i.e., calling the router model by a function call rather than an HTTP request?

iojw commented 1 month ago

Do you have any example of how you plan to use it? I'd like to better understand the use case.

Sunt-ing commented 1 month ago

Yes, we are going to use it in a local client to decide whether to use a local model or a remote model. I therefore think that removing the requirement of an OpenAI-compatible server would allow greater flexibility and broader use cases for the router.

RoyalMamba commented 1 month ago

Please provide more detailed documentation. I tried http://0.0.0.0:6060/v1/chat/completions through Postman, with this in my body:

{
  "model": "router-mf-0.116",
  "messages" : [
      {"role": "user", "content": "What is a roux?"}
  ]
}

I guess it might be some key error.

iojw commented 1 month ago

@RoyalMamba Can you provide the error that you receive?

@Sunt-ing Thank you for the feedback! I'm considering what would be the best interface for this. Would an API that takes in a prompt and returns the name of the model to route to work for your use case?

Sunt-ing commented 1 month ago

Yes, it perfectly fits my use case. Thanks!

Sunt-ing commented 1 month ago

Also, in my case, it would be better if the model could run on either CPU or GPU, as a GPU may not always be available to the router model, while the router model itself is relatively lightweight.

iojw commented 1 month ago

@Sunt-ing Could you share more about why you are routing between a remote and local model? Would love to understand the use case better as we figure out how to extend this framework :)

Sunt-ing commented 1 month ago

If the local model is not good enough, we use the remote model, which is more expensive but more accurate.

iojw commented 1 month ago

@Sunt-ing We've just added support for interfacing with RouteLLM directly from Python without a server. Please check out the docs for details!

Regarding your previous comment, yes, all models should default to running on your CPU when a GPU isn't available.

Let me know if you have any questions or suggestions!
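
For readers who, like the original question in this issue, mainly want to pass a prompt and get back the routed model name rather than a full completion, here is a rough sketch. It relies on internals visible in the traceback further down this thread (client.routers, client.model_pair, and route(prompt, threshold, model_pair)); whether these are part of the stable public interface is an assumption, so check the docs for the supported way to do this.

import os

from routellm.controller import Controller

# Per the maintainer's note further down, the mf router currently embeds the
# prompt via the OpenAI API, so a key is still needed for this router.
os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"

client = Controller(
    routers=["mf"],
    # Placeholder model names; they only determine which name the router returns.
    strong_model="gpt-4-1106-preview",
    weak_model="mistralai/Mixtral-8x7B-Instruct-v0.1",
)

# Ask the router which model the prompt should go to, without generating anything.
routed_model = client.routers["mf"].route(
    "What is a roux?", 0.116, client.model_pair
)
print(routed_model)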

iojw commented 1 month ago

@RoyalMamba @pirouzkhakzad Please refer to our updated README, which includes clearer instructions on how to get this working.

Closing this for now, let me know if you face any other issues!

Sunt-ing commented 1 month ago

Hi @iojw, I find that OPENAI_API_KEY is always required, even when using the SDK with non-OpenAI models. Is it possible to relax this requirement, i.e., not require OPENAI_API_KEY when it won't actually be used?

Sunt-ing commented 1 month ago

I guess the internal code has several dependencies on OpenAI, but from my personal perspective, such dependencies should not be compulsory.

Here is my code and error message. I want to evaluate the router with open models as I don't have a real OpenAI API key. Thanks!

import os
os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"
os.environ["ANYSCALE_API_KEY"] = "esecret_XXXXXX"

from routellm.controller import Controller

client = Controller(
  # List of routers to initialize
  routers=["mf"],
  # The pair of strong and weak models to route to
  strong_model="meta-llama/Llama-2-13b-chat-hf",
  weak_model="meta-llama/Llama-2-7b-chat-hf",
  # The config for the router (best-performing config by default)
  config = {
    "mf": {
      "checkpoint_path": "routellm/mf_gpt4_augmented"
    }
  },
  # Override API base and key for LLM calls
  api_base=None,
  api_key=None,
  # Display a progress bar for operations
  progress_bar=False,
)

response = client.chat.completions.create(
  # This tells RouteLLM to use the MF router with a cost threshold of 0.11593
  model="router-mf-0.11593",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)
print(response.choices[0]["message"]["content"])

Traceback (most recent call last):
  File "/local/scratch/f/xiaoze/ecogen/tools/router.py", line 27, in <module>
    response = client.chat.completions.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/f/xiaoze/RouteLLM/routellm/controller.py", line 150, in completion
    kwargs["model"] = self._get_routed_model_for_completion(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/f/xiaoze/RouteLLM/routellm/controller.py", line 111, in _get_routed_model_for_completion
    routed_model = self.routers[router].route(prompt, threshold, self.model_pair)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/f/xiaoze/RouteLLM/routellm/routers/routers.py", line 42, in route
    if self.calculate_strong_win_rate(prompt) >= threshold:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/f/xiaoze/RouteLLM/routellm/routers/routers.py", line 239, in calculate_strong_win_rate
    winrate = self.model.pred_win_rate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/e/xiaoze/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/f/xiaoze/RouteLLM/routellm/routers/matrix_factorization/model.py", line 124, in pred_win_rate
    logits = self.forward([model_a, model_b], prompt)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/f/xiaoze/RouteLLM/routellm/routers/matrix_factorization/model.py", line 113, in forward
    OPENAI_CLIENT.embeddings.create(input=[prompt], model=self.embedding_model)
  File "/local/scratch/e/xiaoze/miniconda3/lib/python3.11/site-packages/openai/resources/embeddings.py", line 114, in create
    return self._post(
           ^^^^^^^^^^^
  File "/local/scratch/e/xiaoze/miniconda3/lib/python3.11/site-packages/openai/_base_client.py", line 1240, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/e/xiaoze/miniconda3/lib/python3.11/site-packages/openai/_base_client.py", line 921, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/local/scratch/e/xiaoze/miniconda3/lib/python3.11/site-packages/openai/_base_client.py", line 1020, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-XXXXXX. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

iojw commented 1 month ago

@Sunt-ing Hi there, for the MF router, we currently still need an OpenAI key for generating embeddings. If you'd like to use a router that does not require an OpenAI key at all, you can try the bert router instead.

I'll update the README to make this clearer. Also, we are aware of the desire to have fully local routing for models like MF, and we are looking into this!
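
For instance, adapting the Controller snippet above to the bert router might look roughly like this (the threshold is a placeholder you should calibrate yourself, and it is an assumption that the default config already covers the bert checkpoint):

import os

# Only the key for the provider actually serving the two models should be needed
# here; the OpenAI key is omitted on the assumption that the bert router scores
# prompts locally instead of calling the OpenAI embeddings API.
os.environ["ANYSCALE_API_KEY"] = "esecret_XXXXXX"

from routellm.controller import Controller

client = Controller(
    routers=["bert"],  # bert instead of mf, so no OpenAI embedding call is made
    strong_model="meta-llama/Llama-2-13b-chat-hf",
    weak_model="meta-llama/Llama-2-7b-chat-hf",
)

response = client.chat.completions.create(
    # "router-bert-0.5" means: use the bert router with a cost threshold of 0.5
    model="router-bert-0.5",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)
print(response.choices[0]["message"]["content"])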