lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

How to process requests with the FastChat API in parallel or in a batch style? #2313

Open BigAndSweet opened 1 year ago

BigAndSweet commented 1 year ago

I have to get vector embeddings / do sentiment analysis for millions of sentences. Currently it seems I can only process sentences one by one with the RESTful API server launched in the background. It is very slow and much GPU memory sits unused. Is there any way to process sentences with the FastChat API in parallel or in a batch style?

karthik19967829 commented 1 year ago

Try doing inference with https://github.com/huggingface/text-generation-inference . It has continuous batching, which should help with higher throughput and reduced latency thanks to caching.
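
For example, once a TGI server is running, you can keep many requests in flight from a thread pool so continuous batching can fill the GPU. A minimal client-side sketch (the host, port, and parameters are placeholders, and the /generate route is the one described in the TGI README):

import concurrent.futures

import requests

TGI_URL = "http://localhost:8080/generate"  # placeholder; TGI's default port is 8080

def generate(prompt: str) -> str:
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
    resp = requests.post(TGI_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["generated_text"]

prompts = [f"Sentence {i} to analyze." for i in range(32)]  # placeholder data
# Many concurrent requests let the server batch them on the GPU.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    outputs = list(pool.map(generate, prompts))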

Phil-U-U commented 11 months ago

@infwinston Hi Wei-Lin, is TGI the solution you would recommend for handling parallel requests?

infwinston commented 11 months ago

For text generation, we have vLLM support, which is a high-throughput inference engine. You can easily launch it by following this doc: https://github.com/lm-sys/FastChat/blob/main/docs/vllm_integration.md

For embedding generation, we don't support a high-throughput engine yet. There is a recent one by Hugging Face; see if it fits your needs: https://github.com/huggingface/text-embeddings-inference
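
For the embedding use case in the original question, a minimal client-side sketch against a local text-embeddings-inference server might look like the following (the host, port, and batch size are placeholders, and the /embed route is the one described in the TEI README):

import requests

TEI_URL = "http://localhost:8080/embed"  # placeholder; point this at your TEI server

def embed_batch(sentences):
    # TEI accepts a list of inputs and returns one embedding vector per input.
    resp = requests.post(TEI_URL, json={"inputs": sentences}, timeout=300)
    resp.raise_for_status()
    return resp.json()

sentences = [f"sentence {i}" for i in range(1000)]  # placeholder data
embeddings = []
batch_size = 32  # placeholder batch size
for start in range(0, len(sentences), batch_size):
    embeddings.extend(embed_batch(sentences[start:start + batch_size]))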

Phil-U-U commented 11 months ago

For text generation, we have vLLM support, which is a high-throughput inference engine. You can easily launch it by following this doc: https://github.com/lm-sys/FastChat/blob/main/docs/vllm_integration.md

For embedding generation, we don't support a high-throughput engine yet. There is a recent one by Hugging Face; see if it fits your needs: https://github.com/huggingface/text-embeddings-inference

Thank you for the reply. I will give it a try. Meanwhile, I should include the link to the Text Generation Inference (TGI) project I mentioned above: https://github.com/huggingface/text-generation-inference

Phil-U-U commented 11 months ago

https://github.com/lm-sys/FastChat/blob/main/docs/vllm_integration.md

Hi Wei-Lin,

I ran into the following error when running pip install vllm: RuntimeError: Cannot find CUDA_HOME. CUDA must be available to build the package.

Does vLLM support Mac/Metal/MPS?

infwinston commented 11 months ago

I don't think so at the moment. I'd suggest asking in their repo: https://github.com/vllm-project/vllm

weiminw commented 11 months ago

For text generation, we have vLLM support, which is a high-throughput inference engine. You can easily launch it by following this doc: https://github.com/lm-sys/FastChat/blob/main/docs/vllm_integration.md

For embedding generation, we don't support a high-throughput engine yet. There is a recent one by Hugging Face; see if it fits your needs: https://github.com/huggingface/text-embeddings-inference

I followed the doc and used vllm_worker, but I don't know how to do batch inference through the RESTful API. Is there any code or documentation I can follow for guidance? Many thanks.

infwinston commented 11 months ago

You can set up an OpenAI-compatible API server with https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md, and then you get a RESTful API to which you can send requests in parallel with Python's concurrent.futures, multiprocessing, etc.
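
A minimal client-side sketch with requests plus concurrent.futures against the OpenAI-compatible /v1/chat/completions route (the host, port, and model name are placeholders; adjust them to your deployment):

import concurrent.futures

import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint

def query(prompt: str) -> str:
    payload = {
        "model": "vicuna-7b-v1.5",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    resp = requests.post(API_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompts = [f"Classify the sentiment of sentence {i}." for i in range(32)]  # placeholder data
# A thread pool keeps many requests in flight so the vLLM worker can batch them.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(query, prompts))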

weiminw commented 11 months ago

You can set up an OpenAI-compatible API server with https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md, and then you get a RESTful API to which you can send requests in parallel with Python's concurrent.futures, multiprocessing, etc.

Thank you very much for your guidance.

wanzhenchn commented 10 months ago

You can set up an OpenAI-compatible API server with https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md, and then you get a RESTful API to which you can send requests in parallel with Python's concurrent.futures, multiprocessing, etc.

I send requests to the OpenAI API server in parallel with Python's multiprocessing; however, it seems the server doesn't process them in parallel (the elapsed time is almost the same with concurrency = 1 or 4).

@infwinston @BigAndSweet

What could be the problem? The code is below:

# Launch the controller
python3 -m fastchat.serve.controller \
  --host 0.0.0.0 \
  --port 10011 &

# Launch the vLLM worker and register it with the controller
python3 -m fastchat.serve.vllm_worker \
  --model-path ${lora_model_path} \
  --controller-address http://0.0.0.0:10010 \
  --host 0.0.0.0 \
  --port 80 &

# Launch the OpenAI-compatible API server
python3 -m fastchat.serve.openai_api_server \
  --host 0.0.0.0 \
  --port 82

import multiprocessing as mp

import requests


def get_response(server_addr: str, prompt: str, max_new_tokens: int, model_name: str):
    # Send a single chat-completion request to the OpenAI-compatible server.
    headers = {"Content-Type": "application/json"}
    req = {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_new_tokens,
    }
    res = requests.post(server_addr, headers=headers, json=req, stream=False)
    return res


def inference(server_addr: str, max_new_tokens: int, model_name: str,
              req_queue: mp.Queue, res_queue: mp.Queue):
    # Drain the shared request queue, sending one request at a time from this worker.
    while not req_queue.empty():
        prompt = req_queue.get()
        get_response(server_addr, prompt, max_new_tokens, model_name)
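
(The code that actually spawns the processes isn't shown above; presumably it is something roughly like the sketch below, where the server address, model name, token budget, worker count, and prompts are all placeholders.)

if __name__ == "__main__":
    server_addr = "http://0.0.0.0:82/v1/chat/completions"  # placeholder address
    req_queue, res_queue = mp.Queue(), mp.Queue()
    for prompt in [f"prompt {i}" for i in range(16)]:  # placeholder prompts
        req_queue.put(prompt)
    # Several worker processes drain the shared queue concurrently.
    workers = [
        mp.Process(target=inference,
                   args=(server_addr, 128, "vicuna-7b-v1.5", req_queue, res_queue))
        for _ in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()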

infwinston commented 10 months ago

Hmm, I'm not sure what exactly the problem was. But you can just use the OpenAI SDK to get the answer, which will simplify the implementation quite a bit. Could you try our concurrent.futures-based code here? https://github.com/lm-sys/FastChat/blob/77932a1eb3dbb385fbf4530f0edc76f1c8c621bc/fastchat/llm_judge/gen_api_answer.py#L128
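
For reference, a minimal sketch of that pattern with the pre-1.0 openai package pointed at a FastChat OpenAI-compatible server (the base URL, model name, and worker count are placeholders):

import concurrent.futures

import openai

openai.api_key = "EMPTY"  # FastChat's server does not check the key
openai.api_base = "http://localhost:8000/v1"  # placeholder base URL

def ask(prompt: str) -> str:
    completion = openai.ChatCompletion.create(
        model="vicuna-7b-v1.5",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return completion.choices[0].message.content

prompts = [f"Summarize sentence {i}." for i in range(16)]  # placeholder data
# Threads keep multiple requests in flight so the backend can batch them.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(ask, prompts))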