BigAndSweet opened 1 year ago
Try doing inference with https://github.com/huggingface/text-generation-inference , it has continuous batching, which should help with higher throughput and reduced latency with caching.
@infwinston Hi Wei-Lin, is TGI the solution you would recommend for parallel requests?
For text generation, we have vLLM support, which is a high-throughput inference engine for serving requests. You can easily launch it by following this doc: https://github.com/lm-sys/FastChat/blob/main/docs/vllm_integration.md
For embedding generation, we haven't supported any high-throughput engine yet. There is a recent one by HuggingFace; see if it fits your need: https://github.com/huggingface/text-embeddings-inference
Thank you for the reply. I will give it a try. Meanwhile, here is the link to the Text Generation Inference (TGI) project I mentioned above: https://github.com/huggingface/text-generation-inference
Hi Wei-Lin,
I ran into this error when running pip install vllm: RuntimeError: Cannot find CUDA_HOME. CUDA must be available to build the package.
Does vLLM support Mac/Metal/MPS?
I don't think so atm. I'd suggest asking in their repo: https://github.com/vllm-project/vllm
I used vllm_worker as described in the documentation, but I don't know how to use the RESTful API for batch inference. Is there any code or documentation that could guide me? Many thanks.
You can set up an OpenAI API server with https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md and then you get a RESTful API, to which you can send requests in parallel with Python's concurrent.futures, multiprocessing, etc.
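As a concrete illustration, here is a minimal sketch that fans requests out with concurrent.futures against a locally launched server. The host, port, and model name are assumptions; adjust them to your deployment.

import concurrent.futures
import requests

# Assumed endpoint of the local FastChat OpenAI-compatible server; adjust host/port.
API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "vicuna-7b-v1.5"  # hypothetical model name; use whatever your worker serves

def ask(prompt: str) -> str:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    resp = requests.post(API_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompts = [f"Summarize item {i}" for i in range(32)]

# A thread pool keeps several requests in flight so the server can overlap them.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(ask, prompts))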
Thank you very much for your guidance.
I sent requests to the OpenAI API server in parallel with Python's multiprocessing; however, it seems the server doesn't run them in parallel (the elapsed time is almost the same with concurrency = 1 or 4).
@infwinston @BigAndSweet
What could be the problem? The code is below:
# launch the controller
python3 -m fastchat.serve.controller \
    --host 0.0.0.0 \
    --port 10011 &
# launch the vLLM worker and register it with the controller
python3 -m fastchat.serve.vllm_worker \
    --model-path ${lora_model_path} \
    --controller-address http://0.0.0.0:10010 \
    --host 0.0.0.0 \
    --port 80 &
# launch the OpenAI-compatible API server
python3 -m fastchat.serve.openai_api_server \
    --host 0.0.0.0 \
    --port 82
import multiprocessing as mp
import time

procs = []
res_que = mp.Queue()
_start = time.perf_counter()
# spawn one worker process per concurrent client
for i in range(concurrency):
    proc = mp.Process(target=inference,
                      args=(server_addr, max_new_tokens, model_name, req_queue, res_que))
    procs.append(proc)
    proc.start()
for proc in procs:
    proc.join()
_end = time.perf_counter()
elapsed_time = _end - _start
import requests

def get_response(server_addr: str, prompt: str, max_new_tokens: int, model_name: str):
    headers = {"Content-Type": "application/json"}
    req = {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_new_tokens,
    }
    res = requests.post(server_addr, headers=headers, json=req, stream=False)

def inference(server_addr: str, max_new_tokens: int, model_name: str,
              req_queue: mp.Queue, res_queue: mp.Queue):
    # drain the shared request queue, sending one request at a time per process
    while not req_queue.empty():
        prompt = req_queue.get()
        get_response(server_addr, prompt, max_new_tokens, model_name)
Hmm, I'm not sure what exactly the problem was. But indeed you can just use the OpenAI SDK to get the answer, which will simplify the implementation quite a bit. Could you try our concurrent.futures-based code here?
https://github.com/lm-sys/FastChat/blob/77932a1eb3dbb385fbf4530f0edc76f1c8c621bc/fastchat/llm_judge/gen_api_answer.py#L128
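For reference, a rough sketch along those lines, using the legacy (pre-1.0) openai Python SDK pointed at the local server together with concurrent.futures. The base URL, port, and model name are assumptions; adjust them to your setup.

import concurrent.futures
import openai  # legacy (pre-1.0) SDK interface

# Point the SDK at the local FastChat OpenAI-compatible server; adjust host/port as needed.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

MODEL = "vicuna-7b-v1.5"  # assumed model name registered with the worker

def get_answer(prompt: str) -> str:
    completion = openai.ChatCompletion.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return completion["choices"][0]["message"]["content"]

prompts = ["Question one?", "Question two?", "Question three?"]

# Same pattern as gen_api_answer.py: submit every prompt to a thread pool and collect results.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(get_answer, p) for p in prompts]
    answers = [f.result() for f in futures]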
I have to get vector embeddings / do sentiment analysis for millions of sentences. Currently, it seems that I can only process sentences one by one with the RESTful API server launched in the background. It is very slow and much GPU memory is unoccupied. Is there any method so that I can process sentences with the FastChat API in parallel or in a batch style?