h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/

Issue with Concurrent Query Processing and Document Upload #1848

Open llmwesee opened 2 months ago

llmwesee commented 2 months ago

I have implemented a solution using vLLM on an A100 server to support multiple users. However, I have encountered an issue:

While one user's query is being processed, other users are unable to upload documents into the UserData or MyData collections. The document upload process gets stuck at the processing stage without any errors appearing in the terminal or UI. Additionally, the document is not uploaded successfully.

Can you suggest ways to decouple the query processing, document upload, and user interface programs so they can run independently of each other?

Alternatively, can we build or use prebuilt separate APIs to manage these programs in the backend? Please provide suggestions or potential solutions.

pseudotensor commented 1 month ago

They should all be independent unless you changed CONCURRENCY_COUNT to be 1. This is tested normally. The backend has no issues with this at all.

pseudotensor commented 1 month ago

Once you have that working, I can explain how to make it even more efficient using the function_server.

llmwesee commented 1 month ago

This is the command for running h2oGPT with login:

python generate.py --base_model=meta-llama/Meta-Llama-3.1-8B-Instruct --score_model=None --langchain_mode='UserData' --user_path=user_path --auth='' --use_auth_token=True --visible_visible_models=False --max_seq_len=8192 --max_max_new_tokens=4096 --max_new_tokens=4096 --min_new_tokens=256

Can you show me some examples of running h2oGPT as a fully backend server, with full functionality from query processing to document uploading, for multiple users concurrently and independently? I want to integrate its backend with a React or Next.js frontend, with the same functionality as h2oGPT, and have a data lake for all document-related storage.
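
For context, here is the kind of headless backend usage I mean: a rough sketch that queries the server over its Gradio client API, the same HTTP API a React or Next.js frontend could call. It assumes the server is at localhost:7860 and uses the /submit_nochat_api endpoint from the repo's client examples; the document-upload endpoints (upload_api / add_file_api) are referenced later in this thread, and their exact signatures would need to be checked against the repo's client test code.

```python
# Rough sketch: query a running h2oGPT server over its Gradio client API.
# Assumptions: server at localhost:7860, gradio_client installed, and the
# /submit_nochat_api endpoint accepting a stringified dict of kwargs as in
# the repo's client examples.
import ast
from gradio_client import Client

client = Client("http://localhost:7860")

kwargs = dict(
    instruction_nochat="Summarize the key points of the uploaded documents.",
    langchain_mode="UserData",  # which collection to query (assumed key name)
)
res = client.predict(str(kwargs), api_name="/submit_nochat_api")

# The server returns a stringified dict; 'response' holds the answer text.
print(ast.literal_eval(res)["response"])
```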

pseudotensor commented 1 month ago

I guess I'd need to ask how you see things getting blocked. E.g. if you have pytest test code that shows how things block each other (e.g. a long add of a doc in one test while chat is blocked in another test, run with -n 2), or you show a video of the UI and what you are doing, then I can mimic it and see whether I can reproduce what you are seeing.
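
Something like this rough pytest sketch is what I mean; run it with pytest -n 2 (pytest-xdist) so the two tests land on separate workers. The endpoint names are the ones referenced in this thread and the repo's client examples, and the /add_file_api argument layout below is only an assumption to adapt:

```python
# Rough reproducer sketch; run with: pytest -n 2  (requires pytest-xdist).
# Assumptions: h2oGPT Gradio server at localhost:7860, gradio_client installed;
# the /add_file_api argument layout is a guess and should be adapted to the
# real client signature.
import ast
from gradio_client import Client

HOST = "http://localhost:7860"


def test_long_doc_add():
    # Slow path: ingest a large document into MyData.
    client = Client(HOST)
    client.predict("big_document.pdf", "MyData", api_name="/add_file_api")


def test_chat_not_blocked():
    # Fast path: if uploads and chat are truly independent, this should return
    # promptly even while the other worker is still ingesting the document.
    client = Client(HOST)
    res = client.predict(str(dict(instruction_nochat="Say hello.")),
                         api_name="/submit_nochat_api")
    assert ast.literal_eval(res)["response"]
```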

pseudotensor commented 1 month ago

As for the function server, you can try it. Just add to CLI:

 --function_server=True --function_server_workers=5 --multiple_workers_gunicorn=True --function_server_port=5002 --function_api_key=API_KEY

llmwesee commented 1 month ago

The function server has an issue when hit through upload_api and add_file_api:

Traceback (most recent call last):
  File "/home/abc/Documents/xxxx/xxxx/src/gpt_langchain.py", line 9383, in update_user_db
    return _update_user_db(file, db1s=db1s,
  File "/home/xxxx/src/gpt_langchain.py", line 9664, in _update_user_db
    sources = call_function_server('0.0.0.0', function_server_port, 'path_to_docs', (file,), simple_kwargs,
  File "/home/xxxx/src/function_client.py", line 50, in call_function_server
    execute_result = execute_function_on_server(host, port, function_name, args, kwargs, use_disk, use_pickle,
  File "/home/xxxx/src/function_client.py", line 21, in execute_function_on_server
    response = requests.post(url, json=payload, headers=headers)
  File "/home/xxxx/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/home/xxxx/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/xxxx/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/xxxx/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/xxxx/lib/python3.10/site-packages/requests/adapters.py", line 700, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=5002): Max retries exceeded with url: /execute_function/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1deb5867a0>: Failed to establish a new connection: [Errno 111] Connection refused'))

pseudotensor commented 1 month ago

It looks like the function server isn't even up. Perhaps something else is already using that port; check the startup logs.
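
As a quick, generic check (plain Python sockets, nothing h2oGPT-specific), you can confirm whether anything is listening on the function server port from the flags above:

```python
# Probe the function server port (5002 per the CLI flags above).
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(2)
    result = s.connect_ex(("127.0.0.1", 5002))

print("port 5002 is open" if result == 0 else f"nothing listening on 5002 (errno {result})")
```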

llmwesee commented 1 month ago

> They should all be independent unless you changed CONCURRENCY_COUNT to be 1. This is tested normally. The backend has no issues with this at all.

When setting the concurrency count to 64:

python generate.py --base_model=meta-llama/Meta-Llama-3.1-8B-Instruct --score_model=None --langchain_mode='UserData' --user_path=user_path --use_auth_token=True --visible_visible_models=False --max_seq_len=8192 --max_max_new_tokens=4096 --max_new_tokens=4096 --min_new_tokens=256 --api_open=True --allow_api=True --max_quality=True --function_server=True --function_server_workers=5 --multiple_workers_gunicorn=True --function_server_port=5002 --function_api_key=API_KEY --concurrency_count=64

then the following error is shown:

File "/home/xxxx/src/gen.py", line 1736, in main
    raise ValueError(
ValueError: Concurrency count > 1 will lead to mixup in cache use for local LLMs, disable this raise at own risk.

pseudotensor commented 1 month ago

Correct. I recommend vLLM for handling concurrency well; transformers itself is not thread safe.
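
For example (a sketch only; double-check the exact inference_server syntax against the repo's inference-server docs), serve the model with vLLM and point generate.py at it instead of loading the model locally with transformers, after which a higher --concurrency_count no longer hits the local-LLM cache check:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 5000

python generate.py --inference_server=vllm:0.0.0.0:5000 --base_model=meta-llama/Meta-Llama-3.1-8B-Instruct --langchain_mode='UserData' --user_path=user_path --concurrency_count=64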