llmwesee opened this issue 2 months ago
They should all be independent unless you changed CONCURRENCY_COUNT to be 1. This is tested normally. The backend has no issues with this at all.
Once you have that working, I can explain how to make it even more efficient using the function_server.
This is the command for running h2ogpt with login:
python generate.py --base_model=meta-llama/Meta-Llama-3.1-8B-Instruct --score_model=None --langchain_mode='UserData' --user_path=user_path --auth='' --use_auth_token=True --visible_visible_models=False --max_seq_len=8192 --max_max_new_tokens=4096 --max_new_tokens=4096 --min_new_tokens=256
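(For reference, a minimal Python client against a server launched this way might look roughly like the sketch below. The host URL, credentials, and prompt are placeholders, and /submit_nochat_api is h2ogpt's gradio_client no-chat endpoint; adjust to your setup.)

```python
# Minimal sketch of a Python client for an h2ogpt server started with --auth.
# Host URL, username, and password below are placeholders, not from this issue.
import ast
from gradio_client import Client

HOST_URL = "http://localhost:7860"                       # assumed default h2ogpt port
client = Client(HOST_URL, auth=("username", "password")) # credentials configured via --auth

# /submit_nochat_api takes a str(dict) of kwargs and returns a str(dict) result
kwargs = dict(instruction_nochat="Summarize my documents.", langchain_mode="UserData")
res = client.predict(str(dict(kwargs)), api_name="/submit_nochat_api")
print(ast.literal_eval(res)["response"])
```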
Can you show me some examples of running h2ogpt as a full backend server, with full functionality from query processing to document uploading, for multiple users concurrently and independently? I want to integrate its backend with a React or Next.js frontend that offers the same functionality as h2ogpt, plus a data lake for all document-related storage.
I guess I'd need to ask how you see things being blocked. E.g., if you have pytest code that shows how things block each other (e.g., a long doc add, and then chat is blocked in another test run with -n 2, as in the sketch below), or you show a video of the UI and what you are doing, I can mimic it and see if I can reproduce what you are seeing.
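For example, a rough sketch of such a test (host, prompts, and structure are my assumptions, not an existing h2ogpt test), run with pytest -n 2 so the two tests land in separate workers:

```python
# test_concurrency.py -- rough sketch, not an actual h2ogpt test file.
# Run with: pytest -n 2 test_concurrency.py  (requires pytest-xdist)
import ast
from gradio_client import Client

HOST_URL = "http://localhost:7860"  # assumed local h2ogpt server

def ask(prompt):
    client = Client(HOST_URL)
    res = client.predict(str(dict(instruction_nochat=prompt)),
                         api_name="/submit_nochat_api")
    return ast.literal_eval(res)["response"]

def test_long_generation():
    # deliberately long generation in one xdist worker
    assert ask("Write a very detailed essay about operating systems.")

def test_quick_chat():
    # quick query in the other worker; if the server serializes requests,
    # this test's wall time will roughly include the long generation above
    assert ask("Say hi.")
```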
As for the function server, you can try it. Just add to CLI:
--function_server=True --function_server_workers=5 --multiple_workers_gunicorn=True --function_server_port=5002 --function_api_key=API_KEY
The function server has an issue when hit through upload_api and add_file_api:
Traceback (most recent call last):
File "/home/abc/Documents/xxxx/xxxx/src/gpt_langchain.py", line 9383, in update_user_db
return _update_user_db(file, db1s=db1s,
File "/home/xxxx/src/gpt_langchain.py", line 9664, in _update_user_db
sources = call_function_server('0.0.0.0', function_server_port, 'path_to_docs', (file,), simple_kwargs,
File "/home/xxxx/src/function_client.py", line 50, in call_function_server
execute_result = execute_function_on_server(host, port, function_name, args, kwargs, use_disk, use_pickle,
File "/home/xxxx/src/function_client.py", line 21, in execute_function_on_server
response = requests.post(url, json=payload, headers=headers)
File "/home/xxxx/lib/python3.10/site-packages/requests/api.py", line 115, in post
return request("post", url, data=data, json=json, **kwargs)
File "/home/xxxx/lib/python3.10/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/home/xxxx/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/home/xxxx/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/home/xxxx/lib/python3.10/site-packages/requests/adapters.py", line 700, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=5002): Max retries exceeded with url: /execute_function/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1deb5867a0>: Failed to establish a new connection: [Errno 111] Connection refused'))
It just looks like the function server isn't even up. Perhaps you have something else on that port etc. Check startup logs.
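A quick way to confirm whether anything is listening on that port (host and port taken from the flags above) is a small check like this:

```python
# Check whether the function server port accepts TCP connections.
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(2)
    reachable = s.connect_ex(("127.0.0.1", 5002)) == 0  # port from --function_server_port
print("function server reachable" if reachable else "nothing listening on port 5002")
```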
They should all be independent unless you changed CONCURRENCY_COUNT to be 1. This is tested normally. The backend has no issues with this at all.
When setting the concurrency count to 64:
python generate.py --base_model=meta-llama/Meta-Llama-3.1-8B-Instruct --score_model=None --langchain_mode='UserData' --user_path=user_path --use_auth_token=True --visible_visible_models=False --max_seq_len=8192 --max_max_new_tokens=4096 --max_new_tokens=4096 --min_new_tokens=256 --api_open=True --allow_api=True --max_quality=True --function_server=True --function_server_workers=5 --multiple_workers_gunicorn=True --function_server_port=5002 --function_api_key=API_KEY --concurrency_count=64
then the following error is shown:
File "/home/xxxx/src/gen.py", line 1736, in main
raise ValueError(
ValueError: Concurrency count > 1 will lead to mixup in cache use for local LLMs, disable this raise at own risk.
Correct. I recommend vLLM for handling concurrency well; transformers itself is not thread safe.
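Roughly, that means serving the model with vLLM's OpenAI-compatible server and pointing generate.py at it via --inference_server, instead of loading the model in-process with transformers. Host and port below are placeholders; check the exact flags for your h2ogpt and vLLM versions:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 5000

python generate.py --inference_server="vllm:0.0.0.0:5000" --base_model=meta-llama/Meta-Llama-3.1-8B-Instruct --langchain_mode='UserData' --user_path=user_path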
I have implemented a solution using vLLM on an A100 server to support multiple users. However, I have encountered an issue:
While one user's query is being processed, other users are unable to upload documents into the UserData or MyData collections. The document upload process gets stuck at the processing stage without any errors appearing in the terminal or UI, and the document is not uploaded successfully. Can you suggest ways to decouple query processing, document upload, and the user interface so they can run independently of each other?
Alternatively, can we build or use prebuilt separate APIs to manage these programs in the backend? Please provide suggestions or potential solutions.