Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
Goal: when running the 7B model (Qwen-7B-QAnything), switch the GPU from the default card 0 to a specified card (e.g. card 2).
Action: in QAnything/scripts/base_run.sh, change the environment variable CUDA_VISIBLE_DEVICES=0 to CUDA_VISIBLE_DEVICES=2.
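For context on why the launch command pairs CUDA_VISIBLE_DEVICES=2 with --device_id 0: CUDA renumbers whatever devices the variable exposes starting from zero, so inside the process the selected card shows up as cuda:0. A minimal illustration (PyTorch; the printed values assume a multi-GPU host with the variable exported by the shell before Python starts):

```python
import os
import torch

# base_run.sh exports CUDA_VISIBLE_DEVICES=2 before Python starts, so this
# process can only see physical GPU 2. CUDA renumbers the visible devices
# from zero, so that card appears inside the process as cuda:0.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # "2"
print(torch.cuda.device_count())               # 1 -- only the exposed card
print(torch.cuda.current_device())             # 0 -- physical GPU 2, renumbered
```

So --device_id 0 is correct here: it indexes the renumbered visible set, not the physical GPUs.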
Expected Behavior
The model runs on cuda:2.
Environment
QAnything logs
About to start the backend service. Once it starts successfully, copy [http://0.0.0.0:7811/qanything/] into a browser to test it.
The command running qanything-server is:
CUDA_VISIBLE_DEVICES=2 python3 -m qanything_kernel.qanything_server.sanic_api --host 0.0.0.0 --port 7811 --model_size 7B --device_id 0
LOCAL DATA PATH: /home/demo/miniconda3/envs/scj_qanything/QAnything/QANY_DB/content
LOCAL_RERANK_REPO: netease-youdao/bce-reranker-base_v1
LOCAL_EMBED_REPO: netease-youdao/bce-embedding-base_v1
2024-07-30 10:26:52,949 - modelscope - INFO - PyTorch version 2.1.2 Found.
2024-07-30 10:26:52,950 - modelscope - INFO - Loading ast index from /home/demo/.cache/modelscope/ast_indexer
2024-07-30 10:26:52,977 - modelscope - INFO - Loading done! Current index file version is 1.13.0, with md5 3c282ca1588864182fae1147db03023e and a total number of 972 components indexed
use_cpu: False
use_openai_api: False
The server is starting on port: 7811
onnxruntime-gpu 1.17.1 is already installed.
vllm 0.2.7 is already installed.
lalala:1
2024-07-30 10:26:53,596 GPU memory: 45GB
2024-07-30 10:26:53,597 GPU memory utilization: 0.9
2024-07-30 10:26:53,598 The path /home/demo/miniconda3/envs/scj_qanything/QAnything/assets/custom_models/netease-youdao/Qwen-7B-QAnything already exists; skipping the model download (if the download failed, delete this directory manually)
2024-07-30 10:26:53,598 CUDA_DEVICE: 0
......
[2024-07-30 10:26:58 +0800] [40625] [WARNING] Sanic is running in PRODUCTION mode. Consider using '--debug' or '--dev' while actively developing your application.
[2024-07-30 10:26:58 +0800] [40625] [INFO] Sanic Extensions:
[2024-07-30 10:26:58 +0800] [40625] [INFO]   > injection [0 dependencies; 0 constants]
[2024-07-30 10:26:58 +0800] [40625] [INFO]   > openapi [http://0.0.0.0:7811/docs]
[2024-07-30 10:26:58 +0800] [40625] [INFO]   > http
[2024-07-30 10:26:58 +0800] [40625] [INFO]   > templating [jinja2==3.1.4]
INFO 07-30 10:26:58 llm_engine.py:70] Initializing an LLM engine with config: model='/home/demo/miniconda3/envs/scj_qanything/QAnything/assets/custom_models/netease-youdao/Qwen-7B-QAnything', tokenizer='/home/demo/miniconda3/envs/scj_qanything/QAnything/assets/custom_models/netease-youdao/Qwen-7B-QAnything', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
INFO 07-30 10:27:13 llm_engine.py:275] # GPU blocks: 89, # CPU blocks: 512
[2024-07-30 10:27:13 +0800] [40625] [ERROR] Experienced exception while trying to serve
Traceback (most recent call last):
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/mixins/startup.py", line 958, in serve_single
    worker_serve(monitor_publisher=None, **kwargs)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/worker/serve.py", line 143, in worker_serve
    raise e
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/worker/serve.py", line 117, in worker_serve
    return _serve_http_1(
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/server/runners.py", line 223, in _serve_http_1
    loop.run_until_complete(app._server_event("init", "before"))
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/app.py", line 1764, in _server_event
    await self.dispatch(
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/signals.py", line 208, in dispatch
    return await dispatch
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/signals.py", line 183, in _dispatch
    raise e
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/signals.py", line 167, in _dispatch
    retval = await maybe_coroutine
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/app.py", line 1315, in _listener
    await maybe_coro
  File "/home/demo/miniconda3/envs/scj_qanything/QAnything/qanything_kernel/qanything_server/sanic_api.py", line 203, in init_local_doc_qa
    local_doc_qa.init_cfg(args=args)
  File "/home/demo/miniconda3/envs/scj_qanything/QAnything/qanything_kernel/core/local_doc_qa.py", line 70, in init_cfg
    self.llm: OpenAICustomLLM = OpenAICustomLLM(args)
  File "/home/demo/miniconda3/envs/scj_qanything/QAnything/qanything_kernel/connector/llm/llm_for_fastchat.py", line 41, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 500, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 273, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 318, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 284, in _init_cache
    raise ValueError(
ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (1424). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
[2024-07-30 10:27:13 +0800] [40625] [INFO] Server Stopped
Traceback (most recent call last):
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/demo/miniconda3/envs/scj_qanything/QAnything/qanything_kernel/qanything_server/sanic_api.py", line 258, in <module>
    app.run(host=args.host, port=args.port, single_process=True, access_log=False)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/mixins/startup.py", line 215, in run
    serve(primary=self)  # type: ignore
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/mixins/startup.py", line 958, in serve_single
    worker_serve(monitor_publisher=None, **kwargs)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/worker/serve.py", line 143, in worker_serve
    raise e
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/worker/serve.py", line 117, in worker_serve
    return _serve_http_1(
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/server/runners.py", line 223, in _serve_http_1
    loop.run_until_complete(app._server_event("init", "before"))
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/app.py", line 1764, in _server_event
    await self.dispatch(
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/signals.py", line 208, in dispatch
    return await dispatch
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/signals.py", line 183, in _dispatch
    raise e
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/signals.py", line 167, in _dispatch
    retval = await maybe_coroutine
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/app.py", line 1315, in _listener
    await maybe_coro
  File "/home/demo/miniconda3/envs/scj_qanything/QAnything/qanything_kernel/qanything_server/sanic_api.py", line 203, in init_local_doc_qa
    local_doc_qa.init_cfg(args=args)
  File "/home/demo/miniconda3/envs/scj_qanything/QAnything/qanything_kernel/core/local_doc_qa.py", line 70, in init_cfg
    self.llm: OpenAICustomLLM = OpenAICustomLLM(args)
  File "/home/demo/miniconda3/envs/scj_qanything/QAnything/qanything_kernel/connector/llm/llm_for_fastchat.py", line 41, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 500, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 273, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 318, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 284, in _init_cache
    raise ValueError(
ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (1424). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
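The two knobs named in the ValueError are ordinary vLLM engine arguments. A minimal sketch of what the message suggests, with argument names as in vllm 0.2.x and an illustrative model path and values; note that in this issue the error is only a symptom of the process landing back on an already-busy GPU 0:

```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="assets/custom_models/netease-youdao/Qwen-7B-QAnything",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,  # raise from 0.9 to leave more room for the KV cache
    max_model_len=4096,           # or shrink the context window below 8192
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```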
Steps To Reproduce
Note: there is actually no need to reproduce it; I have provided the bug fix in the "Anything else?" section below.
Anything else?
Bug fix: deleting line 133 (shown below) of QAnything/qanything_kernel/qanything_server/sanic_api.py resolves the problem:
133 os.environ["CUDA_VISIBLE_DEVICES"] = args.device_id
Reason: the environment variable CUDA_VISIBLE_DEVICES is already set in QAnything/scripts/base_run.sh, so there is no need to set it again in sanic_api.py (and what gets assigned there is device_id, which is wrong anyway --||).
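A minimal sketch of the fix (the wrapper function configure_gpu is hypothetical; only the CUDA_VISIBLE_DEVICES assignment is quoted from sanic_api.py):

```python
import os

def configure_gpu(args):
    # BUG (sanic_api.py line 133): this overwrites the CUDA_VISIBLE_DEVICES
    # that scripts/base_run.sh already exported, and it assigns device_id,
    # which indexes the *visible* device set rather than naming a physical
    # GPU. With the shell exporting "2" and --device_id 0, the assignment
    # resets visibility back to physical GPU 0.
    # os.environ["CUDA_VISIBLE_DEVICES"] = args.device_id  # <- delete this line

    # If a default is still wanted, apply it only when the shell set nothing
    # (a softer alternative to deleting the line outright):
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", str(args.device_id))
```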
Other: why not submit a PR? Because the code on GitHub is not the Python version and has no line 133 -_-|| so filing an issue is the only option.