netease-youdao / QAnything

Question and Answer based on Anything.
https://qanything.ai
GNU Affero General Public License v3.0

[BUG] GPU selection in the Python version #459

Open CodeLyokoscj opened 4 months ago

CodeLyokoscj commented 4 months ago

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

当前行为 | Current Behavior

Goal: when running the 7B model (Qwen-7B-QAnything), move the GPU from the default card 0 to a specific card (e.g. card 2).
Action: changed the environment variable CUDA_VISIBLE_DEVICES=0 in QAnything/scripts/base_run.sh to CUDA_VISIBLE_DEVICES=2.

期望行为 | Expected Behavior

The model runs on cuda:2.
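For context, this is a minimal pure-Python sketch of how `CUDA_VISIBLE_DEVICES` is expected to behave; the `visible_devices` helper is hypothetical and only models the CUDA runtime's masking rule rather than calling it. The mask hides all other cards and renumbers the visible ones from 0, so physical card 2 should appear inside the process as `cuda:0`.

```python
def visible_devices(physical_ids, env):
    """Model how CUDA_VISIBLE_DEVICES masks GPUs: only the listed
    physical cards are visible, renumbered from index 0 in list order."""
    mask = env.get("CUDA_VISIBLE_DEVICES")
    if mask is None:
        # No mask set: every physical card is visible.
        return list(physical_ids)
    wanted = [int(x) for x in mask.split(",") if x.strip()]
    return [p for p in wanted if p in physical_ids]

# With CUDA_VISIBLE_DEVICES=2 on a 4-GPU host, the process sees exactly
# one device: physical card 2, addressed inside the process as cuda:0.
print(visible_devices([0, 1, 2, 3], {"CUDA_VISIBLE_DEVICES": "2"}))  # [2]
print(visible_devices([0, 1, 2, 3], {}))                             # [0, 1, 2, 3]
```

Note the variable must be set before CUDA is initialized in the process; changing it afterwards has no effect.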

运行环境 | Environment

- OS:centos:7
- NVIDIA Driver:550.54.15
- CUDA:12.4
- docker:/
- docker-compose:/
- NVIDIA GPU: NVIDIA L40
- NVIDIA GPU Memory: 48GB

QAnything日志 | QAnything logs

About to start the backend service; once it starts successfully, copy [http://0.0.0.0:7811/qanything/] into a browser to test. The command that runs qanything-server is:

```
CUDA_VISIBLE_DEVICES=2 python3 -m qanything_kernel.qanything_server.sanic_api --host 0.0.0.0 --port 7811 --model_size 7B --device_id 0
LOCAL DATA PATH: /home/demo/miniconda3/envs/scj_qanything/QAnything/QANY_DB/content
LOCAL_RERANK_REPO: netease-youdao/bce-reranker-base_v1
LOCAL_EMBED_REPO: netease-youdao/bce-embedding-base_v1
2024-07-30 10:26:52,949 - modelscope - INFO - PyTorch version 2.1.2 Found.
2024-07-30 10:26:52,950 - modelscope - INFO - Loading ast index from /home/demo/.cache/modelscope/ast_indexer
2024-07-30 10:26:52,977 - modelscope - INFO - Loading done! Current index file version is 1.13.0, with md5 3c282ca1588864182fae1147db03023e and a total number of 972 components indexed
use_cpu: False
use_openai_api: False
The server is starting on port: 7811
onnxruntime-gpu 1.17.1 is already installed.
vllm 0.2.7 is already installed.
lalala:1
2024-07-30 10:26:53,596 GPU memory: 45GB
2024-07-30 10:26:53,597 GPU memory utilization: 0.9
2024-07-30 10:26:53,598 /home/demo/miniconda3/envs/scj_qanything/QAnything/assets/custom_models/netease-youdao/Qwen-7B-QAnything already exists; skipping the model download (delete this directory manually if the download failed)
2024-07-30 10:26:53,598 CUDA_DEVICE: 0
......
[2024-07-30 10:26:58 +0800] [40625] [WARNING] Sanic is running in PRODUCTION mode. Consider using '--debug' or '--dev' while actively developing your application.
[2024-07-30 10:26:58 +0800] [40625] [INFO] Sanic Extensions:
[2024-07-30 10:26:58 +0800] [40625] [INFO]   > injection [0 dependencies; 0 constants]
[2024-07-30 10:26:58 +0800] [40625] [INFO]   > openapi [http://0.0.0.0:7811/docs]
[2024-07-30 10:26:58 +0800] [40625] [INFO]   > http
[2024-07-30 10:26:58 +0800] [40625] [INFO]   > templating [jinja2==3.1.4]
INFO 07-30 10:26:58 llm_engine.py:70] Initializing an LLM engine with config: model='/home/demo/miniconda3/envs/scj_qanything/QAnything/assets/custom_models/netease-youdao/Qwen-7B-QAnything', tokenizer='/home/demo/miniconda3/envs/scj_qanything/QAnything/assets/custom_models/netease-youdao/Qwen-7B-QAnything', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
INFO 07-30 10:27:13 llm_engine.py:275] # GPU blocks: 89, # CPU blocks: 512
[2024-07-30 10:27:13 +0800] [40625] [ERROR] Experienced exception while trying to serve
Traceback (most recent call last):
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/mixins/startup.py", line 958, in serve_single
    worker_serve(monitor_publisher=None, **kwargs)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/worker/serve.py", line 143, in worker_serve
    raise e
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/worker/serve.py", line 117, in worker_serve
    return _serve_http_1(
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/server/runners.py", line 223, in _serve_http_1
    loop.run_until_complete(app._server_event("init", "before"))
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/app.py", line 1764, in _server_event
    await self.dispatch(
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/signals.py", line 208, in dispatch
    return await dispatch
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/signals.py", line 183, in _dispatch
    raise e
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/signals.py", line 167, in _dispatch
    retval = await maybe_coroutine
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/sanic/app.py", line 1315, in _listener
    await maybe_coro
  File "/home/demo/miniconda3/envs/scj_qanything/QAnything/qanything_kernel/qanything_server/sanic_api.py", line 203, in init_local_doc_qa
    local_doc_qa.init_cfg(args=args)
  File "/home/demo/miniconda3/envs/scj_qanything/QAnything/qanything_kernel/core/local_doc_qa.py", line 70, in init_cfg
    self.llm: OpenAICustomLLM = OpenAICustomLLM(args)
  File "/home/demo/miniconda3/envs/scj_qanything/QAnything/qanything_kernel/connector/llm/llm_for_fastchat.py", line 41, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 500, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 273, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 318, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/demo/miniconda3/envs/qanything-python/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 284, in _init_cache
    raise ValueError(
ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (1424). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
[2024-07-30 10:27:13 +0800] [40625] [INFO] Server Stopped
```

A second, nearly identical traceback (entering via `runpy._run_module_as_main` and `app.run(...)` at sanic_api.py line 258) is printed as the process exits, ending in the same ValueError.
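The numbers in the ValueError are consistent with the "# GPU blocks: 89" line in the log: vLLM allocates its KV cache in fixed-size blocks of 16 tokens each (the default block size in vLLM 0.2.x), so 89 blocks hold at most 89 × 16 = 1424 tokens, which is below the model's max_seq_len of 8192. A sketch of the arithmetic behind the check (`max_kv_tokens` is a hypothetical helper, not vLLM API):

```python
# Hypothetical helper mirroring the KV-cache capacity computation behind
# vLLM's _init_cache() check; block_size=16 is the vLLM 0.2.x default.
def max_kv_tokens(num_gpu_blocks: int, block_size: int = 16) -> int:
    return num_gpu_blocks * block_size

max_seq_len = 8192             # from the engine config in the log
capacity = max_kv_tokens(89)   # "# GPU blocks: 89" in the log
print(capacity)                # 1424, matching the ValueError message
assert capacity < max_seq_len  # hence the ValueError at engine init
```

This explains why the error message suggests raising gpu_memory_utilization (more GPU blocks) or lowering max_model_len (a smaller sequence limit): either change makes the inequality hold the other way.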

复现方法 | Steps To Reproduce

Note: reproducing this is not strictly necessary; the bug fix is given in the Remarks section below.

  1. Environment: as listed above.
  2. Config: change the environment variable CUDA_VISIBLE_DEVICES=0 in QAnything/scripts/base_run.sh to CUDA_VISIBLE_DEVICES=2
  3. run: bash scripts/run_for_7B_in_Linux_or_WSL.sh
  4. See the error:
     2024-07-30 10:26:53,598 /home/demo/miniconda3/envs/scj_qanything/QAnything/assets/custom_models/netease-youdao/Qwen-7B-QAnything already exists; skipping the model download (delete this directory manually if the download failed)
     2024-07-30 10:26:53,598 CUDA_DEVICE: 0 ← the server is evidently still running on card 0

备注 | Anything else?

Bug fix: delete line 133 (shown below) of QAnything/qanything_kernel/qanything_server/sanic_api.py to resolve the problem:
133 os.environ["CUDA_VISIBLE_DEVICES"] = args.device_id

Reason: CUDA_VISIBLE_DEVICES is already set in QAnything/scripts/base_run.sh, so there is no need to set it again in sanic_api.py. Worse, the value assigned there is device_id, which is wrong anyway (the launch command in the log passes --device_id 0). --||
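The overwrite can be sketched with a plain dict standing in for `os.environ` (`env` and `args_device_id` are illustrative stand-ins, not QAnything code):

```python
# Sketch of the clobbering described above, with a dict modeling os.environ.
# scripts/base_run.sh exports the user's GPU mask first...
env = {"CUDA_VISIBLE_DEVICES": "2"}   # set by base_run.sh

# ...then sanic_api.py line 133 overwrites it with args.device_id, which
# the launch command always passes as 0, so the server lands on card 0.
args_device_id = "0"                  # stand-in for args.device_id
env["CUDA_VISIBLE_DEVICES"] = args_device_id

print(env["CUDA_VISIBLE_DEVICES"])    # "0" -> the GPU 2 selection is lost
```

Deleting the overwriting line lets the value exported by the script survive into the vLLM engine startup.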

Other: why no PR? Because the code on GitHub is not this Python version and has no such line 133 -_-|| so filing an issue is the only option.