Reminder
System Info
llamafactory version: 0.8.3.dev0
Reproduction
2024-06-26 10:34:05,491 INFO worker.py:1770 -- Started a local Ray instance.
INFO 06-26 10:34:06 config.py:623] Defaulting to use mp for distributed inference
Traceback (most recent call last):
  File "/root/anaconda3/envs/hxx2/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/home/hxx/LLaMA-Factory-main/src/llamafactory/cli.py", line 79, in main
    run_api()
  File "/home/hxx/LLaMA-Factory-main/src/llamafactory/api/app.py", line 117, in run_api
    chat_model = ChatModel()
  File "/home/hxx/LLaMA-Factory-main/src/llamafactory/chat/chat_model.py", line 45, in __init__
    self.engine: "BaseEngine" = VllmEngine(model_args, data_args, finetuning_args, generating_args)
  File "/home/hxx/LLaMA-Factory-main/src/llamafactory/chat/vllm_engine.py", line 94, in __init__
    self.model = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**engine_args))
  File "/root/anaconda3/envs/hxx2/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 371, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/root/anaconda3/envs/hxx2/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 766, in create_engine_config
    return EngineConfig(model_config=model_config,
  File "<string>", line 13, in __init__
  File "/root/anaconda3/envs/hxx2/lib/python3.10/site-packages/vllm/config.py", line 1378, in __post_init__
    self.model_config.verify_with_parallel_config(self.parallel_config)
  File "/root/anaconda3/envs/hxx2/lib/python3.10/site-packages/vllm/config.py", line 235, in verify_with_parallel_config
    raise ValueError(
ValueError: Total number of attention heads (28) must be divisible by tensor parallel size (8).
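
Context on the error: vLLM shards the attention heads evenly across tensor-parallel ranks, so the head count must be divisible by the tensor parallel size. With 28 heads (consistent with e.g. Qwen2-7B, though the model is not named in the log), a tensor parallel size of 8 cannot work; only the divisors of 28 can. A minimal sketch of the constraint (not vLLM's actual code; the only value taken from the report is the head count of 28):

```python
# Sketch of the divisibility check behind the ValueError above.
# Assumption: num_attention_heads = 28, as reported in the error message.
num_attention_heads = 28

for tp_size in (1, 2, 4, 7, 8, 14, 28):
    if num_attention_heads % tp_size == 0:
        print(f"tensor_parallel_size={tp_size}: OK")
    else:
        print(f"tensor_parallel_size={tp_size}: would raise ValueError")
```

Since the tensor parallel size typically matches the number of GPUs in use, running on 1, 2, 4, 7, 14, or 28 devices (for example, restricting CUDA_VISIBLE_DEVICES to four GPUs) should satisfy the check. Whether LLaMA-Factory derives the tensor parallel size from the visible device count is an assumption here, not something the log confirms.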
Expected behavior
No response
Others
No response