Reminder
System Info
llamafactory version: 0.8.3.dev0
Reproduction
CUDA_VISIBLE_DEVICES=0 API_PORT=8244 llamafactory-cli api examples/inference/llama3.yaml
CUDA_VISIBLE_DEVICES=0 API_PORT=8244 llamafactory-cli api examples/inference/llama3_vllm.yaml
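For reference, the 500 error shown below can be triggered with a single HTTP call once the server is up. The sketch below is not part of the original report; it simply replays the request that appears in the log further down (apparently the default "Try it out" payload from the /docs Swagger page, trimmed to the fields that matter), against the port used in the commands above.

import requests

# Same placeholder payload that the server logs below; note that "parameters" is an
# empty object, which is what later triggers the KeyError in the tool formatter.
payload = {
    "model": "string",
    "messages": [{"role": "user", "content": "string"}],
    "tools": [
        {
            "type": "function",
            "function": {"name": "string", "description": "string", "parameters": {}},
        }
    ],
}
resp = requests.post("http://localhost:8244/v1/chat/completions", json=payload)
print(resp.status_code)  # 500, with the traceback shown further down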
Expected behavior
[INFO|tokenization_utils_base.py:2106] 2024-06-20 22:48:38,089 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2106] 2024-06-20 22:48:38,089 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2106] 2024-06-20 22:48:38,089 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2106] 2024-06-20 22:48:38,089 >> loading file tokenizer_config.json
[WARNING|logging.py:314] 2024-06-20 22:48:39,221 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
06/20/2024 22:48:39 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
06/20/2024 22:48:39 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
[INFO|configuration_utils.py:731] 2024-06-20 22:48:39,435 >> loading configuration file /home/work/wenku_yq/DataVault/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:796] 2024-06-20 22:48:39,436 >> Model config LlamaConfig {
"_name_or_path": "/home/work/wenku_yq/DataVault/models/Meta-Llama-3-8B-Instruct/",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128001,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.41.2",
"use_cache": true,
"vocab_size": 128256
}
06/20/2024 22:48:39 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.
[INFO|modeling_utils.py:3471] 2024-06-20 22:48:39,519 >> loading weights file /home/work/wenku_yq/DataVault/models/Meta-Llama-3-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1519] 2024-06-20 22:48:39,527 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:962] 2024-06-20 22:48:39,529 >> Generate config GenerationConfig {
"bos_token_id": 128000,
"eos_token_id": 128001
}
Loading checkpoint shards: 100%|██████████| 4/4 [03:04<00:00, 46.20s/it]
[INFO|modeling_utils.py:4280] 2024-06-20 22:51:46,632 >> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4288] 2024-06-20 22:51:46,632 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /home/work/wenku_yq/DataVault/models/Meta-Llama-3-8B-Instruct/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:915] 2024-06-20 22:51:46,654 >> loading configuration file /home/work/wenku_yq/DataVault/models/Meta-Llama-3-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:962] 2024-06-20 22:51:46,655 >> Generate config GenerationConfig {
"bos_token_id": 128000,
"eos_token_id": 128001
}
06/20/2024 22:51:46 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
06/20/2024 22:51:46 - INFO - llamafactory.model.loader - all params: 8030261248
Visit http://localhost:8244/docs for API document.
INFO: Started server process [74545]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8244 (Press CTRL+C to quit)
INFO: 127.0.0.1:29184 - "GET /docs HTTP/1.1" 200 OK
INFO: 127.0.0.1:29184 - "GET /openapi.json HTTP/1.1" 200 OK
INFO: 127.0.0.1:25164 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 127.0.0.1:25164 - "GET /v1/models HTTP/1.1" 200 OK
06/20/2024 22:52:42 - INFO - llamafactory.api.chat - ==== request ====
{
"model": "string",
"messages": [
{
"role": "user",
"content": "string",
"tool_calls": [
{
"id": "string",
"type": "function",
"function": {
"name": "string",
"arguments": "string"
}
}
]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "string",
"description": "string",
"parameters": {}
}
}
],
"do_sample": true,
"temperature": 0.0,
"top_p": 0.0,
"n": 1,
"max_tokens": 0,
"stop": "string",
"stream": false
}
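Note that "parameters" in the request above is an empty object. For comparison, a well-formed tools entry carries an OpenAI-style JSON Schema with a "properties" key. The sketch below (a hypothetical get_weather function, not taken from the original report) shows the expected shape as a Python dict:

# Hypothetical, well-formed tools entry: the default tool formatter expects a JSON
# Schema under "parameters", including a "properties" mapping. Names are illustrative.
well_formed_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "Name of the city"}
            },
            "required": ["city"],
        },
    },
}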
INFO: 127.0.0.1:33696 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/middleware/cors.py", line 93, in __call__
await self.simple_response(scope, receive, send, request_headers=headers)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/middleware/cors.py", line 148, in simple_response
await self.app(scope, receive, send)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/home/users/liuyi33/temp/workshop/LLaMA-Factory/src/llamafactory/api/app.py", line 99, in create_chat_completion
return await create_chat_completion_response(request, chat_model)
File "/home/users/liuyi33/temp/workshop/LLaMA-Factory/src/llamafactory/api/chat.py", line 148, in create_chat_completion_response
responses = await chat_model.achat(
File "/home/users/liuyi33/temp/workshop/LLaMA-Factory/src/llamafactory/chat/chat_model.py", line 72, in achat
return await self.engine.chat(messages, system, tools, image, **input_kwargs)
File "/home/users/liuyi33/temp/workshop/LLaMA-Factory/src/llamafactory/chat/hf_engine.py", line 296, in chat
return await loop.run_in_executor(pool, self._chat, *input_args)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/users/liuyi33/.conda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/users/liuyi33/temp/workshop/LLaMA-Factory/src/llamafactory/chat/hf_engine.py", line 186, in _chat
gen_kwargs, prompt_length = HuggingfaceEngine._process_args(
File "/home/users/liuyi33/temp/workshop/LLaMA-Factory/src/llamafactory/chat/hf_engine.py", line 95, in _process_args
prompt_ids, _ = template.encode_oneturn(
File "/home/users/liuyi33/temp/workshop/LLaMA-Factory/src/llamafactory/data/template.py", line 60, in encode_oneturn
encoded_pairs = self._encode(tokenizer, messages, system, tools, cutoff_len, reserved_label_len)
File "/home/users/liuyi33/temp/workshop/LLaMA-Factory/src/llamafactory/data/template.py", line 111, in _encode
tool_text = self.format_tools.apply(content=tools)[0] if tools else ""
File "/home/users/liuyi33/temp/workshop/LLaMA-Factory/src/llamafactory/data/formatter.py", line 225, in apply
return [self._tool_formatter(tools) if len(tools) != 0 else ""]
File "/home/users/liuyi33/temp/workshop/LLaMA-Factory/src/llamafactory/data/formatter.py", line 47, in default_tool_formatter
for name, param in tool["parameters"]["properties"].items():
KeyError: 'properties'
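The last frame explains the failure: default_tool_formatter iterates tool["parameters"]["properties"], and the request above sends "parameters": {} with no "properties" key, so the lookup raises KeyError. Below is a minimal standalone sketch of just that failing access (not LLaMA-Factory's actual formatter code):

# Reproduces only the failing dictionary access from formatter.py line 47;
# the real formatter builds a tool-description prompt from the schema.
def iterate_tool_properties(tool: dict) -> list:
    names = []
    for name, param in tool["parameters"]["properties"].items():  # KeyError when "properties" is absent
        names.append(name)
    return names

empty_params_tool = {"name": "string", "description": "string", "parameters": {}}
try:
    iterate_tool_properties(empty_params_tool)
except KeyError as err:
    print("KeyError:", err)  # KeyError: 'properties'

With a schema that does include "properties" (like the get_weather example above), this loop runs without raising.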
It errors out as soon as I use it; both vLLM and regular inference raise this error.
Others
No response