openppl-public / ppl.llm.serving


Error when running serving #29

Closed maiquanshen closed 9 months ago

maiquanshen commented 10 months ago

```
/data/openppl/ppl.llm.serving$ ./ppl-build/ppl_llama_server src/models/llama/conf/llama_13b_config_example.json
[INFO][2023-09-19 16:51:43.346][llama_server.cc:149] server_config.host: 0.0.0.0
[INFO][2023-09-19 16:51:43.346][llama_server.cc:150] server_config.port: 23333
[INFO][2023-09-19 16:51:43.346][llama_server.cc:152] server_config.model_dir: /data/openppl/ppl.pmx/model_zoo/llama/huggingface/llama_chinese_13b_ppl
[INFO][2023-09-19 16:51:43.346][llama_server.cc:153] server_config.model_param_path: /data/openppl/ppl.pmx/model_zoo/llama/huggingface/llama_chinese_13b_ppl/pmx_params.json
[INFO][2023-09-19 16:51:43.346][llama_server.cc:154] server_config.tokenizer_path: /data/wenda_llama/wenda-main/model/Chinese-LlaMA2-chat-7B-sft-v0.3
[INFO][2023-09-19 16:51:43.346][llama_server.cc:156] server_config.top_k: 1
[INFO][2023-09-19 16:51:43.346][llama_server.cc:157] server_config.top_p: 0
[INFO][2023-09-19 16:51:43.346][llama_server.cc:159] server_config.tensor_parallel_size: 2
[INFO][2023-09-19 16:51:43.347][llama_server.cc:160] server_config.max_tokens_scale: 0.93
[INFO][2023-09-19 16:51:43.347][llama_server.cc:161] server_config.max_tokens_per_request: 4096
[INFO][2023-09-19 16:51:43.347][llama_server.cc:162] server_config.max_running_batch: 1024
[ERROR][2023-09-19 16:51:43.347][llama_server.cc:221] find key [cache_quant_bit] failed
[ERROR][2023-09-19 16:51:43.347][llama_server.cc:561] PaseModelConfig failed, model_param_path: /data/openppl/ppl.pmx/model_zoo/llama/huggingface/llama_chinese_13b_ppl/pmx_params.json
```

The model is from huggingface. I converted it successfully with pmx, and testing it with demo.py worked, but it fails here in serving. May I also make a suggestion: could the documentation be written in more detail? For example, what is the difference between convert and export in pmx? Do both need to be run? Thanks.
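A quick way to confirm what the server is complaining about is to look for the missing key in the json it is reading (the path below is copied from the log above):

```bash
# Check whether the json the server loads actually contains the key
# named in the [ERROR] line. No output means the key is missing.
grep cache_quant_bit \
    /data/openppl/ppl.pmx/model_zoo/llama/huggingface/llama_chinese_13b_ppl/pmx_params.json
```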

ZhangZhiPku commented 10 months ago

It seems to be saying that you are missing a key: cache_quant_bit
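For context, a pmx_params.json that the server can parse needs to contain that key. A minimal sketch of such a file is below; cache_quant_bit is the key from the error message, while every other field name and value is an assumption for illustration only (the authoritative file is the one the export step writes):

```json
{
  "num_layers": 40,
  "num_heads": 40,
  "hidden_dim": 5120,
  "vocab_size": 32000,
  "cache_quant_bit": 8,
  "cache_quant_group": 8
}
```

Here cache_quant_bit presumably controls KV-cache quantization (8 for 8-bit quantization, 0 to disable it).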

Alcanderian commented 9 months ago

You used the wrong json; you should use the json produced after export. convert turns the hf ckpt into a pmx ckpt, and export exports the pmx ckpt as pmx onnx files.
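In other words, the intended workflow has two steps, roughly as sketched below (script names, flags, and paths are illustrative assumptions, not the actual CLI; see the ppl.pmx model_zoo documentation for the exact commands):

```bash
# Step 1: convert -- turn the huggingface checkpoint into a pmx checkpoint.
python convert.py \
    --input_dir  /path/to/hf_llama_13b \
    --output_dir /path/to/llama_13b_pmx

# Step 2: export -- export the pmx checkpoint as pmx onnx files; this is
# the step whose output includes the pmx_params.json the server expects.
python export.py \
    --model_dir  /path/to/llama_13b_pmx \
    --export_dir /path/to/llama_13b_ppl
```

server_config.model_param_path should then point at the pmx_params.json written by step 2, not at anything produced by step 1.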