h2oai / h2ogpt

Private chat with a local GPT over documents, images, video, and more. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/

Can I use existing llama.cpp server as inference server? #1666

Closed · ChiNoel-osu closed this issue 2 weeks ago

ChiNoel-osu commented 4 weeks ago

I want to use my working local llama.cpp server as the inference server. I looked here and set --inference_server="http://localhost:8080/v1", but it doesn't work.
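For context, a minimal sketch of the kind of launch command involved (the base_model value is inferred from the log below; any other flags are omitted here):

python generate.py --base_model=gpt-3.5-turbo --inference_server="http://localhost:8080/v1"

A run like this produces the startup log and traceback below.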

HF Client Begin: http://localhost:8080/v1 gpt-3.5-turbo
Starting get_model: gpt-3.5-turbo http://localhost:8080/v1
GR Client Begin: http://localhost:8080/v1 gpt-3.5-turbo
Loaded as API: http://localhost:8080/v1/ ✔
GR Client Failed http://localhost:8080/v1 gpt-3.5-turbo: Could not fetch config for http://localhost:8080/v1/
HF Client Begin: http://localhost:8080/v1 gpt-3.5-turbo
Starting get_model: gpt-3.5-turbo http://localhost:8080/v1
GR Client Begin: http://localhost:8080/v1 gpt-3.5-turbo
Loaded as API: http://localhost:8080/v1/ ✔
GR Client Failed http://localhost:8080/v1 gpt-3.5-turbo: Could not fetch config for http://localhost:8080/v1/
HF Client Begin: http://localhost:8080/v1 gpt-3.5-turbo
Starting get_model: gpt-3.5-turbo http://localhost:8080/v1
GR Client Begin: http://localhost:8080/v1 gpt-3.5-turbo
Loaded as API: http://localhost:8080/v1/ ✔
GR Client Failed http://localhost:8080/v1 gpt-3.5-turbo: Could not fetch config for http://localhost:8080/v1/
HF Client Begin: http://localhost:8080/v1 gpt-3.5-turbo
Traceback (most recent call last):
  File "/home/fae/h2ogpt/generate.py", line 20, in <module>
    entrypoint_main()
  File "/home/fae/h2ogpt/generate.py", line 16, in entrypoint_main
    H2O_Fire(main)
  File "/home/fae/h2ogpt/src/utils.py", line 73, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "/home/fae/h2ogpt/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/fae/h2ogpt/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/fae/h2ogpt/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/fae/h2ogpt/src/gen.py", line 2347, in main
    model0, tokenizer0, device = get_model_retry(reward_type=False,
  File "/home/fae/h2ogpt/src/gen.py", line 2718, in get_model_retry
    model1, tokenizer1, device1 = get_model(**kwargs)
  File "/home/fae/h2ogpt/src/gen.py", line 3021, in get_model
    inference_server, gr_client, hf_client = get_client_from_inference_server(inference_server,
  File "/home/fae/h2ogpt/src/gen.py", line 2699, in get_client_from_inference_server
    res = hf_client.generate('What?', max_new_tokens=1)
  File "/home/fae/h2ogpt/venv/lib/python3.10/site-packages/text_generation/client.py", line 284, in generate
    raise parse_error(resp.status_code, payload)
text_generation.errors.NotFoundError: {'code': 404, 'message': 'File Not Found', 'type': 'not_found_error'}

The llama.cpp server only exposes completion and embedding routes; I don't know if that's the problem.
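One way to check which routes the server actually answers is to probe it directly; the endpoint names below are those of a recent llama.cpp server build (adjust if your build differs):

# health check and OpenAI-compatible model listing
curl http://localhost:8080/health
curl http://localhost:8080/v1/models
# native llama.cpp completion route
curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{"prompt": "Hello", "n_predict": 8}'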

pseudotensor commented 2 weeks ago

If llama.cpp is running in OpenAI-compatible mode, you should use the chat completions route and add the vllm_chat: prefix:

--inference_server="vllm_chat:http://localhost:8080/v1"
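If it still fails, it can help to confirm that the server answers OpenAI-style chat requests directly, for example (the model name "local" is just a placeholder; use whatever your llama.cpp server reports under /v1/models):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 8}'

With that working, the h2oGPT launch would look something like python generate.py --base_model=gpt-3.5-turbo --inference_server="vllm_chat:http://localhost:8080/v1" (base_model again assumed from the log above).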