karthink / gptel

A simple LLM client for Emacs
GNU General Public License v3.0

gptel should use model set in gptel-backend #152

Closed tvraman closed 5 months ago

tvraman commented 6 months ago
  1. With gptel-backend set:

     (setq-default gptel-backend
                   (gptel-make-ollama "Ollama"
                     :host "localhost:11434"
                     :models '("zephyr")
                     :stream t))

  2. M-x gptel continues to use the limited model list available in gptel-model -- it can't be changed in Customize, and so it breaks.
    1. It works once I set gptel-model explicitly, e.g. "llama2".
karthink commented 6 months ago

A gptel-backend corresponds to an LLM service (local or remote), which can provide multiple models, so the slot (gptel-backend-models some-backend) is a list. In your example, it's the list '("zephyr").

This is why a separate variable, gptel-model, is needed. The suggestion in the README for configuring Ollama is to set both explicitly:

(setq-default gptel-model "mistral:latest" ;Pick your default model
              gptel-backend (gptel-make-ollama "Ollama" :host ...))

I can make it so gptel-model is not limited to the OpenAI models in Customize, so you can use setopt etc. Is this a sufficient solution, or do you have any ideas for better structuring the model configuration?
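
Concretely, for the Ollama setup in your first message, the two settings together would look something like this (just a sketch; the model string has to be one of the entries in the backend's :models list):

(setq-default gptel-model "zephyr"   ;must match an entry in :models below
              gptel-backend (gptel-make-ollama "Ollama"
                              :host "localhost:11434"
                              :models '("zephyr")
                              :stream t))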

ParetoOptimalDev commented 6 months ago

I'm assuming their goal might be what mine is:

Not having to go into textgen, ollama, or llama.cpp to change the model, like I currently do.

If not, though, I can create a separate issue for that.

ParetoOptimalDev commented 6 months ago

Yeah, I just tried with the textgen backend:

  (defvar gptel--textgen-webui-openai
    (gptel-make-openai
     "textgen-webui--openai"
     :stream t
     :protocol "http"
     :host "localhost:5000"
     :models '("LoneStriker_Nous-Hermes-2-Yi-34B-3.0bpw-h6-exl2"
           "dolphin-2.1-mistral-7b.Q6_K.gguf"))
    "GPTel backend for textgen webui openai server")

It just uses the model that is loaded.

The UI implies to me though that it will load the model I select.

I guess it's actually just a bit of an impedance mismatch between local and hosted LLMs: locally, many times you start things yourself, or you may already have some server running.

I'm starting to see the value in using different models: say, dolphin generally, llama-code-34b for coding, etc.

I'm not sure what a good solution would look like here or if that's something you feel is appropriate for gptel.

karthink commented 6 months ago

It just uses the model that is loaded.

The UI implies to me though that it will load the model I select.

I'm confused here. Are you talking about the transient menu UI where you have to select a backend+model every time you want to change the model?

ParetoOptimalDev commented 6 months ago

Yes. My original thought was:

* add gptel--textgen-webui-openai

* add a list of models in its model directory

* the model I select in the gptel transient menu gets loaded

Thinking further, though, I can see problems with that.

So I can see why it wouldn't work for local LLMs... at least, I think it doesn't? I guess I'm asking for clarification.

karthink commented 6 months ago

Yes. My original thought was:

* add gptel--textgen-webui-openai

* add a list of models in its model directory

* the model I select in the gptel transient menu gets loaded

Isn't this how it works right now?

ParetoOptimalDev commented 6 months ago

If I don't explicitly go into text-generation-webui and load a model, I get an error in Emacs from gptel:

Debugger entered--Lisp error: (wrong-type-argument integer-or-marker-p nil)
  pulse-momentary-highlight-region(34 nil)
  gptel-curl--stream-cleanup(#<process gptel-curl> "exited abnormally with code 18\n")

And in text-generation-webui I also see:

Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/sse_starlette/sse.py", line 233, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/sse_starlette/sse.py", line 236, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/sse_starlette/sse.py", line 221, in stream_response
    async for data in self.body_iterator:
  File "/workspace/text-generation-webui/extensions/openai/script.py", line 127, in generator
    for resp in response:
  File "/workspace/text-generation-webui/extensions/openai/completions.py", line 521, in stream_chat_completions
    for resp in chat_completions_common(body, is_legacy, stream=True):
  File "/workspace/text-generation-webui/extensions/openai/completions.py", line 296, in chat_completions_common
    prompt = generate_chat_prompt(user_input, generate_params)
  File "/workspace/text-generation-webui/modules/chat.py", line 172, in generate_chat_prompt
    while len(messages) > 0 and get_encoded_length(prompt) > max_length:
  File "/workspace/text-generation-webui/modules/text_generation.py", line 160, in get_encoded_length
    return len(encode(prompt)[0])
  File "/workspace/text-generation-webui/modules/text_generation.py", line 119, in encode
    raise ValueError('No tokenizer is loaded')
ValueError: No tokenizer is loaded

Maybe the text-generation-webui OpenAI API server doesn't support this, or has a bug?

ParetoOptimalDev commented 6 months ago

Yeah, I guess it's a problem or lack of support on their end. I just tested with:

curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "dolphin-2.7-mixtral-8x7b.Q5_0.gguf",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'

Got the same error as above.

ParetoOptimalDev commented 6 months ago

Found this:

Best is to load the model using the Web UI. Once the model is loaded, the API will pick it up.

https://github.com/oobabooga/text-generation-webui/issues/4534#issuecomment-1814549672

karthink commented 6 months ago

The UI implies to me though that it will load the model I select.

I'm still confused about what the issue is. IIUC, choosing a model in gptel does not influence the model that the text-generation-webui actually uses. gptel can't ascertain the availability of the model at the endpoint specified in gptel-make-openai, especially since every backend (Ollama, GPT4All, and now text-generation-webui) uses a different method to make a model available.

Perhaps gptel could test if the backend is "online", so to speak, by sending a test message and noting the status of the response. This is the most we can reasonably do from Emacs. Starting and running these various local LLM providers (Ollama etc.) from Emacs is a whole other package's worth of details, and it would need constant upkeep too, as they're being actively developed.
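
A rough sketch of what such a check might look like, assuming an OpenAI-compatible /v1/models endpoint (a hypothetical helper, not something gptel provides):

(require 'url)

(defun my/gptel-backend-online-p (host)
  "Return non-nil if an OpenAI-compatible server at HOST answers.
HOST is a string like \"localhost:5000\".  This only checks that
/v1/models returns HTTP 200; it says nothing about which model (if
any) is actually loaded."
  (condition-case nil
      (with-current-buffer
          (url-retrieve-synchronously
           (format "http://%s/v1/models" host) t t 3)
        (goto-char (point-min))
        (prog1 (looking-at "HTTP/[0-9.]+ 200")
          (kill-buffer)))
    (error nil)))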

This is a separate issue from @tvraman's concern, which is yet to be resolved. For further discussion of this, please open a thread in discussions instead.

karthink commented 5 months ago

@tvraman Joao Tavora added "model sanitization" to gptel in 7b19cdf1. It now checks if gptel-model is one of the models specified in the backend, and if it isn't, sets it to the first valid model from the backend before sending the query.
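
In effect the check is roughly equivalent to something like this (a paraphrase, not the actual code from that commit):

(unless (member gptel-model (gptel-backend-models gptel-backend))
  (setq gptel-model (car (gptel-backend-models gptel-backend))))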