nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

Incompatible with continuedev chat and code completion #2174

Open lrq3000 opened 6 months ago

lrq3000 commented 6 months ago

Bug Report

I tried to use GPT4All as a local LLM server with an OpenAI-like API for serving as a code copilot via the continue plugin for VSCode.

Unfortunately, whatever I tried, it did not work.

The server is correctly detected and all models are correctly loaded (using the Continue prerelease). However, when trying to send any message to GPT4All from Continue, the response seems to be empty.

Yet when I run my own curl query it works, so I don't know how to debug this further.

I have tried Ollama and Koboldcpp (via their OpenAI-like APIs, with the same settings as for GPT4All apart from the ports), and both worked flawlessly.

This seems to me to be an incompatibility in the API: Continue expects something that GPT4All either does not provide, or does not provide in the expected format.

Steps to Reproduce

  1. Install GPT4All and enable the OpenAI-like API, change port to 8000, then restart.
  2. Install the continue extension in VSCode, switch to prerelease.
  3. In the Continue tab, click on the "+" at the bottom left of the panel to add a new server, then select "Other OpenAI-compatible API". Then select "Autodetect".
  4. Go back to where the "+" was, click on the button with the text just to its left, and select a model from GPT4All (it should appear as "OpenAI - name_of_model").
  5. Try to chat with the model from Continue (just input some text in the textbox at the top of the same panel). After some time to load, there will be no response, but no error either; it will just move on to the next textbox as if the model had answered correctly but with an empty response.

Expected Behavior

Continue should get non-empty responses from GPT4All.

Your Environment

zwilch commented 6 months ago

Can you use Wireshark on the loopback device to watch the communication, and show what happens between the client and GPT4All?
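
If the Wireshark GUI is cumbersome, tshark can capture the same thing from the command line. A minimal sketch, assuming the default GPT4All API port 4891 and a Linux loopback interface named lo (adjust both as needed):

# capture loopback traffic on the API port and decode it as HTTP
tshark -i lo -f "tcp port 4891" -Y http -O http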

Maybe these code examples will help you:

Node.js:

const OpenAI = require('openai');

const openai = new OpenAI({
  baseURL: 'http://127.0.0.1:4891/v1',
  apiKey: 'not needed for a local LLM',
});

async function main() {
  const completion = await openai.chat.completions.create(
    {
      messages: [{ role: 'user', content: 'Hello' }], // placeholder prompt
      model: 'Nous Hermes 2 Mistral DPO',
      max_tokens: 1024,
      n: 1,
      stop: null,
      temperature: 0.35,
      top_p: 0.75,
      stream: false,
    },
    { maxRetries: 5 },
  );

  console.log(completion);
  console.log(completion.choices);
}

main();

In the browser, with fetch:

const json_completion = JSON.stringify({
  stream: false,
  temperature: 0.6,
  max_tokens: 100,
  messages: [{ role: "user", content: "Hello" }],
  model: 'Nous Hermes 2 Mistral DPO'
});

const completions = await fetch("http://127.0.0.1:4891/v1/chat/completions", {
  keepalive: true,
  method: "POST",
  mode: "no-cors",
  // with this mode the request gets a response,
  // but for security reasons JS in the browser cannot access
  // the result of "await completions.json()"
  headers: {
    Accept: 'application/json',
    'Content-Type': 'application/json',
    'Access-Control-Allow-Origin': "*",
    'Access-Control-Allow-Headers': "*"
  },
  body: json_completion
});

const completionjson = await completions.json();
/* This is a problem: with mode: "no-cors" the response is opaque,
 * so calling "completions.json()" after the request finishes
 * results in an error like
 * Uncaught (in promise) SyntaxError: JSON.parse: unexpected end of data at line 1 column 1 of the JSON data
 * see https://stackoverflow.com/questions/54896998/how-to-process-fetch-response-from-an-opaque-type
 * Without mode: "no-cors" you will get an error like
 * XHROPTIONS http://127.0.0.1:4891/v1/chat/completions CORS Preflight Did Not Succeed
 */
console.log(completionjson);
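
A way around this is to run the same fetch outside the browser, where CORS does not apply. A minimal sketch for Node 18+ (which ships a global fetch), reusing the port and model name from above:

// run as an ES module (e.g. node fetch-test.mjs) for top-level await
const response = await fetch("http://127.0.0.1:4891/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "Nous Hermes 2 Mistral DPO",
    messages: [{ role: "user", content: "Hello" }],
    max_tokens: 100,
    stream: false,
  }),
});
// outside the browser the response is not opaque, so .json() works
console.log(await response.json());
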
lrq3000 commented 6 months ago

Thank you @zwilch. I am quite rusty with Wireshark, so I'm going to need some time to debug it adequately this way.

Nevertheless, I tried curl as an alternative to your two suggested approaches, and I think it already sheds some light on the issue.

Here is what GPT4All spits out:


$ curl http://localhost:5001/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "deepseek-coder-6.7b-instruct.Q8_0.gguf",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   489  100   354  100   135     66     25  0:00:05  0:00:05 --:--:--    85
{"choices":[{"finish_reason":"length","index":0,"message":{"content":"It seems like you forgot to say anything, could you please tell me again how","role":"assistant"}}],"created":1712489905,"id":"foobarbaz","model":"deepseek-coder-6.7b-instruct.Q8_0.gguf","object":"text_completion","usage":{"completion_tokens":16,"prompt_tokens":20,"total_tokens":36}}

I wrote before that it worked with curl. It does appear to, but only superficially: looking at the exact output, the quality is far below what we could expect, with gibberish sentences that often stop mid-sentence.

For comparison, here is what GPT4All outputs when the same model is queried from the GUI:

As an artificial intelligence, I don't have personal experiences or emotions like human beings do. Therefore, I am not named after individuals but rather by the programmers who designed me. My purpose is to assist users in providing information and answering questions based on my programming knowledge base. How can I help you today?

And here is what ollama outputs with the same model and prompt:

$ curl http://localhost:11434/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "deepseek-coder:6.7b-instruct-Q8_0",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'
{"id":"chatcmpl-721","object":"chat.completion","created":1712479256,"model":"deepseek-coder:latest","system_fingerprint":"fp_ollama","choices":[{"index":0,"message":{"role":"assistant","content":"As an AI Programming Assistant based on DeepSeek's model \"Deepseek Coder\", I don’t have a personal identity so it can be any person who has access to my features or services, such as the ability to respond in many languages.  My design is focused around providing help and information related to computer science topics within this context of AI programming assistant service. How may I assist you with your coding needs today?\n"},"finish_reason":"stop"}],"usage":{"prompt_tokens":76,"completion_tokens":91,"total_tokens":167}}

So it seems that it's not just a formatting issue: the GPT4All OpenAI-like API server does not respond to queries the same way. Maybe it forgets the default parameters? It outputs total gibberish and often stops mid-sentence.

So this issue is not only related to continuedev; the whole OpenAI-like API server function seems to be affected.

I am trying to test my hypothesis above that the default parameters are missing, but for the moment, when I pass the parameters explicitly, generation takes a seemingly infinite time.
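
For reference, this is the shape of the query I am trying; the parameter values below are only my guesses at the GUI defaults, not values I know to be correct:

curl http://localhost:5001/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "deepseek-coder-6.7b-instruct.Q8_0.gguf",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.4
  }'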

cosmic-snow commented 6 months ago

Sorry, I haven't read through everything here, but it might be a templates/parameters issue, so:

Note that many models don't work all that well if you don't provide them with the expected templates. I don't think these are added automatically to any of the web API endpoints. The parameters can have a big influence, too.

What you should try:

lrq3000 commented 5 months ago

@cosmic-snow Thank you for your suggestions. I will use them in future tests to improve replicability, but this is not a templating/parameters issue: the model works fine in GPT4All itself, and furthermore, inside Continue's chat the model does not output anything, whatever the prompt.

(PS: I know how to edit Continue's config file; I made it work with several models in koboldcpp, including the same model I am trying to use in GPT4All -- koboldcpp is also not supported by default in Continue and must be manually configured as an OpenAI-like API server.)
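
For reference, my Continue config entry looks roughly like this (a sketch; the title is arbitrary and the port must match the one set in GPT4All):

"models": [
  {
    "title": "GPT4All (local)",
    "provider": "openai",
    "model": "deepseek-coder-6.7b-instruct.Q8_0.gguf",
    "apiBase": "http://localhost:5001/v1"
  }
]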

cosmic-snow commented 5 months ago

... not a templating/parameters issue, as the model works very fine in GPT4All, and furthermore the issue inside Continue's chat is that it does not output anything, whatever the prompt.

Alright, but are you sure? I'm not all that familiar with the GUI's API server, although I've spent a bit of time with it recently. It's certainly possible that it's not entirely compatible and that something the continue plugin expects is not actually returned by the server.

That is, it definitely doesn't mimic the OpenAI API in full.

However, looking at the output of your previous comment again:

GPT4All response excerpt:

... "usage":{"completion_tokens":16,"prompt_tokens":20,"total_tokens":36}

ollama response excerpt:

... "usage":{"prompt_tokens":76,"completion_tokens":91,"total_tokens":167}

Note how many more prompt_tokens it reports for the ollama request, although your own input is the same in both cases. My hunch is that ollama applies the model's prompt template, whereas in GPT4All you'd have to do that manually.
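
To illustrate what applying a template manually would mean: you wrap the message content yourself before sending it. A sketch assuming an Alpaca-style instruct template (the exact template expected by deepseek-coder may differ):

curl http://localhost:5001/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "deepseek-coder-6.7b-instruct.Q8_0.gguf",
    "messages": [{"role": "user", "content": "### Instruction:\nHello! What is your name?\n### Response:\n"}]
  }'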

It's entirely possible that this isn't the only issue standing in the way of getting everything to work, though. You might also want to run curl -v once in case there's a problem with the HTTP headers (or use a web API tool that shows more details).

I'll probably have a look at the continue plugin when I have some time.

lrq3000 commented 5 months ago

I see, I missed this detail. I'll try to debug this further, but it's getting a bit beyond my current abilities; I need to train, and I'm not sure when I'll have time for that... But at least your indications are pointing me in the right direction. I'll post further comments if I figure out how to do it.

(NB: I wanted to use HTTP Toolkit but it didn't work; then I tried Wireshark, but for some reason I cannot see the exchange -- I must be doing something wrong. So what remains is Frida.re; I think it would be more effective if I could catch and manipulate all the exchanges.)

hyperstown commented 4 months ago

I tested a few different backends and I think the issue is that the server doesn't support streaming responses, while the continuedev extension requires them.

Every backend that worked returned a streaming response.

There is also a stream: true parameter in the incoming data:

{"messages":[{"role":"user","content":"hello"}],"model":"Llama 3 Instruct","max_tokens":1024,"stream":true}
zwilch commented 3 months ago

it should do streaming: https://docs.gpt4all.io/gpt4all_python.html#chatting-with-gpt4all

nebulous commented 3 months ago

The GPT4All v3.0.0 client has a "Server Chat" section which correctly shows the responses to queries received from VSCode/Continue as they arrive, but I can confirm that, at least when configured as the OP suggests, these responses don't make it back into Continue.

xieu90 commented 2 months ago

will there be any fix for this?

cosmic-snow commented 2 months ago

Sorry, last time I tried to really look into it I got held up, so I shelved it for a while.

I tested a few different backends and I think the issue is that the server doesn't support streaming responses, while the continuedev extension requires them.

Every backend that worked returned a streaming response.

True, the server mode currently doesn't implement streaming responses. If that's a hard requirement, then I guess this is the problem here.

will there be any fix for this?

I can't really say what the plans are right now, sorry. Improvements to the server mode are mentioned on the roadmap, however.

raja-jamwal commented 1 month ago

I got it working with a stopgap solution, https://github.com/continuedev/continue/pull/2097. I'll see if I can make changes to gpt4all to support SSE.

raja-jamwal commented 1 month ago

I've added support for SSE responses in this PR, https://github.com/nomic-ai/gpt4all/pull/2910, and tested it with the prod continue.dev version; it seems to be working.
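
For anyone who wants to verify the streaming path themselves, the openai Node client can consume it once the server speaks SSE. A quick sketch reusing the local baseURL from earlier in this thread (the model name is whatever you have loaded):

const OpenAI = require('openai');

const openai = new OpenAI({
  baseURL: 'http://127.0.0.1:4891/v1',
  apiKey: 'not needed for a local LLM',
});

async function main() {
  const stream = await openai.chat.completions.create({
    model: 'Nous Hermes 2 Mistral DPO',
    messages: [{ role: 'user', content: 'Hello' }],
    stream: true, // ask for an SSE stream of chat.completion.chunk objects
  });
  // print each token delta as it arrives
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
  }
}

main();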

lrq3000 commented 1 month ago

Awesome @raja-jamwal, thank you so much! I hope this gets merged soon! GPT4All is so much more efficient than other LLM runners such as ollama; I literally cannot run the best models my computer can handle with other runners.