josStorer / chatGPTBox

Integrating ChatGPT into your browser deeply, everything you need is here
MIT License

No memory past first conversation for local models? #631

Closed BarfingLemurs closed 5 months ago

BarfingLemurs commented 6 months ago

Describe the bug

None of the chat windows support continued conversations with local models. I'm not sure whether this is a bug or something that simply hasn't been implemented. Example: image

When using local model APIs like https://github.com/theroyallab/tabbyAPI, I was unable to continue a conversation; the model only receives my latest input as if it were the first message.
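
For reference, this is roughly what a continued conversation should look like at the API level for an OpenAI-compatible endpoint: every earlier turn is resent in the messages array. The port, model name, and placeholder assistant reply below are just examples, not my exact setup.

```ts
// Sketch of a "continued" request to an OpenAI-compatible endpoint.
// URL/port, model name and the assistant turn are placeholders.
const endpoint = "http://localhost:5001/v1/chat/completions";

const response = await fetch(endpoint, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "",
    stream: false,
    // The backend is stateless: "memory" only exists if the client
    // resends every earlier turn of the conversation.
    messages: [
      { role: "user", content: "hi" },
      { role: "assistant", content: "Hello! How can I help you today?" },
      { role: "user", content: "repeat that." },
    ],
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);
```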

To Reproduce: Using Firefox, enter the local URL: Screenshot_2024-02-21_14-07-24


josStorer commented 5 months ago

I think there may be an issue with the local model service you're using. I'm confident that ChatGPTBox passes history messages through the API. Perhaps you can check whether there are any log files.

BarfingLemurs commented 5 months ago

I have been trying some other local APIs to see if the problem is specific to one backend.

For the OpenAI-compatible API in koboldcpp, I can see that the previous messages aren't being sent.

kobold.cpp

Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

Input: {"messages": [{"role": "user", "content": "hi"}], "model": "", "stream": true, "max_tokens": 40000, "temperature": 1}

Processing Prompt (1 / 1 tokens)
Generating (40 / 2047 tokens)
(EOS token triggered!)
ContextLimit: 41/2048, Processing:0.83s (832.0ms/T), Generation:30.47s (761.8ms/T), Total:31.30s (782.5ms/T = 1.28T/s)
Output:  Question: What is 2704537848 to the power of 1/2, to the nearest integer?
Answer: 51931

Input: {"messages": [{"role": "user", "content": "repeat that."}], "model": "", "stream": true, "max_tokens": 40000, "temperature": 1}

Processing Prompt (0 / 0 tokens)
Generating (1 / 2047 tokens)
(EOS token triggered!)
ContextLimit: 2/2048, Processing:0.00s (0.0ms/T), Generation:0.00s (0.0ms/T), Total:0.00s (0.0ms/T = infT/s)
Output:

image

Here's a normal log of what should happen:

Input: {"n": 1, "max_context_length": 1600, "max_length": 120, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "genkey": "KCPP6614", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "presence_penalty": 0, "logit_bias": {}, "prompt": "\n### Instruction:\nmy favorite sport is soccer. repeat that.\n### Response:\n", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false}

Processing Prompt (22 / 22 tokens)
Generating (7 / 120 tokens)
(EOS token triggered!)
ContextLimit: 29/1600, Processing:16.62s (755.5ms/T), Generation:4.75s (679.3ms/T), Total:21.38s (3053.6ms/T = 0.33T/s)
Output: Your favorite sport is soccer.

Input: {"n": 1, "max_context_length": 1600, "max_length": 120, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "genkey": "KCPP5896", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "presence_penalty": 0, "logit_bias": {}, "prompt": "\n### Instruction:\nmy favorite sport is soccer. repeat that.\n### Response:\nYour favorite sport is soccer.\n### Instruction:\nrepeat that.\n### Response:\n", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false}

Processing Prompt (14 / 14 tokens)
Generating (7 / 120 tokens)
(EOS token triggered!)
ContextLimit: 49/1600, Processing:10.54s (752.9ms/T), Generation:4.80s (686.3ms/T), Total:15.35s (2192.1ms/T = 0.46T/s)
Output: Your favorite sport is soccer.

Input: {"n": 1, "max_context_length": 1600, "max_length": 120, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "genkey": "KCPP9206", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "presence_penalty": 0, "logit_bias": {}, "prompt": "\n### Instruction:\nmy favorite sport is soccer. repeat that.\n### Response:\nYour favorite sport is soccer.\n### Instruction:\nrepeat that.\n### Response:\nYour favorite sport is soccer.\n### Instruction:\ngood. say it once more.\n### Response:\n", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false}

Processing Prompt (18 / 18 tokens)
Generating (7 / 120 tokens)
(EOS token triggered!)
ContextLimit: 73/1600, Processing:13.57s (754.2ms/T), Generation:4.79s (684.4ms/T), Total:18.37s (2623.7ms/T = 0.38T/s)
Output: Your favorite sport is soccer.

image

Here are my logs with the llama.cpp server binary:

#### llama.cpp server

```
= 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: CPU buffer size = 3947.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CPU input buffer size = 10.01 MiB
llama_new_context_with_model: CPU compute buffer size = 72.00 MiB
llama_new_context_with_model: graph splits (measure): 1
Available slots:
 -> Slot 0 - max context: 512
{"timestamp":1709619932,"level":"INFO","function":"main","line":2713,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time = 7938.96 ms / 14 tokens ( 567.07 ms per token, 1.76 tokens per second)
print_timings: eval time = 33946.96 ms / 56 runs ( 606.20 ms per token, 1.65 tokens per second)
print_timings: total time = 41885.92 ms
slot 0 released (70 tokens in cache)
{"timestamp":1709620055,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":36862,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 59]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time = 6230.06 ms / 11 tokens ( 566.37 ms per token, 1.77 tokens per second)
print_timings: eval time = 30168.38 ms / 48 runs ( 628.51 ms per token, 1.59 tokens per second)
print_timings: total time = 36398.44 ms
slot 0 released (59 tokens in cache)
{"timestamp":1709620103,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":54752,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 110]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time = 9683.77 ms / 17 tokens ( 569.63 ms per token, 1.76 tokens per second)
print_timings: eval time = 3806.89 ms / 7 runs ( 543.84 ms per token, 1.84 tokens per second)
print_timings: total time = 13490.66 ms
slot 0 released (24 tokens in cache)
{"timestamp":1709620128,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":60440,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 120]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time = 6225.77 ms / 11 tokens ( 565.98 ms per token, 1.77 tokens per second)
print_timings: eval time = 41423.46 ms / 63 runs ( 657.52 ms per token, 1.52 tokens per second)
print_timings: total time = 47649.23 ms
slot 0 released (74 tokens in cache)
{"timestamp":1709620188,"level":"INFO","function":"log_server_request","line":2469,"message":"request","remote_addr":"127.0.0.1","remote_port":53026,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
```

![image](https://github.com/josStorer/chatGPTBox/assets/128182951/2e610cde-3c1d-4c41-b172-fcd7535ad025)

Are you able to reproduce it with any of the llama.cpp / ollama backends? Am I using the wrong API URL?

josStorer commented 5 months ago

Did you change your settings? It may be that your Max Conversation Length is set to zero.

image

BarfingLemurs commented 5 months ago

My settings seem OK; here is video footage of the issue: https://github.com/josStorer/chatGPTBox/assets/128182951/82639627-48e1-4e48-a064-a98eba96a3a0

Is this a Chrome issue or some other operating-system issue? I was actually able to use the extension on an Android phone with the Firefox browser; the automatic queries it makes alongside searches work great.

josStorer commented 5 months ago

Refresh the conversation page: do the history messages still exist? Press F12, open the Network tab, then select the completion request and click on Payload, and send me a screenshot.

BarfingLemurs commented 5 months ago

> Refresh the conversation page: do the history messages still exist?

No, newly created sessions do not persist after refreshing.

> then select the completion request and click on Payload

I'm not sure how to do this; let me know what to do next.

image

josStorer commented 5 months ago

> No, newly created sessions do not persist after refreshing.

This is not normal. If an answer is completed normally, the conversation page should save it correctly, and then when you continue the conversation, it will be sent as a history message.

If it disappears after refreshing, it means that this answer has not been considered complete. ChatGPTBox does not store or send failed or interrupted answers as history messages, which is the same situation you encountered.
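
To illustrate (this is a generic sketch of an OpenAI-compatible streaming client, not ChatGPTBox's exact code): an answer only counts as complete when the stream signals it, for example via a finish_reason or the [DONE] sentinel.

```ts
// Read an SSE stream from an OpenAI-compatible /v1/chat/completions request
// and report whether the answer finished cleanly. If the backend closes the
// stream without a finish_reason or "[DONE]", the answer looks interrupted.
async function readStream(resp: Response): Promise<{ text: string; complete: boolean }> {
  const reader = resp.body!.getReader();
  const decoder = new TextDecoder();
  let text = "";
  let complete = false;
  let buffer = "";

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep a partial line for the next chunk
    for (const line of lines) {
      if (!line.startsWith("data:")) continue;
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") { complete = true; continue; }
      const chunk = JSON.parse(payload);
      text += chunk.choices?.[0]?.delta?.content ?? "";
      if (chunk.choices?.[0]?.finish_reason) complete = true;
    }
  }
  return { text, complete };
}
```

A backend that never sends those markers, or drops the connection mid-stream, would leave the answer in the "interrupted" state described above.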

For me, using ollama, answers can be completed and stored correctly.

BarfingLemurs commented 5 months ago

Thank you, I hadn't tested ollama, but this works properly. image

Filling in the model name (e.g. "gemma:2b") is required for ollama to work, along with running export OLLAMA_ORIGINS=* on Linux (a minimal example is sketched below).

I will check those other APIs again.
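
For anyone else setting this up, here is a rough sketch of the ollama configuration above. It assumes ollama's default port and its OpenAI-compatible endpoint; adjust to your own setup.

```ts
// Server side (Linux), so the extension's origin is allowed:
//   export OLLAMA_ORIGINS=*
//   ollama serve
//
// Client side: the model field must name a pulled model, e.g. "gemma:2b".
// Endpoint/port assume ollama defaults (http://localhost:11434).
const resp = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gemma:2b",
    messages: [{ role: "user", content: "my favorite sport is soccer. repeat that." }],
  }),
});

const json = await resp.json();
console.log(json.choices[0].message.content);
```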

BarfingLemurs commented 5 months ago

Some notable issues with tabbyAPI (and other backends):

https://github.com/josStorer/chatGPTBox/assets/128182951/153974dc-5bb8-4aa2-93ef-536242c2508e

josStorer commented 5 months ago

"</" token is actually "</>", and it's rendered as a html element, so not displayed

josStorer commented 5 months ago

v2.5.1

BarfingLemurs commented 5 months ago

Thank you, the conversation is now stored properly and works with the local APIs mentioned, such as tabbyAPI!