Closed vlbosch closed 8 months ago
Behaviour is confirmed in nightly 275.
Also confirmed on nightly 276. I kept trying to get a response; after a few tries it finally produced a response using the context from the first message. I don't see why it couldn't respond to the first message, though:
The issue is still present on nightly 282; I also tried it with a smaller (14B) model. After the first prompt, Nitro appears to have crashed entirely and had to reload the model. Regenerate and continue got it to answer, though. Since it does respond to the given context, I would expect the response could be given after the first prompt.
The problem persists in nightly 0.4.7-291. I tried changing multiple parameters in the JSON (batch size for prompt eval, for example, to match the settings in LMStudio), and I also deleted all models and threads and manually imported them again. It doesn't seem to make a difference. Both Ollama and LMStudio, however, can answer the prompt, so it shouldn't be model- or hardware-constrained. Can I provide any more information so you can triage the issue better? app.log
I confirmed the issue is still present in nightly 298: Jan298.log
Hi @vlbosch, can you try using a smaller context length, e.g. 1024?
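(For anyone following along: if I read Jan's model config layout correctly, the context length is the ctx_len field under settings in the model's model.json, so a lower value would look roughly like the fragment below. The field names are my assumption based on the files Jan ships, so adjust them to your own model.json.)
"settings": {
  "ctx_len": 1024,
  "prompt_template": "..."
}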
@Van-QA Here is a log where I tried getting a response with various context lengths. When it does generate a response, it is incoherent and stops after a few sentences. Jan298context.log
I retried the same prompts in LMStudio and Ollama, and there the larger context length can be used. Or do they have some behind-the-scenes magic to mask the fact that the context length is actually too large? Also, DeepSeek 67B Q6 can be used in Jan with a 4096 context length without a problem, and it is in the same 'league' as Qwen 72B.
I tried to calculate the required VRAM on https://rahulschand.github.io/gpu_poor/ and Qwen 72B Q6, with a context length of 6192 and 2048 tokens to generate, should fit in my 128GB RAM easily:
Sorry @vlbosch, could you please share how you calculated the required VRAM? I only see eGPU options in the gpu_poor tool and would like to reproduce it. I'm investigating the app log in the meantime.
Hi @louis-jan, I did a new test because I think another application held a large memory lock that forced the context shifts you see in the last log. In this new log I don't see them anymore.
Unfortunately, this time it didn't produce any responses at all (latest nightly 304), not even the gibberish it produced yesterday. I retried with different parameters and also took some screenshots of the memory pressure at different context lengths: app.log
And here is the full calculation as per the link provided above. Changing to CPU or another GPU didn't change the outcome, so I figured the calculation is unambiguous.
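As an extra back-of-the-envelope check (my own rough numbers, assuming Q6_K at roughly 6.6 bits per weight, about 80 layers, hidden size 8192, full multi-head attention and an fp16 KV cache): the weights alone are about 72B × 6.6 / 8 ≈ 59 GB, and the KV cache is about 2 × 80 × 8192 × 2 bytes ≈ 2.5 MB per token, so roughly 15 GB at 6192 tokens. That comes to around 75 GB in total, which should still leave plenty of headroom in 128GB of unified memory.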
Thank you for the information above; it is helpful. However, I'm confused about why you selected an A6000 GPU. This is an eGPU spec, and you are using a Unified M3 architecture, which is not compatible according to the formula mentioned above.
How does changing -ngl affect the token speed and RAM consumption? What happens when you set ngl to 1?
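(Some context for other readers: ngl is the number of layers llama.cpp/Nitro offloads to the GPU, with the remaining layers running on the CPU, so a low value usually means slower tokens but a smaller GPU/Metal allocation. If my understanding of Jan's model.json schema is correct, it sits next to ctx_len in the settings block; a sketch with placeholder values:)
"settings": {
  "ctx_len": 4096,
  "ngl": 1
}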
No problem, please let me know if you need any more information. I selected multiple GPUs and also CPUs and the values didn't change; I wasn't aware the values aren't applicable to the unified Apple Silicon architecture, as I figured the disclaimer in the FAQ about ±15% was fairly universal.
Attached is a log for ngl=1. The model still doesn't respond: appngl1.log
Thanks for your active support. May I ask for the model.json file you are using to load the model?
It's in jan/models/Qwen1.5-72b-chat-Q6_k
No problem; I like using Jan, and Qwen1.5 is one of the best models currently available imho, so I am happy to help. Here is the requested model file: model.json
This behavior could be related to the default prompt_template in your model.json. Please change it like this and try again:
"prompt_template": "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant",
Sometimes the solution can be so simple... Thank you very much, this indeed seems to have solved the issue! Weird that it doesn't happen for all prompts and/or context lengths. Because of that, I would've never thought of it. Thanks again!
Sure, glad that you can now use the model perfectly.
Describe the bug The model loads correctly and has previously answered large questions with context. Now, after prompting, a blank response is shown without a clear error.
Steps to reproduce the behavior:
Expected behavior Either the generated response for the prompt, or a reason why generating a response is impossible.
Environment details
Operating System: macOS Sonoma 14.3
Jan Version: 0.4.6-274 nightly
Processor: M3 Max 16C/40C
RAM: 128GB unified RAM
Any additional relevant hardware specifics: 40-core GPU
Logs See: App-blankresponse.log
Additional context The prompt loads and is answered correctly with, for example, Ollama or Oobabooga using the same model and settings.