Closed vlbosch closed 8 months ago
Behaviour is confirmed in nightly 275.
Also confirmed on nightly 276. I kept trying to get a response; after a few tries it finally produced a response using the context from the first message. I don't see why it couldn't respond to the first message, though:
The issue is still present on nightly 282; I also tried it with a smaller (14B) model. After the first prompt, Nitro appears to have crashed entirely and had to reload the model. Regenerate and continue got it to answer, though. Since it does respond to the given context, I would expect the response could be given after the first prompt.
The problem persists in nightly 0.4.7-291. I tried changing multiple parameters in the JSON (batch size for prompt eval, for example, to match the settings in LMStudio), and I also deleted all models and threads and manually imported them again. It doesn't seem to make a difference. Both Ollama and LMStudio, however, can answer the prompt, so it shouldn't be model- or hardware-constrained. Can I provide any more information so you can triage the issue better? app.log
I confirmed the issue is still present in nightly 298: Jan298.log
Hi @vlbosch, can you try using a smaller context length, e.g. 1024?
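(For anyone following along: if I read Jan's model config layout correctly, the context length is the ctx_len field under settings in the model's model.json, so a lower value would look roughly like the fragment below. The field names are my assumption based on the files Jan ships, so adjust them to your own model.json.)
"settings": {
  "ctx_len": 1024,
  "prompt_template": "..."
}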
@Van-QA Here is a log where I tried getting a response with various context lengths. When it does generate a response, it is incoherent and stops after a few sentences. Jan298context.log
I retried the same prompts in LMStudio and Ollama, and there the larger context length can be used. Or do they have some behind-the-scenes magic to mask the fact that the context length is actually too large? Also, DeepSeek 67B Q6 can be used in Jan with a 4096 context length without a problem, and it is in the same 'league' as Qwen 72B.
I tried to calculate the required VRAM on https://rahulschand.github.io/gpu_poor/ and Qwen 72B Q6, with a context length of 6192 and 2048 tokens to generate, should fit in my 128GB RAM easily:
Sorry @vlbosch, could you please share how you calculated the required VRAM? I only see eGPU options in the gpu_poor tool and would like to reproduce it. I'm investigating the app log in the meantime.
Hi @louis-jan, I did a new test because I think another application held a large memory lock that forced the context shifts you see in the last log. In this new log I don't see them anymore.
Unfortunately, this time it didn't produce any responses at all (latest nightly 304), not even the gibberish it produced yesterday. I retried with different parameters and also took some screenshots of the memory pressure at different context lengths: app.log
And here is the full calculation as per the link provided above. Changing to CPU or another GPU didn't change the outcome, so I figured the calculation is unambiguous.
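As an extra back-of-the-envelope check (my own rough numbers, assuming Q6_K at roughly 6.6 bits per weight, about 80 layers, hidden size 8192, full multi-head attention and an fp16 KV cache): the weights alone are about 72B × 6.6 / 8 ≈ 59 GB, and the KV cache is about 2 × 80 × 8192 × 2 bytes ≈ 2.5 MB per token, so roughly 15 GB at 6192 tokens. That comes to around 75 GB in total, which should still leave plenty of headroom in 128GB of unified memory.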
Thank you for the information above; it is helpful. However, I'm confused about why you selected an A6000 GPU. This is an eGPU spec, and you are using a Unified M3 architecture, which is not compatible according to the formula mentioned above.
How does changing -ngl affect the token speed and RAM consumption? What happens when you set ngl to 1?
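(Some context for other readers: ngl is the number of layers llama.cpp/Nitro offloads to the GPU, with the remaining layers running on the CPU, so a low value usually means slower tokens but a smaller GPU/Metal allocation. If my understanding of Jan's model.json schema is correct, it sits next to ctx_len in the settings block; a sketch with placeholder values:)
"settings": {
  "ctx_len": 4096,
  "ngl": 1
}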
No problem, please let me know if you need any more information. I selected multiple GPUs and also CPUs and the values didn't change; I wasn't aware the values aren't applicable to the unified Apple Silicon architecture, as I figured the disclaimer in the FAQ about ±15% was fairly universal.
Attached is a log for ngl=1. The model still doesn't respond: appngl1.log
Thanks for your active support. May I ask for the model.json file you are using to load the model?
It's in jan/models/Qwen1.5-72b-chat-Q6_k
No problem; I like using Jan, and Qwen1.5 is one of the best models currently available imho, so I am happy to help. Here is the requested model file: model.json
This behavior could be related to the default prompt_template in your model.json. Please change it like this and try again:
"prompt_template": "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant",
Sometimes the solution can be so simple... Thank you very much, this indeed seems to have solved the issue! Weird that it doesn't happen for all prompts and/or context lengths. Because of that, I would've never thought of it. Thanks again!
Sure, glad that you can now use the model perfectly.
Describe the bug The model loads correctly and has previously answered large questions with context. Now, after prompting, a blank response is shown without a clear error.
Steps to reproduce the behavior:
Expected behavior Either the generated response for the prompt, or a reason why generating a response is impossible.
Environment details
Operating System: macOS Sonoma 14.3
Jan Version: 0.4.6-274 nightly
Processor: M3 Max 16C/40C
RAM: 128GB unified RAM
Any additional relevant hardware specifics: 40-core GPU
Logs See: App-blankresponse.log
Additional context The prompt loads and is answered correctly with, for example, Ollama or Oobabooga using the same model and settings.