janhq / cortex

Drop-in, local AI alternative to the OpenAI stack. Multi-engine (llama.cpp, TensorRT-LLM, ONNX). Powers 👋 Jan
https://cortex.so
Apache License 2.0
1.81k stars 98 forks

epic: Better Cortex Error Handling and Hardware Compatibility Checks #369

Open 0xSage opened 6 months ago

0xSage commented 6 months ago

Spec

https://www.notion.so/jan-ai/Better-Error-Handling-and-Hardware-Compatibility-Checks-3d6944c3fb7b41429dee3e44432dc3a8?pvs=4

Update:

In GitHub issues, search for Nitro: it is currently the number one root cause of the majority of user experience issues when using Jan.

Let's focus on quality improvements to Nitro for a few sprints.


Previous Context

Problem: Currently, Nitro just fails. The process terminates with no clear reason. Debugging entails asking users for 1000 lines of logs and making a best guess at where things went wrong.

There are 3 failure states:

  1. User system setup/dependency issues, see https://github.com/janhq/jan/issues/1683
  2. User input validation errors (to Alan's point, this introduces additional logic and latency, so we may choose to prioritize 2)
  3. Esoteric Nitro internal errors from llama.cpp, Drogon, or any of its subroutines.

Ideal:

Tasklist

Context: Also @Alan can you queue up an epic/feat for nitro error handling and graceful failures? i.e. people want Nitro to expect common errors, define clear error enums, and fail gracefully, propagating errors to whatever is dependent on it
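To make the "clear error enums" idea concrete, here is a minimal C++ sketch of what such an enum and a propagated error value might look like. It is an illustration only, not Nitro's actual code: the names `NitroErrorCode`, `NitroError`, `validate_context_length`, and `to_payload`, as well as the `ctx_len` wording in the message, are hypothetical. The three non-OK categories mirror the three failure states listed above.

```cpp
#include <string>

// Hypothetical error categories, one per failure state listed in this issue.
enum class NitroErrorCode {
    Ok = 0,
    SystemDependencyMissing,  // 1. user system setup / dependency issues
    InvalidUserInput,         // 2. user input validation errors
    EngineInternal            // 3. internal errors from llama.cpp, Drogon, etc.
};

// Hypothetical error value that a caller can inspect instead of the process just dying.
struct NitroError {
    NitroErrorCode code;
    std::string message;
    bool ok() const { return code == NitroErrorCode::Ok; }
};

// Stable string form of each code, so dependents can match on it.
inline const char* to_string(NitroErrorCode code) {
    switch (code) {
        case NitroErrorCode::Ok:                      return "ok";
        case NitroErrorCode::SystemDependencyMissing: return "system_dependency_missing";
        case NitroErrorCode::InvalidUserInput:        return "invalid_user_input";
        case NitroErrorCode::EngineInternal:          return "engine_internal";
    }
    return "unknown";
}

// Example validation step (assumed parameter name and limits, for illustration).
inline NitroError validate_context_length(int requested, int max_supported) {
    if (requested <= 0 || requested > max_supported) {
        return {NitroErrorCode::InvalidUserInput,
                "ctx_len must be in [1, " + std::to_string(max_supported) + "]"};
    }
    return {NitroErrorCode::Ok, ""};
}

// Renders the error as a small JSON-like payload for whatever depends on Nitro.
// Note: the message is not escaped here; a real implementation would need to do that.
inline std::string to_payload(const NitroError& e) {
    return std::string("{\"code\":\"") + to_string(e.code) +
           "\",\"message\":\"" + e.message + "\"}";
}
```

A caller would check `ok()` on the returned value and, on failure, forward `to_payload(...)` in the response rather than terminating, so the dependent application sees a category and a message instead of a silent exit.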

tikikun commented 5 months ago

I want to raise a few points.

Nitro does have error messages; otherwise we wouldn't be able to debug from user logs in the first place.

Nitro crashing or failing can be an advantage, since it can be re-opened as a new process. We need to make sure https://github.com/janhq/nitro/pull/367 is done properly so that this can be handled with a clear separation of concerns.

We are highly dependent on a few low-level projects (llama.cpp, whisper, and ggml), each of which has its own way of handling errors. Handling errors at that low level means we would also need to maintain those dependencies.

tikikun commented 5 months ago

We must solve the issues in the right priority order: nitro-node itself should first be isolated as a process manager for Nitro before we move on to fixing harder issues.

tikikun commented 5 months ago

Hi @Van-QA, since we have already resolved the nitro-node issue (pending @louis-jan's confirmation; you can check with him), only one issue remains: the way llama.cpp handles errors using abort(). Can we just use this epic for that? @louis-jan is quite familiar with this issue and can provide recommendations; we can fix it directly in jan.cpp.
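For readers following the abort() discussion, a rough C++ sketch of how a supervising process on a POSIX system can tell an abort() apart from a normal exit is shown here. It is not the nitro-node or jan.cpp implementation: `ExitKind` and `run_isolated` are hypothetical names, and the sketch assumes `fork` and `waitpid` are available (Linux/macOS).

```cpp
#include <cerrno>
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical classification of how an isolated worker ended.
enum class ExitKind { CleanExit, ErrorExit, Aborted, OtherSignal, SpawnFailed };

// Runs `work` in a forked child process and reports how that child terminated.
// `work` must be callable with no arguments and return an int exit code.
template <typename Fn>
ExitKind run_isolated(Fn work) {
    pid_t pid = fork();
    if (pid < 0) return ExitKind::SpawnFailed;

    if (pid == 0) {
        // Child: do the risky work; an abort() in here kills only this process.
        int rc = work();
        _exit(rc);
    }

    // Parent: wait for the child, retrying if the wait is interrupted by a signal.
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return ExitKind::SpawnFailed;
    }

    if (WIFSIGNALED(status)) {
        // abort() raises SIGABRT, which shows up as the terminating signal.
        return WTERMSIG(status) == SIGABRT ? ExitKind::Aborted
                                           : ExitKind::OtherSignal;
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) return ExitKind::CleanExit;
    return ExitKind::ErrorExit;
}
```

With something like this, a process manager can map `ExitKind::Aborted` to a clear error for the user and decide whether to restart the worker, instead of treating every termination the same way.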

Van-QA commented 5 months ago

Added the task list based on guidance from Louis. Thanks.