khoj-ai / khoj

Your AI second brain. Self-hostable. Get answers from the web or your docs. Build custom agents, schedule automations, do deep research. Turn any online or local LLM into your personal, autonomous AI (e.g. gpt, claude, gemini, llama, qwen, mistral).
https://khoj.dev
GNU Affero General Public License v3.0
14.77k stars · 735 forks

Segmentation Fault in Offline Chat Functionality #367

Closed YungBricoCoop closed 1 year ago

YungBricoCoop commented 1 year ago

Issue Description:

When using the chat functionality in offline mode on a MacBook Air with an M2 chip, the backend crashes entirely due to a segmentation fault. This problem appears to be specific to the offline chat feature, as the search functionality remains unaffected and operates as expected.

Additional Details:

System: MacBook Air (M2 chip), macOS Ventura 13.4.1 (22F82)
Software: Python 3.11.4

Terminal output:

[16:58:47] INFO     127.0.0.1:49463 - "GET                       h11_impl.py:431
                    /api/chat?q=Qu%27est%20que%20k2&n=6&client=o                
                    bsidian&stream=true HTTP/1.1" 200                           
           INFO     127.0.0.1:49466 - "GET /config HTTP/1.1" 200 h11_impl.py:431
[1]    821 segmentation fault  khoj
/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
clehene commented 1 year ago

Seeing the same with M1 Max, Ventura 13.5

zakirullin commented 1 year ago

Same, segmentation fault with the offline chat functionality:

/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

1) Open chat
2) Send any message
3) The bot replies with just "🤔"
4) Nothing else happens; you can wait indefinitely
5) Send a second message
6) Segmentation fault

It seems like it hadn't finished processing the first request, and upon the second one some semaphores weren't released. I use Khoj with offline markdown files (~3k entries).

Apple M1 Pro, Ventura 13.4.1

P.S. Okay, I get it now. The reply simply takes longer than 44s, so I just had to wait longer.

gramster commented 1 year ago

Not sure if it's the same issue, but I see a crash too on an M1 Mac; it looks like this thread:

Thread 17 Crashed:
0   libllamamodel-mainline-default.dylib   0x4d40388d0   ggml_compute_forward + 13820
1   libllamamodel-mainline-default.dylib   0x4d4034dfc   ggml_graph_compute + 2036
2   libllamamodel-mainline-default.dylib   0x4d401e450   llama_eval_internal(llama_context&, int const*, int, int, int, char const*) + 2444
3   libllamamodel-mainline-default.dylib   0x4d401da3c   llama_eval + 28
4   libllamamodel-mainline-default.dylib   0x4d400f860   LLamaModel::evalTokens(LLModel::PromptContext&, std::__1::vector<int, std::__1::allocator<int>> const&) const + 64
5   libllamamodel-mainline-default.dylib   0x4d4010bd0   LLModel::prompt(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::function<bool (int)>, std::__1::function<bool (int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)>, std::__1::function<bool (bool)>, LLModel::PromptContext&) + 1372
6   libllmodel.dylib                        0x107c1a4a4   llmodel_prompt + 588
7   libffi.dylib                            0x1a74a0050   ffi_call_SYSV + 80
8   libffi.dylib                            0x1a74a8af8   ffi_call_int + 1208
9   libffi.dylib                            0x1a74a8af8   ffi_call_int + 1208
10  libffi.dylib                            0x1a74a8af8   ffi_call_int + 1208

sabaimran commented 1 year ago

Ah! I finally have a reproduction of this error. I really appreciate the detailed responses in this thread that helped me root cause it. It's exactly that -- the crash happens when a second query is sent to the LLM while the first one is still being processed.

I'm releasing a bunch of perf improvements to offline chat in #393 that should make responses faster and more reliable, I hope. They will also make it clearer that Llama/Khoj is still processing the request.

Ideally, there should be some way to determine whether the model is occupied. I'll investigate that.
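For anyone hitting this in the meantime, the crash trace above suggests concurrent calls into the underlying llama bindings. Below is a minimal sketch of one way to serialize access to the offline model, assuming a GPT4All-style object with a `generate()` method; the names `OfflineChatGuard`, `model`, and the busy/blocking behaviour are illustrative assumptions, not Khoj's actual implementation.

```python
import threading


class OfflineChatGuard:
    """Serialize access to a single offline LLM instance.

    Assumption: the underlying llama bindings are not safe to call
    concurrently, so a second prompt either waits for the first one
    to finish or is rejected immediately.
    """

    def __init__(self, model):
        self.model = model              # e.g. a gpt4all.GPT4All instance (illustrative)
        self._lock = threading.Lock()

    @property
    def busy(self) -> bool:
        # True while another prompt is still being processed
        return self._lock.locked()

    def chat(self, prompt: str, block: bool = True) -> str:
        # Either wait for the running prompt to finish, or fail fast
        if not self._lock.acquire(blocking=block):
            raise RuntimeError("Offline chat model is busy; try again later")
        try:
            return self.model.generate(prompt)
        finally:
            self._lock.release()
```

With a guard like this, the `/api/chat` handler could check `busy` and return an explicit "model is still responding" message instead of forwarding a second prompt into `llama_eval` while the first one is running.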