At what point of the setup process does this fail? I think the error may be because the machine ran out of (V)RAM while loading the Llama model into memory. How much RAM or GPU memory does the machine have?
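In case it helps, a quick way to check free GPU memory from Python; a minimal sketch, assuming PyTorch with CUDA is installed (nvidia-smi gives the same information):

```python
import torch

# Free and total memory, in bytes, on the current CUDA device
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 2**30:.2f} GiB / total: {total / 2**30:.2f} GiB")
```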
16 GB of RAM, 6 GB of VRAM
(base) [deutschegabanna@arch ~]$ khoj
[19:50:28] INFO 🌘 Starting Khoj main.py:78
INFO 💬 Setting up conversation processor configure.py:147
INFO 🔍 📜 Setting up text search model indexer.py:172
INFO 🔍 🌄 Setting up image search model indexer.py:176
[19:50:29] INFO 📬 Initializing content index... configure.py:78
INFO Loading content from existing embeddings... indexer.py:408
INFO 💎 Loading markdown notes indexer.py:419
[19:50:30] INFO 🌖 Khoj is ready to use main.py:100
INFO Started server process [14386] server.py:75
INFO Waiting for application startup. on.py:45
INFO Application startup complete. on.py:59
INFO Uvicorn running on http://127.0.0.1:42110 (Press CTRL+C to quit) server.py:206
[19:50:40] INFO 127.0.0.1:38852 - "GET /config HTTP/1.1" 200 h11_impl.py:431
[19:50:41] INFO 127.0.0.1:38852 - "GET /assets/pico.min.css HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38858 - "GET /assets/khoj.css HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38852 - "GET /assets/icons/khoj-logo-sideways-500.png HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38858 - "GET /assets/icons/github.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38872 - "GET /assets/icons/notion.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38884 - "GET /assets/icons/confirm-icon.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38874 - "GET /assets/icons/markdown.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38898 - "GET /assets/icons/org.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38872 - "GET /assets/icons/pdf.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38858 - "GET /assets/icons/plaintext.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38852 - "GET /assets/icons/openai-logomark.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38884 - "GET /assets/icons/chat.svg HTTP/1.1" 200 h11_impl.py:431
[19:50:45] INFO 💬 Setting up conversation processor configure.py:147
Found model file at /home/deutschegabanna/.cache/gpt4all/llama-2-7b-chat.ggmlv3.q4_0.bin
llama.cpp: loading model from /home/deutschegabanna/.cache/gpt4all/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5,0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615,73 MB
llama_model_load_internal: mem required = 4013,73 MB (+ 1024,00 MB per state)
llama_new_context_with_model: kv self size = 1024,00 MB
llama_new_context_with_model: max tensor size = 70,31 MB
llama.cpp: using Vulkan on NVIDIA GeForce GTX 1660 Ti
[19:50:48] INFO 127.0.0.1:38898 - "POST /api/config/data/processor/conversation/offline_chat?enable_offline_chat=true HTTP/1.1" 200 h11_impl.py:431
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 5.80 GiB of which 17.12 MiB is free. Including
non-PyTorch memory, this process has 5.08 GiB memory in use. Of the allocated memory 75.45 MiB is allocated by PyTorch, and 6.55 MiB is reserved by
PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for
Memory Management and PYTORCH_CUDA_ALLOC_CONF
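The tail end of that message suggests tuning the caching allocator via PYTORCH_CUDA_ALLOC_CONF. For reference, a minimal sketch of how you would set it; note it helps with fragmentation, not with a model that simply doesn't fit:

```python
import os

# Must be set before CUDA is first initialized, i.e. before any GPU allocation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported afterwards so the allocator picks the setting up
```

The same setting also works as a shell environment variable when launching khoj.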
OK, so it does run out of GPU memory.
Oh my goodness... 5GB?
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1154 G /usr/lib/Xorg 210MiB |
| 0 N/A N/A 1250 G /usr/bin/gnome-shell 97MiB |
| 0 N/A N/A 1854 G ...ures=SpareRendererForSitePerProcess 67MiB |
| 0 N/A N/A 3982 G /usr/bin/cool-retro-term 100MiB |
| 0 N/A N/A 14860 G /usr/lib/epiphany-search-provider 1MiB |
| 0 N/A N/A 15108 G /usr/bin/kgx 120MiB |
| 0 N/A N/A 15151 C+G ...utschegabanna/miniconda3/bin/python 5044MiB |
+---------------------------------------------------------------------------------------+
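Those numbers line up with the llama.cpp log above; a back-of-envelope check:

```python
# Figures taken from the llama.cpp log earlier in this thread
weights_mb = 4013.73   # "mem required"
kv_cache_mb = 1024.00  # "+ 1024,00 MB per state", the KV cache
print(weights_mb + kv_cache_mb)  # ~5037.7 MB, roughly the 5044 MiB nvidia-smi reports
```

Add the ~600 MiB the desktop processes are already holding and the 5.80 GiB card is effectively full.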
Yeah, it is a big model. We could of course mitigate this by letting you disable GPU usage when loading the model, though the trade-off is much slower inference. That option is added by this commit: https://github.com/khoj-ai/khoj/commit/9677eae79192aed2171a433f4ae4d9adff7afba1 (still pre-release).
If you want to try it, you can run pip install --pre khoj-assistant and then start khoj with the --disable-chat-on-gpu flag.
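For the curious, the behaviour behind that flag boils down to a GPU-first load with a CPU fallback. A minimal sketch of the idea, not the actual Khoj code (loader here stands in for whatever constructs the chat model):

```python
import torch

def load_chat_model(loader, disable_gpu: bool = False):
    """Try the GPU first; fall back to the CPU when VRAM is insufficient."""
    if not disable_gpu and torch.cuda.is_available():
        try:
            return loader(device="cuda")
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release any partial allocation before retrying
    return loader(device="cpu")
```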
Closing this issue, as we now have a way to manually switch to the CPU for offline chat. This should help machines that don't have enough GPU memory to load the chat model into VRAM. Of course, feel free to re-open this issue if the manual switch-over didn't work.
I know what this error means - Error allocating memory ErrorOutOfDeviceMemory. Obviously it thinks it's run out of disk space. But when I check df on ~/miniconda, where I installed Khoj, and, to be absolutely sure, on ALL other drives anyway, there should be plenty of space to load.
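A Python equivalent of that df check, for reference; though as the thread above shows, ErrorOutOfDeviceMemory is Vulkan's out-of-GPU-memory error, so disk space was never the constraint:

```python
import shutil

# Equivalent of `df` for the filesystem containing a given path
usage = shutil.disk_usage("/home")
print(f"{usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")
```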