At what point of the setup process does this fail? I think the error may be because the machine ran out of (V)RAM while loading the Llama model into memory. How much RAM or GPU memory does the machine have?
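In case it helps, a quick way to check free GPU memory from Python; a minimal sketch, assuming PyTorch with CUDA is installed (nvidia-smi gives the same information):

```python
import torch

# Free and total memory, in bytes, on the current CUDA device
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 2**30:.2f} GiB / total: {total / 2**30:.2f} GiB")
```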
16 GB of RAM, 6 GB of VRAM
(base) [deutschegabanna@arch ~]$ khoj
[19:50:28] INFO 🌘 Starting Khoj main.py:78
INFO 💬 Setting up conversation processor configure.py:147
INFO 🔍 📜 Setting up text search model indexer.py:172
INFO 🔍 🌄 Setting up image search model indexer.py:176
[19:50:29] INFO 📬 Initializing content index... configure.py:78
INFO Loading content from existing embeddings... indexer.py:408
INFO 💎 Loading markdown notes indexer.py:419
[19:50:30] INFO 🌖 Khoj is ready to use main.py:100
INFO Started server process [14386] server.py:75
INFO Waiting for application startup. on.py:45
INFO Application startup complete. on.py:59
INFO Uvicorn running on http://127.0.0.1:42110 (Press CTRL+C to quit) server.py:206
[19:50:40] INFO 127.0.0.1:38852 - "GET /config HTTP/1.1" 200 h11_impl.py:431
[19:50:41] INFO 127.0.0.1:38852 - "GET /assets/pico.min.css HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38858 - "GET /assets/khoj.css HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38852 - "GET /assets/icons/khoj-logo-sideways-500.png HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38858 - "GET /assets/icons/github.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38872 - "GET /assets/icons/notion.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38884 - "GET /assets/icons/confirm-icon.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38874 - "GET /assets/icons/markdown.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38898 - "GET /assets/icons/org.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38872 - "GET /assets/icons/pdf.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38858 - "GET /assets/icons/plaintext.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38852 - "GET /assets/icons/openai-logomark.svg HTTP/1.1" 200 h11_impl.py:431
INFO 127.0.0.1:38884 - "GET /assets/icons/chat.svg HTTP/1.1" 200 h11_impl.py:431
[19:50:45] INFO 💬 Setting up conversation processor configure.py:147
Found model file at /home/deutschegabanna/.cache/gpt4all/llama-2-7b-chat.ggmlv3.q4_0.bin
llama.cpp: loading model from /home/deutschegabanna/.cache/gpt4all/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5,0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615,73 MB
llama_model_load_internal: mem required = 4013,73 MB (+ 1024,00 MB per state)
llama_new_context_with_model: kv self size = 1024,00 MB
llama_new_context_with_model: max tensor size = 70,31 MB
llama.cpp: using Vulkan on NVIDIA GeForce GTX 1660 Ti
[19:50:48] INFO 127.0.0.1:38898 - "POST /api/config/data/processor/conversation/offline_chat?enable_offline_chat=true HTTP/1.1" 200 h11_impl.py:431
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 5.80 GiB of which 17.12 MiB is free. Including
non-PyTorch memory, this process has 5.08 GiB memory in use. Of the allocated memory 75.45 MiB is allocated by PyTorch, and 6.55 MiB is reserved by
PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for
Memory Management and PYTORCH_CUDA_ALLOC_CONF
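The tail end of that message suggests tuning the caching allocator via PYTORCH_CUDA_ALLOC_CONF. For reference, a minimal sketch of how you would set it; note it helps with fragmentation, not with a model that simply doesn't fit:

```python
import os

# Must be set before CUDA is first initialized, i.e. before any GPU allocation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported afterwards so the allocator picks the setting up
```

The same setting also works as a shell environment variable when launching khoj.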
OK, so it does run out of GPU memory.
Oh my goodness... 5GB?
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1154 G /usr/lib/Xorg 210MiB |
| 0 N/A N/A 1250 G /usr/bin/gnome-shell 97MiB |
| 0 N/A N/A 1854 G ...ures=SpareRendererForSitePerProcess 67MiB |
| 0 N/A N/A 3982 G /usr/bin/cool-retro-term 100MiB |
| 0 N/A N/A 14860 G /usr/lib/epiphany-search-provider 1MiB |
| 0 N/A N/A 15108 G /usr/bin/kgx 120MiB |
| 0 N/A N/A 15151 C+G ...utschegabanna/miniconda3/bin/python 5044MiB |
+---------------------------------------------------------------------------------------+
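Those numbers line up with the llama.cpp log above; a back-of-envelope check:

```python
# Figures taken from the llama.cpp log earlier in this thread
weights_mb = 4013.73   # "mem required"
kv_cache_mb = 1024.00  # "+ 1024,00 MB per state", the KV cache
print(weights_mb + kv_cache_mb)  # ~5037.7 MB, roughly the 5044 MiB nvidia-smi reports
```

Add the ~600 MiB the desktop processes are already holding and the 5.80 GiB card is effectively full.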
Yeah, it is a big model. We could of course mitigate this by letting you disable GPU usage when loading the model, though the trade-off is much slower inference. That option is added by this commit: https://github.com/khoj-ai/khoj/commit/9677eae79192aed2171a433f4ae4d9adff7afba1 (still pre-release).
If you want to try it, you can run pip install --pre khoj-assistant and then start khoj with the --disable-chat-on-gpu flag.
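For the curious, the behaviour behind that flag boils down to a GPU-first load with a CPU fallback. A minimal sketch of the idea, not the actual Khoj code (loader here stands in for whatever constructs the chat model):

```python
import torch

def load_chat_model(loader, disable_gpu: bool = False):
    """Try the GPU first; fall back to the CPU when VRAM is insufficient."""
    if not disable_gpu and torch.cuda.is_available():
        try:
            return loader(device="cuda")
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release any partial allocation before retrying
    return loader(device="cpu")
```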
Closing this issue, as we now have a way to manually switch to the CPU for offline chat. This should help machines that don't have enough GPU memory to load the chat model into VRAM. Of course, feel free to re-open this issue if the manual switch-over didn't work.
I know what this error means - Error allocating memory ErrorOutOfDeviceMemory. Obviously it thinks it's run out of disk space. But when I check df on ~/miniconda, where I installed Khoj, and, to be absolutely sure, on ALL other drives anyway, there should be plenty of space to load.
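A Python equivalent of that df check, for reference; though as the thread above shows, ErrorOutOfDeviceMemory is Vulkan's out-of-GPU-memory error, so disk space was never the constraint:

```python
import shutil

# Equivalent of `df` for the filesystem containing a given path
usage = shutil.disk_usage("/home")
print(f"{usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")
```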