loading many models 1 after another corrupts ollama

iplayfast commented 11 months ago

I feel this is a major bug, as anyone using ollama for an extended time using several models will have the same issue.

I'm using https://github.com/iplayfast/OllamaPlayground/tree/main/createnotes#readme which tests all the models on your system. It initially loads each model and says hello just to test. This is where the problem lies.

ollama serve Error: listen tcp 127.0.0.1:11434: bind: address already in use

These are my models:

 ollama list
NAME                            ID              SIZE    MODIFIED     
chris/mr_t:latest               e792712b8728    3.8 GB  9 hours ago     
DrunkSally:latest               7b378c3757fc    3.8 GB  7 days ago      
Guido:latest                    158599e734fb    26 GB   7 days ago      
Jim:latest                      2c7476fb37de    3.8 GB  6 weeks ago     
Mario:latest                    902e3a8e5ed7    3.8 GB  6 weeks ago     
MrT:latest                      e792712b8728    3.8 GB  8 days ago      
Polly:latest                    19982222ada1    4.1 GB  5 weeks ago     
Sally:latest                    903b51bbe623    3.8 GB  10 days ago     
Ted:latest                      fdabf1286f32    4.1 GB  7 days ago      
alfred:latest                   e46325710c52    23 GB   4 weeks ago     
bakllava:latest                 3dd68bd4447c    4.7 GB  4 days ago      
codebooga:latest                05b83c5673dc    19 GB   5 weeks ago     
codellama:latest                8fdf8f752f6e    3.8 GB  3 weeks ago     
codeup:latest                   54289661f7a9    7.4 GB  6 weeks ago     
deepseek-coder:33b              2941d6ab92f3    18 GB   4 weeks ago     
deepseek-coder:latest           140a485970a6    776 MB  2 weeks ago     
deepseek-llm:latest             9aab369a853b    4.0 GB  9 days ago      
dolphin-mixtral:latest          4b33b01bf336    26 GB   8 days ago      
everythinglm:latest             bf6610a21b1e    7.4 GB  6 weeks ago     
falcon:latest                   4280f7257e73    4.2 GB  11 hours ago    
llama2:latest                   fe938a131f40    3.8 GB  6 weeks ago     
llama2-uncensored:latest        44040b922233    3.8 GB  4 weeks ago     
llava:latest                    e4c3eb471fd8    4.5 GB  9 days ago      
magicoder:latest                8007de06f5d9    3.8 GB  2 weeks ago     
medllama2:latest                a53737ec0c72    3.8 GB  6 weeks ago     
mistral:7b                      d364aa8d131e    4.1 GB  6 weeks ago     
mistral:instruct                8aa307f73b26    4.1 GB  2 months ago    
mistral:latest                  8aa307f73b26    4.1 GB  2 months ago    
mistral:text                    3e3d0b9dcb6a    4.1 GB  6 weeks ago     
mistrallite:latest              5393d4f5f262    4.1 GB  6 weeks ago     
mixtralcpu:latest               8fca5114ed19    26 GB   9 hours ago     
neural-chat:latest              f4c6a8e532e8    4.1 GB  2 weeks ago     
nexusraven:latest               336957c1d527    7.4 GB  6 weeks ago     
openhermes2.5-mistral:latest    ca4cd4e8a562    4.1 GB  5 weeks ago     
orca2:13b                       a8dcfac3ac32    7.4 GB  4 weeks ago     
orca2:latest                    ea98cc422de3    3.8 GB  2 weeks ago     
phi:latest                      e22226989b6c    1.6 GB  3 days ago      
phind-codellama:latest          64cce35068a2    19 GB   5 weeks ago     
samantha-mistral:latest         f7c8c9be1da0    4.1 GB  6 weeks ago     
solar:latest                    059fdabbe6e6    6.1 GB  5 days ago      
sqlcoder:latest                 77ac14348387    4.1 GB  6 weeks ago     
stablelm-zephyr:latest          7c596e78b1fc    1.6 GB  2 weeks ago     
starling-lm:latest              0eab7e16513a    4.1 GB  3 weeks ago     
uncensored:latest               8fb4f61e2281    8.9 GB  2 days ago      
wizard-math:latest              5ab8dc2115d3    4.1 GB  9 hours ago     
wizard-vicuna-uncensored:7b     72fc3c2b99dc    3.8 GB  10 days ago     
wizard-vicuna-uncensored:latest 72fc3c2b99dc    3.8 GB  5 weeks ago     
wizardlm-uncensored:latest      886a369d74fc    7.4 GB  13 days ago     
xwinlm:latest                   0fa68068d970    3.8 GB  6 weeks ago     
yarn-mistral:latest             8e9c368a0ae4    4.1 GB  9 days ago      
yi:latest                       59e2d70c6939    3.5 GB  10 days ago     
zephyr:latest                   1629f2a8a495    4.1 GB  6 weeks ago

This is the output after loading them one after another:

python CreateNotes.py 
Attempting to load each model to see if they can be loaded
   attempting to load model chris/mr_t:latest
         model chris/mr_t:latest loaded in 4.0 seconds
   attempting to load model DrunkSally:latest
         model DrunkSally:latest loaded in 8.7 seconds
   attempting to load model Guido:latest
         model Guido:latest loaded in 33.1 seconds
   attempting to load model Jim:latest
         model Jim:latest loaded in 4.0 seconds
   attempting to load model Mario:latest
         model Mario:latest loaded in 1.2 seconds
   attempting to load model MrT:latest
         model MrT:latest loaded in 0.7 seconds
   attempting to load model Polly:latest
         model Polly:latest loaded in 37.9 seconds
   attempting to load model Sally:latest
         model Sally:latest loaded in 2.1 seconds
   attempting to load model Ted:latest
         model Ted:latest loaded in 1.7 seconds
   attempting to load model alfred:latest
         model alfred:latest loaded in 155.2 seconds
   attempting to load model bakllava:latest
         model bakllava:latest loaded in 32.8 seconds
   attempting to load model codebooga:latest
         model codebooga:latest loaded in 110.1 seconds
   attempting to load model codellama:latest
         model codellama:latest loaded in 24.7 seconds
   attempting to load model codeup:latest
         model codeup:latest loaded in 52.7 seconds
   attempting to load model deepseek-coder:33b
         model deepseek-coder:33b loaded in 107.6 seconds
   attempting to load model deepseek-coder:latest
         model deepseek-coder:latest loaded in 7.0 seconds
   attempting to load model deepseek-llm:latest
         model deepseek-llm:latest loaded in 14.4 seconds
   attempting to load model dolphin-mixtral:latest
         model dolphin-mixtral:latest loaded in 93.7 seconds
   attempting to load model everythinglm:latest
         model everythinglm:latest loaded in 48.2 seconds
   attempting to load model falcon:latest
         model falcon:latest loaded in 34.2 seconds
   attempting to load model llama2:latest
         model llama2:latest loaded in 5.9 seconds
   attempting to load model llama2-uncensored:latest
         model llama2-uncensored:latest loaded in 1.8 seconds
   attempting to load model llava:latest
         model llava:latest loaded in 6.2 seconds
   attempting to load model magicoder:latest
         model magicoder:latest loaded in 3.7 seconds
   attempting to load model medllama2:latest
         model medllama2:latest loaded in 22.5 seconds
   attempting to load model mistral:7b
         model mistral:7b loaded in 26.0 seconds
   attempting to load model mistral:instruct
         model mistral:instruct loaded in 0.1 seconds
   attempting to load model mistral:latest
         model mistral:latest loaded in 0.1 seconds
   attempting to load model mistral:text
         model mistral:text loaded in 36.3 seconds
   attempting to load model mistrallite:latest
         model mistrallite:latest loaded in 38.2 seconds
   attempting to load model mixtralcpu:latest
Error: Ollama call failed with status code 500. Details: timed out waiting for llama runner to start
         model mixtralcpu:latest ------------not loaded------------ in 182.2 seconds
   attempting to load model nexusraven:latest
         model nexusraven:latest loaded in 60.8 seconds
   attempting to load model openhermes2.5-mistral:latest
         model openhermes2.5-mistral:latest loaded in 25.0 seconds
   attempting to load model orca2:13b
         model orca2:13b loaded in 46.4 seconds
   attempting to load model orca2:latest
         model orca2:latest loaded in 25.7 seconds
   attempting to load model phi:latest
         model phi:latest loaded in 17.9 seconds
   attempting to load model phind-codellama:latest
         model phind-codellama:latest loaded in 123.8 seconds
   attempting to load model samantha-mistral:latest
         model samantha-mistral:latest loaded in 33.1 seconds
   attempting to load model solar:latest
         model solar:latest loaded in 42.2 seconds
   attempting to load model sqlcoder:latest
Timed out after 300 seconds for question: are you there

sqlcoder isn't a big model. I had originally thought meditron was the problem so I removed it. and it just went to the next one. mixtralcpu is from https://ollama.ai/chris/mixtralcpu which uses loads into memory instead of the gpu. (It loaded from command line fine).

atorr0 commented 11 months ago

Hi, Do you have tried with systemctl restart ollama.service after each attempt?

iplayfast commented 11 months ago

Yes, that does clear the problem, but of course by then the program is borked. It isn't a good fix, if that is what you are suggesting. But it does reset ollama

BruceMacD commented 10 months ago

Thanks for reporting this @iplayfast I think this could have been fixed in the most recent release. Please let me know if you're still seeing issues.

iplayfast commented 10 months ago

No, still occurs... Some thoughts:

if two users are using two models and Ollama is swapping them back and forth as needed,

Where are the conversations saved?
Is that memory being saved/restored at each swap as well, or is the memory potentially growing and eventually interfering with the swap.

-In my latest version of my software I load the largest models first and work my way down to the smallest, asking an assortment of questions, and evaluating the answers with the mistral model. In otherwords many model swaps. At about the 4th one down it dies. I'll push it so you can test yourself.

iplayfast commented 10 months ago

version 0.1.20 did better, but my torture test still killed it.

python CreateNotes.py 
mixtral:latest
notux:latest
dolphin-mixtral:latest
Guido:latest
alfred:latest
phind-codellama:latest
codebooga:latest
deepseek-coder:33b
nexusraven:latest
everythinglm:latest
orca2:13b
codeup:latest
wizardlm-uncensored:latest
eas/nous-hermes-2-solar-10.7b:latest
solar:latest
llama-pro:latest
bakllava:latest
llava:latest
falcon:latest
Error: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cdba10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cf6f90>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cd8b50>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cf7110>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cfe750>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cfe5d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cf74d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9d01f10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cd8a10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cf6b90>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9d0d650>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f80f9cf5990>: Failed to establish a new connection: [Errno 111] Connection refused'))

iplayfast commented 10 months ago

I became suspicious when after testing again, it died on falcon again. So I tried falcon on it's own. It died. I tried removing falcon and reinstalling it. Still died. The problem might be with falcon.

dhiltgen commented 10 months ago

Could you capture server logs from the time around the crash?

iplayfast commented 10 months ago

I just finished running it with version 0.1.22 and it made it much farther in the test. It now doesn't crash but seems to be stuck in some infinite loop. While the test was running I did a systemctl restart ollama and it carried on after missing a few questions. I've updated my stress test so that questions are asked first and then evaluated after so there is less swapping of llms. github repo (see above) has been updated with CreateNotes and ViewResutls and the results.json. The questions are asked from largest model to smallest.

As for server logs, where would they be located, as I can't find them?

My current models are:

ollama list
NAME                                    ID              SIZE    MODIFIED     
chris/openhermes-agent:latest           c674d4614455    5.1 GB  10 days ago     
eas/nous-hermes-2-solar-10.7b:latest    5986dba75154    6.5 GB  3 weeks ago     
DrunkSally:latest                       7b378c3757fc    3.8 GB  6 weeks ago     
Guido:latest                            158599e734fb    26 GB   6 weeks ago     
Jim:latest                              2c7476fb37de    3.8 GB  2 months ago    
Mario:latest                            902e3a8e5ed7    3.8 GB  2 months ago    
MrT:latest                              e792712b8728    3.8 GB  6 weeks ago     
Polly:latest                            19982222ada1    4.1 GB  2 months ago    
Sally:latest                            903b51bbe623    3.8 GB  6 weeks ago     
Ted:latest                              fdabf1286f32    4.1 GB  6 weeks ago     
alfred:latest                           e46325710c52    23 GB   2 months ago    
codebooga:latest                        05b83c5673dc    19 GB   2 months ago    
codellama:latest                        8fdf8f752f6e    3.8 GB  2 months ago    
codeup:latest                           54289661f7a9    7.4 GB  2 months ago    
deepseek-coder:33b                      acec7c0b0fd9    18 GB   3 weeks ago     
deepseek-coder:latest                   3ddd2d3fc8d2    776 MB  3 weeks ago     
deepseek-llm:latest                     9aab369a853b    4.0 GB  6 weeks ago     
dolphin-mistral:latest                  ecbf896611f5    4.1 GB  2 weeks ago     
dolphin-mixtral:latest                  cfada4ba31c7    26 GB   3 weeks ago     
dolphin-phi:latest                      c5761fc77240    1.6 GB  5 weeks ago     
duckdb-nsql:latest                      7a42116386ac    3.8 GB  3 days ago      
everythinglm:latest                     b005372bc34b    7.4 GB  3 weeks ago     
llama-pro:latest                        fc5c0d744444    4.7 GB  2 weeks ago     
llama2:13b                              d475bf4c50bc    7.4 GB  6 days ago      
llama2:70b                              e7f6c06ffef4    38 GB   6 days ago      
llama2:7b                               78e26419b446    3.8 GB  6 days ago      
llama2:latest                           78e26419b446    3.8 GB  3 weeks ago     
llama2-uncensored:latest                44040b922233    3.8 GB  2 months ago    
llava:latest                            cd3274b81a85    4.5 GB  3 weeks ago     
magicoder:latest                        8007de06f5d9    3.8 GB  7 weeks ago     
medllama2:latest                        a53737ec0c72    3.8 GB  2 months ago    
mistral:7b                              61e88e884507    4.1 GB  3 weeks ago     
mistral:instruct                        61e88e884507    4.1 GB  3 weeks ago     
mistral:latest                          61e88e884507    4.1 GB  3 weeks ago     
mistral:text                            d19e34de4cb6    4.1 GB  3 weeks ago     
mistrallite:latest                      5393d4f5f262    4.1 GB  2 months ago    
mixtral:latest                          7708c059a8bb    26 GB   3 weeks ago     
neural-chat:latest                      89fa737d3b85    4.1 GB  3 weeks ago     
nexusraven:latest                       483a8282af74    7.4 GB  11 days ago     
notus:latest                            43c512e16786    4.1 GB  4 weeks ago     
notux:latest                            fe14e7d66184    26 GB   4 weeks ago     
nous-hermes2-mixtral:latest             599da8dce2c1    26 GB   13 days ago     
nsfw:latest                             328546e02f6f    13 GB   3 days ago      
nsfwstoryteller:latest                  328546e02f6f    13 GB   3 days ago      
openhermes:latest                       95477a2659b7    4.1 GB  4 weeks ago     
openhermes-agent:latest                 4d82cc75e3aa    5.1 GB  11 days ago     
openhermes2.5-mistral:latest            ca4cd4e8a562    4.1 GB  2 months ago    
orca-mini:latest                        2dbd9f439647    2.0 GB  6 days ago      
orca2:13b                               a8dcfac3ac32    7.4 GB  2 months ago    
orca2:latest                            ea98cc422de3    3.8 GB  7 weeks ago     
phi:latest                              e2fd6321a5fe    1.6 GB  3 weeks ago     
phind-codellama:latest                  566e1b629c44    19 GB   3 weeks ago     
qwen:latest                             0fddaff90ef5    4.5 GB  6 days ago      
samantha-mistral:latest                 f7c8c9be1da0    4.1 GB  2 months ago    
solar:latest                            059fdabbe6e6    6.1 GB  6 weeks ago     
sqlcoder:latest                         77ac14348387    4.1 GB  2 months ago    
stable-code:latest                      aa5ab8afb862    1.6 GB  11 days ago     
stablelm-zephyr:latest                  0a108dbd846e    1.6 GB  3 weeks ago     
stablelm2:latest                        ea04e74d6b59    982 MB  3 days ago      
starling-lm:latest                      ff4752739ae4    4.1 GB  3 weeks ago     
tinydolphin:latest                      97c9685cc5db    636 MB  3 days ago      
tinyllama:latest                        2644915ede35    637 MB  3 weeks ago     
wizard-math:latest                      5ab8dc2115d3    4.1 GB  5 weeks ago     
wizard-vicuna-uncensored:7b             72fc3c2b99dc    3.8 GB  6 weeks ago     
wizard-vicuna-uncensored:latest         72fc3c2b99dc    3.8 GB  2 months ago    
wizardcoder:latest                      de9d848c1323    3.8 GB  4 weeks ago     
wizardlm-uncensored:latest              886a369d74fc    7.4 GB  7 weeks ago     
xwinlm:latest                           0fa68068d970    3.8 GB  2 months ago    
yarn-mistral:latest                     8e9c368a0ae4    4.1 GB  6 weeks ago     
yi:latest                               a86526842143    3.5 GB  3 weeks ago     
zephyr:latest                           bbe38b81adec    4.1 GB  3 weeks ago

It seemed that sqlcoder started having problems, answering questions in strange ways. The results.json file can be searched for ": No Answer due to error"

The question "what fills you with joy" running from just the command line seemed to give a very long answer. and my software failed here, I restarted the server after several hours. Perhaps that's why as it's a code completion model.

Given that, Code completion models are so different than chat models there should be a way that:

they can be recognised.
a query can have a maximum response time.
a query can have a maximum response length

dhiltgen commented 10 months ago

As for server logs, where would they be located, as I can't find them?

Depends on your platform. Check out https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md

iplayfast commented 10 months ago

yikes that's a lot of data, are you looking for anything in partiuclar? I've included a small sample of around the time. (note to self journalctl -u ollama -S "2024-01-30 17:01:45")

:14:33 FORGE ollama[2004316]: [GIN] 2024/01/29 - 03:14:33 | 200 |     208.013µs |       127.0.0.1 | POST     "/api/show"
Jan 29 03:14:33 FORGE ollama[2004316]: 2024/01/29 03:14:33 gpu.go:140: INFO CUDA Compute Capability detected: 8.9
Jan 29 03:14:33 FORGE ollama[2004316]: 2024/01/29 03:14:33 gpu.go:140: INFO CUDA Compute Capability detected: 8.9
Jan 29 03:14:33 FORGE ollama[2004316]: 2024/01/29 03:14:33 cpu_common.go:11: INFO CPU has AVX2
Jan 29 03:14:33 FORGE ollama[2004316]: 2024/01/29 03:14:33 dyn_ext_server.go:90: INFO Loading Dynamic llm server: /tmp/ollama4251586406/cuda_v11/libext_server.>
Jan 29 03:14:33 FORGE ollama[2004316]: 2024/01/29 03:14:33 dyn_ext_server.go:145: INFO Initializing llama server
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs>
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv   1:                               general.name str              = teknium
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0>
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.00000>
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, >
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  18:               tokenizer.ggml.add_bos_token bool             = true
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  19:               tokenizer.ggml.add_eos_token bool             = false
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %>
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - kv  21:               general.quantization_version u32              = 2
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - type  f32:   65 tensors
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - type q4_0:  225 tensors
Jan 29 03:14:33 FORGE ollama[2004316]: llama_model_loader: - type q6_K:    1 tensors
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_vocab: special tokens definition check successful ( 261/32002 ).
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: format           = GGUF V3 (latest)
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: arch             = llama
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: vocab type       = SPM
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_vocab          = 32002
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_merges         = 0
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_ctx_train      = 32768
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_embd           = 4096
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_head           = 32
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_head_kv        = 8
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_layer          = 32
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_rot            = 128
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_embd_head_k    = 128
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_embd_head_v    = 128
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_gqa            = 4
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_embd_k_gqa     = 1024
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_embd_v_gqa     = 1024
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_ff             = 14336
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_expert         = 0
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_expert_used    = 0
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: rope scaling     = linear
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: freq_base_train  = 10000.0
an 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: freq_scale_train = 1
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: rope_finetuned   = unknown
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: model type       = 7B
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: model ftype      = Q4_0
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: model params     = 7.24 B
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: general.name     = teknium
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: BOS token        = 1 '<s>'
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: UNK token        = 0 '<unk>'
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_print_meta: LF token         = 13 '<0x0A>'
Jan 29 03:14:33 FORGE ollama[2004316]: llm_load_tensors: ggml ctx size =    0.22 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llm_load_tensors: offloading 32 repeating layers to GPU
Jan 29 03:14:35 FORGE ollama[2004316]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 29 03:14:35 FORGE ollama[2004316]: llm_load_tensors: offloaded 33/33 layers to GPU
Jan 29 03:14:35 FORGE ollama[2004316]: llm_load_tensors:        CPU buffer size =    70.32 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llm_load_tensors:      CUDA0 buffer size =  3847.56 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: ...................................................................................................
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: n_ctx      = 2048
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: freq_base  = 10000.0
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: freq_scale = 1
Jan 29 03:14:35 FORGE ollama[2004316]: llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model:  CUDA_Host input buffer size   =    12.01 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model:      CUDA0 compute buffer size =   156.00 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model:  CUDA_Host compute buffer size =     8.00 MiB
Jan 29 03:14:35 FORGE ollama[2004316]: llama_new_context_with_model: graph splits (measure): 3
Jan 29 03:14:35 FORGE ollama[2004316]: 2024/01/29 03:14:35 dyn_ext_server.go:156: INFO Starting llama main loop
Jan 29 03:14:35 FORGE ollama[2004316]: [GIN] 2024/01/29 - 03:14:35 | 200 |  2.247827969s |       127.0.0.1 | POST     "/api/chat"
Jan 29 03:14:56 FORGE ollama[2004316]: 2024/01/29 03:14:56 dyn_ext_server.go:170: INFO loaded 0 images
Jan 29 03:14:57 FORGE ollama[2004316]: [GIN] 2024/01/29 - 03:14:57 | 200 |  358.002761ms |       127.0.0.1 | POST     "/api/chat"
Jan 29 03:15:38 FORGE ollama[2004316]: 2024/01/29 03:15:38 dyn_ext_server.go:170: INFO loaded 0 images

iplayfast commented 10 months ago

Here is the function that eventually fails


def get_answer(ollama, question, timeout=1000):
    start_time = time.time()
    result = ''
    """Get an answer from the Ollama model with a timeout."""
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(ollama, question)
        try:
            result = future.result(timeout=timeout).strip()
        except concurrent.futures.TimeoutError:
            print(f"Timed out after {timeout} seconds for question: {question}")
            result = 'No Answer due to timeout'
        except Exception as e:
            print(f"Error: {e}")
            result =  'No Answer due to error'
    end_time = time.time()
    elapsed_time = end_time - start_time
    return result.strip(), elapsed_time
# Usage in your loop remains the same

Here is log at the time of the timeout. (after 1500 seconds)

Jan 30 20:46:10 FORGE ollama[3131650]: 2024/01/30 20:46:10 dyn_ext_server.go:145: INFO Initializing llama server
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256:4a3019290402c9eadf89a3bf793102a52a2a44dd76ea7b07fca53f9cbb789a63 (version GGUF V2)
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv   1:                               general.name str              = ehartford
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - kv  18:               general.quantization_version u32              = 2
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - type  f32:   65 tensors
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - type q4_0:  225 tensors
Jan 30 20:46:10 FORGE ollama[3131650]: llama_model_loader: - type q6_K:    1 tensors
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_vocab: special tokens definition check successful ( 261/32002 ).
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: format           = GGUF V2
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: arch             = llama
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: vocab type       = SPM
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_vocab          = 32002
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_merges         = 0
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_ctx_train      = 32768
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_embd           = 4096
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_head           = 32
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_head_kv        = 8
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_layer          = 32
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_rot            = 128
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_embd_head_k    = 128
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_embd_head_v    = 128
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_gqa            = 4
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_embd_k_gqa     = 1024
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_embd_v_gqa     = 1024
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_ff             = 14336
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_expert         = 0
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_expert_used    = 0
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: rope scaling     = linear
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: freq_base_train  = 10000.0
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: freq_scale_train = 1
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: rope_finetuned   = unknown
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: model type       = 7B
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: model ftype      = Q4_0
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: model params     = 7.24 B
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: general.name     = ehartford
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: BOS token        = 1 '<s>'
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: UNK token        = 0 '<unk>'
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_print_meta: LF token         = 13 '<0x0A>'
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors: ggml ctx size =    0.22 MiB
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors: offloading 32 repeating layers to GPU
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors: offloaded 33/33 layers to GPU
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors:        CPU buffer size =    70.32 MiB
Jan 30 20:46:10 FORGE ollama[3131650]: llm_load_tensors:      CUDA0 buffer size =  3847.56 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: ..................................................................................................
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: n_ctx      = 2048
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: freq_base  = 10000.0
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: freq_scale = 1
Jan 30 20:46:11 FORGE ollama[3131650]: llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model:  CUDA_Host input buffer size   =    12.01 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model:      CUDA0 compute buffer size =   156.00 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model:  CUDA_Host compute buffer size =     8.00 MiB
Jan 30 20:46:11 FORGE ollama[3131650]: llama_new_context_with_model: graph splits (measure): 3
Jan 30 20:46:11 FORGE ollama[3131650]: 2024/01/30 20:46:11 dyn_ext_server.go:156: INFO Starting llama main loop
Jan 30 20:46:11 FORGE ollama[3131650]: 2024/01/30 20:46:11 dyn_ext_server.go:170: INFO loaded 0 images

dhiltgen commented 8 months ago

This should be resolved by #3218

ollama / ollama

loading many models 1 after another corrupts ollama #1691