Failed to integrate ollama in Nanos #2033

Closed 3 months ago

leeyiding commented 3 months ago

Hello, I'm sorry to bother you again, but I encountered some problems in the process of building ollama applications, and I need your help.

I am using ollama 0.1.31. The ollama binary can be downloaded as ollama-linux-amd64 and renamed to ollama. My directory tree is as follows:

├── config.json
├── klibs
│   └── gpu_nvidia
├── nvidia
│   ├── 535.113.01
│   │   ├── gsp_ga10x.bin
│   │   └── gsp_tu10x.bin
│   └── LICENSE
├── ollama
├── tmp
└── usr
    └── lib

The configuration file is as follows:

  "Dirs": [
  "MapDirs": {
    "/root/.ollama/*": ".ollama"
  "Args": [
  "Env": {
    "OLLAMA_HOST": "",
    "HOME": "/"
  "KlibDir": "./klibs",
  "Klibs": [
  "RunConfig": {
    "GPUs": 1,
    "Ports": ["11434"]
  "BaseVolumeSz": "2g"

Step 1: Run ./ollama serve locally to start the service, then run ./ollama pull qwen:0.5b in another terminal to pull a model.

Step 2: Stop the service started in step 1, then run ops run ollama -c config.json -n to boot Nanos. The program starts normally at this point.

Step 3: Call the API from another terminal:

curl http://localhost:11434/api/generate -d '{ "model": "qwen:0.5b","prompt": "Hello!" }'

After step 3, an error occurred. Here is the log:

running local instance
booting /root/.ops/images/ollama ...
en1: assigned
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (circleci@12a9ece3f331)  Sun Jun 23 02:11:32 AM UTC 2024
Loaded the UVM driver, major device number 0.
time=2024-06-25T11:11:45.432Z level=INFO source=images.go:804 msg="total blobs: 9"
time=2024-06-25T11:11:45.434Z level=INFO source=images.go:811 msg="total unused blobs removed: 0"
time=2024-06-25T11:11:45.434Z level=INFO source=routes.go:1118 msg="Listening on [::]:11434 (version 0.1.31)"
time=2024-06-25T11:11:45.436Z level=INFO source=payload_common.go:113 msg="Extracting dynamic libraries to /tmp/ollama3666727321/runners ..."
en1: assigned FE80::38F1:91FF:FE78:F331
time=2024-06-25T11:11:48.585Z level=INFO source=payload_common.go:140 msg="Dynamic LLM libraries [rocm_v60000 cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-06-25T11:11:48.586Z level=INFO source=gpu.go:115 msg="Detecting GPU type"
time=2024-06-25T11:11:48.587Z level=INFO source=gpu.go:265 msg="Searching for GPU management library*"
time=2024-06-25T11:11:48.588Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/tmp/ollama3666727321/runners/cuda_v11/]"
time=2024-06-25T11:11:49.767Z level=INFO source=gpu.go:120 msg="Nvidia GPU detected via cudart"
time=2024-06-25T11:11:49.768Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-25T11:11:49.854Z level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
time=2024-06-25T11:12:02.111Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-25T11:12:02.111Z level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
time=2024-06-25T11:12:02.112Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-25T11:12:02.113Z level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
time=2024-06-25T11:12:02.113Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library /tmp/ollama3666727321/runners/cuda_v11/
time=2024-06-25T11:12:02.121Z level=INFO source=dyn_ext_server.go:87 msg="Loading Dynamic llm server: /tmp/ollama3666727321/runners/cuda_v11/"
time=2024-06-25T11:12:02.122Z level=INFO source=dyn_ext_server.go:147 msg="Initializing llama server"
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /.ollama/models/blobs/sha256-fad2a06e4cc705c2fa8bec5477ddb00dc0c859ac184c34dcc5586663774161ca (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen2-beta-0_5B-Chat
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 1024
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 2816
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 16
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                qwen2.use_parallel_residual bool             = true
llama_model_loader: - kv  10:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  11:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  12:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  13:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  15:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  17:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - kv  19:                          general.file_type u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 1024
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 2816
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 0.5B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 619.57 M
llm_load_print_meta: model size       = 371.02 MiB (5.02 BPW) 
llm_load_print_meta: general.name     = Qwen2-beta-0_5B-Chat
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:        CPU buffer size =    83.46 MiB
llm_load_tensors:      CUDA0 buffer size =   287.57 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =   298.75 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   298.75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     6.00 MiB
llama_new_context_with_model: graph nodes  = 868
llama_new_context_with_model: graph splits = 2
pending_fault_complete error: page fill failed with (result:out of memory)
SIGBUS: bus error
PC=0x2778d8e9e m=7 sigcode=2 addr=0x36c29c000
signal arrived during cgo execution

goroutine 20 gp=0xc000007dc0 m=7 mp=0xc00043a808 [syscall]:
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
demand_page_done error: out of memory in multiple page faults; program killed

Process abort: SIGKILL received by thread 8
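
For a rough view of where the memory goes, the buffer sizes printed in the log above can be tallied per device. This is only a minimal Python sketch, not part of ollama or Ops: the regular expression and the function names (tally_buffers, host_and_gpu_mib) are made up for illustration. It assumes that CUDA0 buffers live in GPU memory, while CPU and CUDA_Host buffers live in the RAM assigned to the instance. The log does not list every allocation, so the host figure is a lower bound at best.

```python
import re
from collections import defaultdict

# Buffer lines copied from the log above.
LOG_EXCERPT = """\
llm_load_tensors:        CPU buffer size =    83.46 MiB
llm_load_tensors:      CUDA0 buffer size =   287.57 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =   298.75 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   298.75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     6.00 MiB
"""

# Hypothetical pattern: "<source>: <device> [<kind>] buffer size = <n> MiB".
BUFFER_LINE = re.compile(r":\s+(\w+)\s+(?:\w+\s+)?buffer size\s*=\s*([\d.]+) MiB")


def tally_buffers(log_text):
    """Sum the reported buffer sizes (MiB) per device name."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        match = BUFFER_LINE.search(line)
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


def host_and_gpu_mib(totals):
    """Split the totals into (host RAM, GPU memory).

    Assumption: devices named CUDA<number> are GPU memory; everything else
    (CPU, CUDA_Host) is counted as host RAM.
    """
    gpu = sum(v for k, v in totals.items() if re.fullmatch(r"CUDA\d+", k))
    host = sum(v for k, v in totals.items() if not re.fullmatch(r"CUDA\d+", k))
    return host, gpu


if __name__ == "__main__":
    host, gpu = host_and_gpu_mib(tally_buffers(LOG_EXCERPT))
    print(f"host buffers: {host:.2f} MiB, GPU buffers: {gpu:.2f} MiB")
```

For this excerpt the tally comes to roughly 388 MiB of host buffers and 778 MiB of GPU buffers. The listed host buffers alone do not add up to the instance's memory, so other consumers that the log does not itemize must account for the rest.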

As for why I am not using the latest version of ollama: the llama.cpp backend was originally loaded as a dynamic library, but a later commit changed it to run as a subprocess, which means that later versions cannot run in a unikernel. According to the commit description, the main purpose of that change was to fix memory leaks and stability problems. So, is there any way to solve the problem above? Looking forward to your help.
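
To make the difference between the two loading models concrete, here is a small Python sketch. It is not ollama code: it uses getpid() from the C library as a stand-in for the llama.cpp backend, and both function names are hypothetical. It assumes a POSIX system where ctypes.CDLL(None) exposes the symbols already loaded into the process. A call through a shared library runs inside the calling process, whereas the subprocess approach needs a second process, which a single-process unikernel such as Nanos cannot provide.

```python
import ctypes
import os
import subprocess
import sys


def pid_via_shared_library():
    """Call getpid() through a library loaded into this very process."""
    libc = ctypes.CDLL(None)  # POSIX only: symbols of the running process
    return libc.getpid()


def pid_via_subprocess():
    """Spawn a child process and ask it for its own pid."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getpid())"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("this process:      ", os.getpid())
    print("via shared library:", pid_via_shared_library())
    print("via subprocess:    ", pid_via_subprocess())
```

The library call reports the same pid as the caller, while the subprocess reports a different one. The second path is the one that has no equivalent when only one process can exist.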

francescolavra commented 3 months ago

I think your Nanos instance has less memory assigned to it (the default value used by Ops is 2 GB) than your application requires, and you need to give it more memory. You can set the amount of memory in the Ops configuration file by adding a "Memory" attribute in the "RunConfig" JSON object. Example to configure the instance with 4 GB:

  "RunConfig": {
    "Memory": "4G"
  }

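
If the configuration file is generated or patched by a script, the same change can be sketched in Python. This is a minimal sketch under assumptions: with_memory is a hypothetical helper, not part of Ops, and the sample dictionary only reuses the "RunConfig" keys shown earlier in this issue. The helper returns a copy so that the original configuration stays untouched.

```python
import json


def with_memory(config, memory="4G"):
    """Return a copy of an Ops config dict with RunConfig.Memory set."""
    updated = dict(config)
    run_config = dict(updated.get("RunConfig", {}))  # keep existing keys
    run_config["Memory"] = memory
    updated["RunConfig"] = run_config
    return updated


if __name__ == "__main__":
    # Sample values taken from the "RunConfig" object in the issue above.
    sample = {"RunConfig": {"GPUs": 1, "Ports": ["11434"]}}
    print(json.dumps(with_memory(sample), indent=2))
```

Writing the result back to config.json (for example with json.dump) and rerunning ops run ollama -c config.json -n would then boot the instance with the larger memory size.
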
leeyiding commented 3 months ago

Great, your approach is correct. I have another question. The new version of ollama I just mentioned uses the subprocess method to enable the llama.cpp backend. Do you have any ideas to bypass it?

eyberg commented 3 months ago

you're going to need to reverse what they did there - you might consider opening an issue w/them on it

leeyiding commented 3 months ago

OK, I got it, thank you very much for your reply