nanovms / nanos

A kernel designed to run one and only one application in a virtualized environment
https://nanos.org
Apache License 2.0

Failed to integrate ollama in Nanos #2033

Closed: leeyiding closed this issue 3 months ago

leeyiding commented 3 months ago

Hello, sorry to bother you again, but I've run into some problems while building an ollama application and need your help.

The ollama version I'm using is 0.1.31. My directory tree is shown below; the ollama binary can be downloaded from ollama-linux-amd64 and renamed to ollama.

.
├── config.json
├── klibs
│   └── gpu_nvidia
├── nvidia
│   ├── 535.113.01
│   │   ├── gsp_ga10x.bin
│   │   └── gsp_tu10x.bin
│   └── LICENSE
├── ollama
├── tmp
└── usr
    └── lib
        ├── libc.so.6
        ├── libcuda.so.1
        ├── libdl.so.2
        ├── libgcc_s.so.1
        ├── libm.so.6
        ├── libnvidia-ptxjitcompiler.so.1
        ├── libpthread.so.0
        ├── libresolv.so.2
        ├── librt.so.1
        └── libstdc++.so.6

The configuration file is as follows:

{
  "Dirs": [
    "usr",
    "tmp",
    "nvidia"
  ],
  "MapDirs": {
    "/root/.ollama/*": ".ollama"
  },
  "Args": [
    "serve"
  ],
  "Env": {
    "OLLAMA_HOST": "0.0.0.0:11434",
    "HOME": "/"
  },
  "KlibDir": "./klibs",
  "Klibs": [
    "gpu_nvidia"
  ],
  "RunConfig": {
    "GPUs": 1,
    "Ports": ["11434"]
  },
  "BaseVolumeSz": "2g"
}

Step 1: Run ./ollama serve locally to start a service, then run ./ollama pull qwen:0.5b in another terminal to pull a model.
Step 2: Terminate the service started in the previous step, then run ops run ollama -c config.json -n to boot Nanos. The program starts normally at this step.
Step 3: Call the API from another terminal:
curl http://localhost:11434/api/generate -d '{ "model": "qwen:0.5b", "prompt": "Hello!" }'

After completing step 3, an error occurred. Here is the running log.

running local instance
booting /root/.ops/images/ollama ...
en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (circleci@12a9ece3f331)  Sun Jun 23 02:11:32 AM UTC 2024
Loaded the UVM driver, major device number 0.
time=2024-06-25T11:11:45.432Z level=INFO source=images.go:804 msg="total blobs: 9"
time=2024-06-25T11:11:45.434Z level=INFO source=images.go:811 msg="total unused blobs removed: 0"
time=2024-06-25T11:11:45.434Z level=INFO source=routes.go:1118 msg="Listening on [::]:11434 (version 0.1.31)"
time=2024-06-25T11:11:45.436Z level=INFO source=payload_common.go:113 msg="Extracting dynamic libraries to /tmp/ollama3666727321/runners ..."
en1: assigned FE80::38F1:91FF:FE78:F331
time=2024-06-25T11:11:48.585Z level=INFO source=payload_common.go:140 msg="Dynamic LLM libraries [rocm_v60000 cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-06-25T11:11:48.586Z level=INFO source=gpu.go:115 msg="Detecting GPU type"
time=2024-06-25T11:11:48.587Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libcudart.so*"
time=2024-06-25T11:11:48.588Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/tmp/ollama3666727321/runners/cuda_v11/libcudart.so.11.0]"
time=2024-06-25T11:11:49.767Z level=INFO source=gpu.go:120 msg="Nvidia GPU detected via cudart"
time=2024-06-25T11:11:49.768Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-25T11:11:49.854Z level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
time=2024-06-25T11:12:02.111Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-25T11:12:02.111Z level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
time=2024-06-25T11:12:02.112Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-25T11:12:02.113Z level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
time=2024-06-25T11:12:02.113Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library /tmp/ollama3666727321/runners/cuda_v11/libext_server.so
time=2024-06-25T11:12:02.121Z level=INFO source=dyn_ext_server.go:87 msg="Loading Dynamic llm server: /tmp/ollama3666727321/runners/cuda_v11/libext_server.so"
time=2024-06-25T11:12:02.122Z level=INFO source=dyn_ext_server.go:147 msg="Initializing llama server"
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /.ollama/models/blobs/sha256-fad2a06e4cc705c2fa8bec5477ddb00dc0c859ac184c34dcc5586663774161ca (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen2-beta-0_5B-Chat
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 1024
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 2816
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 16
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                qwen2.use_parallel_residual bool             = true
llama_model_loader: - kv  10:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  11:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  12:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  13:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  15:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  17:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - kv  19:                          general.file_type u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 1024
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 2816
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 0.5B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 619.57 M
llm_load_print_meta: model size       = 371.02 MiB (5.02 BPW) 
llm_load_print_meta: general.name     = Qwen2-beta-0_5B-Chat
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:        CPU buffer size =    83.46 MiB
llm_load_tensors:      CUDA0 buffer size =   287.57 MiB
...............................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =   298.75 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   298.75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     6.00 MiB
llama_new_context_with_model: graph nodes  = 868
llama_new_context_with_model: graph splits = 2
pending_fault_complete error: page fill failed with (result:out of memory)
SIGBUS: bus error
PC=0x2778d8e9e m=7 sigcode=2 addr=0x36c29c000
signal arrived during cgo execution

goroutine 20 gp=0xc000007dc0 m=7 mp=0xc00043a808 [syscall]:
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
pending_fault_complete error: page fill failed with (result:out of memory)
demand_page_done error: out of memory in multiple page faults; program killed

Process abort: SIGKILL received by thread 8

lastvector: 000000000000000e (Page fault)
     frame: ffffc00002a01800
      type: thread
active_cpu: 00000000ffffffff
 stack top: 0000000000000000
error code: 0000000000000004
   address: 00000000118cb72e

   rax: 00000000118cb700
   rbx: 00000000000008d2
   rcx: 00000000000008d2
   rdx: 0000000000000000
   rsi: 00000000118cb72e
   rdi: 00000000118cb72e
   rbp: 000000c00049f398
   rsp: 000000c00049f348
    r8: 000000c00049f368
    r9: 0000000000000000
   r10: 0000000000000000
   r11: 00000000118cbfe0
   r12: 000000000000280e
   r13: 000000000000007c
   r14: 000000c00045ca80
   r15: 0000000000257128
   rip: 0000000000408420
rflags: 0000000000010246
    ss: 000000000000002b
    cs: 0000000000000023
    ds: 0000000000000000
    es: 0000000000000000
fsbase: 0000000101bfa6c0
gsbase: 0000000000000000

frame trace:
000000c00049f3a0:   000000000045dbb0
000000c00049f3d8:   0000000000466f16
000000c00049f638:   00000000004669c6
000000c00049f700:   000000000046682f
000000c00049f908:   00000000004665e8
000000c00049f940:   0000000000455c7f
000000c00049f9b0:   000000000045548e
000000c00049fa28:   0000000000475e26
000000c00049fa78:   0000000277872050
0000000101bf8698:   0000000109e2da0f
0000000101bf87a8:   0000000109d2f5da
0000000101bf89b8:   0000000109d3006f
0000000101bf8aa8:   00000001a450ecd4

kernel load offset fffffffff3c79000

loaded klibs: gpu_nvidia@0xffffffff96ce9000/0x81e000 

stack trace:
000000c00049f348:   000000000045c61f
000000c00049f350:   00000000118cb72e
000000c00049f358:   00000000000008d2
000000c00049f360:   0000000000000000
000000c00049f368:   0000000000000008
000000c00049f370:   00000000000008d2
000000c00049f378:   0000000000000000
000000c00049f380:   00000000118cb72e
000000c00049f388:   00000000118cb72e
000000c00049f390:   00000000000008d2
000000c00049f398:   000000c00049f3d0
000000c00049f3a0:   000000000045dbb0
000000c00049f3a8:   0000000000000008
000000c00049f3b0:   000000c00049f3d0
000000c00049f3b8:   00000000118cb72e
000000c00049f3c0:   0000000000000008
000000c00049f3c8:   0000000000000008
000000c00049f3d0:   000000c00049f630
000000c00049f3d8:   0000000000466f16
000000c00049f3e0:   000000c00049f730
000000c00049f3e8:   000000c0004a7080
000000c00049f3f0:   0000000000000008
000000c00049f3f8:   0000000000000008
000000c00049f400:   000000c0004a70a0
000000c00049f408:   000000c00049f4d8
000000c00049f410:   ffffffff0000280e
000000c00049f418:   0000000211a1a5d9
000000c00049f420:   0000000011f11dc0
000000c00049f428:   0000000000000008
000000c00049f430:   000000c0004a70e0
000000c00049f438:   000000000040a72a
000000c00049f440:   0000000000000008

Terminating.

As for why I don't use the latest version of ollama: the llama.cpp backend was originally loaded as a dynamic library, but in https://github.com/ollama/ollama/commit/58d95cc9bd446a8209e7388a96c70367cbafd653 it was changed to be launched as a subprocess, which means later versions cannot run in a unikernel. According to the commit description, the main purpose of the change was to address memory leaks and stability defects. So, is there any way to solve the problem above? Looking forward to your help.

francescolavra commented 3 months ago

I think your Nanos instance has less memory assigned to it than your application requires (the default used by Ops is 2 GB), and you need to give it more memory. You can set the amount of memory in the Ops configuration file by adding a "Memory" attribute to the "RunConfig" JSON object. For example, to configure the instance with 4 GB:

  "RunConfig": {
    "Memory": "4G"
}
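
Merged into the config.json from the issue description, the "RunConfig" section would then look something like this (4G is just an example value; size it to what the model actually needs):

  "RunConfig": {
    "GPUs": 1,
    "Ports": ["11434"],
    "Memory": "4G"
  }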
leeyiding commented 3 months ago

Great, your suggestion works. I have another question: the newer version of ollama I mentioned launches the llama.cpp backend as a subprocess. Do you have any ideas for working around that?

eyberg commented 3 months ago

you're going to need to reverse what they did there - you might consider opening an issue w/them on it

leeyiding commented 3 months ago

OK, got it, thank you very much for your reply.