second-state / WasmEdge-WASINN-examples

Very slow and issues in Ubuntu WSL with CUDA #67

Closed eramax closed 2 months ago

eramax commented 6 months ago

With my NVIDIA 1050 Ti and Ubuntu 23.04 on WSL, the model takes a long time to load (about three minutes), but the response is, I believe, faster than with other tools. Once the first question was answered, the program crashed.

➜  llama-2-7b-chat-Q5_K_M-gguf wasmedge --dir .:. \
  --env stream_stdout=true \
  --env enable_log=false \
  --env ctx_size=512 \
  --env n_predict=512 \
  --env n_gpu_layers=20 \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf \
  wasmedge-ggml-llama-interactive.wasm default
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1
Question:
what is NVIDIA GTX 1080 8G
Answer:
  NVIDIA GTX 1080 8G is a high-performance graphics processing unit (GPU) designed for gaming and professional applications. It is part of the NVIDIA GeForce GTX 1080 series and features 8GB of GDDR5X memory. The "8G" in the name refers to the amount of onboard memory.
The GTX 1080 8G is known for its exceptional performance in demanding games and applications, offering fast frame rates and smooth graphics. It is also well-suited for tasks such as 3D modeling, video editing, and scientific simulations.
Some of the key features of the GTX 1080 8G include:
* 8GB of GDDR5X memory
* 2560 CUDA cores
* 1506MHz base clock speed
* 1733MHz boost clock speed
* 1128 texture units
* 64 ROPs
* 256-bit memory interface
* Support for DirectX 12 and Vulkan APIs
* Requires a 6-pin and 8-pin power connector

Overall, the NVIDIA GTX 1080 8G is a powerful and high-performance GPU that is well-suited for gaming and professional applications that require fast graphics processing.
Question:
how touse curl
Answer:
[2023-11-14 21:22:05.369] [info] [WASI-NN] GGML backend: llama_decode() failed
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: BackendError(RuntimeError)', src/main.rs:111:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2023-11-14 21:22:05.369] [error] execution failed: unreachable, Code: 0x89
[2023-11-14 21:22:05.369] [error]     In instruction: unreachable (0x00) , Bytecode offset: 0x000107ad
[2023-11-14 21:22:05.369] [error]     When executing function name: "_start"
➜  llama-2-7b-chat-Q5_K_M-gguf
eramax commented 6 months ago

This is the second try:

llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 4.45 GiB (5.68 BPW)
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.10 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   86.04 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4474.93 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size = 76.63 MB
llama_new_context_with_model: VRAM scratch buffer: 70.50 MB
llama_new_context_with_model: total VRAM used: 4801.43 MB (model: 4474.93 MB, context: 326.50 MB)
Answer:
[2023-11-14 21:35:13.952] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
  WSL (Windows Subsystem for Linux) is a feature in Windows 10 that allows users to run a full Linux environment directly on top of the Windows operating system. This means that users can run Linux commands, applications, and tools alongside Windows applications, without the need to install a separate Linux operating system. WSL is designed to provide a seamless and efficient way to use Linux on Windows, while still taking advantage of the security and stability of the Windows platform.
llama_print_timings:        load time =  129174.03 ms
llama_print_timings:      sample time =       3.85 ms /    98 runs   (    0.04 ms per token, 25461.16 tokens per second)
llama_print_timings: prompt eval time =    1295.73 ms /    43 tokens (   30.13 ms per token,    33.19 tokens per second)
llama_print_timings:        eval time =   21818.37 ms /    97 runs   (  224.93 ms per token,     4.45 tokens per second)
llama_print_timings:       total time =  151004.54 ms

This time it didn't crash after the first question, but it always unloads the model from the GPU, and when I ask another question it reloads it, which means it takes 2-3 minutes to load the model again. I also noticed that this app doesn't put much load on the CPU, unlike other apps (e.g., ollama).

Please help to fix the load time; I think this app could be a perfect solution.

Best,

juntao commented 6 months ago

The 1050 Ti only has 4 GB of VRAM, and the quantized 7B model alone is about 4.5 GB (your log shows 4474.93 MB of VRAM used for the weights), so it barely fits. When you ask more questions, the context length grows and demands more GPU RAM, which causes the failure.

One thing you could try is to reduce the number of layers offloaded to the GPU via the n_gpu_layers option (the equivalent of llama.cpp's -ngl flag), say setting it to 10. Inference will be slower, but the model can then fall back on CPU RAM.
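
For illustration, this is the interactive command from the first comment with only n_gpu_layers reduced to 10 (a sketch; the same model file and wasm module as above are assumed):

wasmedge --dir .:. \
  --env stream_stdout=true \
  --env enable_log=false \
  --env ctx_size=512 \
  --env n_predict=512 \
  --env n_gpu_layers=10 \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf \
  wasmedge-ggml-llama-interactive.wasm default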

juntao commented 6 months ago

@hydai Can you look into the "reloading after each question" problem? I thought we addressed it on our test machines? Thanks.

eramax commented 6 months ago

I tried llama-chat.wasm and it didn't work:

➜  llama-2-7b-chat-Q5_K_M-gguf wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
   llama-chat.wasm -c 2048 -n 512 --log-stat
[INFO] Model alias: default
[INFO] Prompt context size: 2048
[INFO] Number of tokens to predict: 512
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 4096
[INFO] Use default system prompt
[INFO] Prompt template: Llama2Chat
[INFO] Stream stdout: false
[INFO] Log prompts: false
[INFO] Log statistics: true
[INFO] Log all information: false
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1
[2023-11-14 21:59:22.222] [error] [WASI-NN] GGML backend: Error: unable to init model.
Error: "Fail to load model into wasi-nn: Backend Error: WASI-NN Backend Error: Caller module passed an invalid argument"
➜  llama-2-7b-chat-Q5_K_M-gguf

The app is really fast, but I have to wait for loading and reloading every time, which is annoying. I wish the app could use both GPU RAM and system RAM (other tools already manage this, and with them I can run 34B models). The reload seems to happen when I submit a new question. I already reduced the number of GPU layers, and I can see my GPU memory is at around 50%, but it still crashed on the second question while the GPU RAM was being freed.

llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 4.45 GiB (5.68 BPW)
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.10 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 3178.93 MB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/35 layers to GPU
llm_load_tensors: VRAM used: 1382.04 MB
...................................................................................................
[2023-11-14 22:11:25.780] [error] [WASI-NN] GGML backend: Error: prompt too long (524 tokens, max 508)
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: BackendError(InvalidArgument)', src/main.rs:105:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2023-11-14 22:11:25.780] [error] execution failed: unreachable, Code: 0x89
[2023-11-14 22:11:25.780] [error]     In instruction: unreachable (0x00) , Bytecode offset: 0x000107ad
[2023-11-14 22:11:25.780] [error]     When executing function name: "_start"
juntao commented 6 months ago

You could specify fewer GPU layers using the -g option in llama-chat.wasm. Perhaps try -g 15?

Yes, the model reloading is a problem we need to fix. Thank you for reporting this!

wasmedge llama-chat.wasm --help
Usage: llama-chat.wasm [OPTIONS]

Options:
  -m, --model-alias <ALIAS>
          Model alias [default: default]
  -c, --ctx-size <CTX_SIZE>
          Size of the prompt context [default: 4096]
  -n, --n-predict <N_PRDICT>
          Number of tokens to predict [default: 1024]
  -g, --n-gpu-layers <N_GPU_LAYERS>
          Number of layers to run on the GPU [default: 100]
  -b, --batch-size <BATCH_SIZE>
          Batch size for prompt processing [default: 4096]
  -r, --reverse-prompt <REVERSE_PROMPT>
          Halt generation at PROMPT, return control.
  -s, --system-prompt <SYSTEM_PROMPT>
          System prompt message string [default: "[Default system message for the prompt template]"]
  -p, --prompt-template <TEMPLATE>
          Prompt template. [default: llama-2-chat] [possible values: llama-2-chat, codellama-instruct, mistral-instruct-v0.1, mistrallite, openchat, belle-llama-2-chat, vicuna-chat, chatml]
      --log-prompts
          Print prompt strings to stdout
      --log-stat
          Print statistics to stdout
      --log-all
          Print all log information to stdout
      --stream-stdout
          Print the output to stdout in the streaming way
  -h, --help
          Print help
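
For instance, applying the -g suggestion to the earlier llama-chat.wasm invocation could look like this (a sketch assuming the same model file and options as above):

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm -c 2048 -n 512 -g 15 --log-stat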
niranjanakella commented 6 months ago

@eramax you can try a quantized TinyLlama 1.1B if you wish to run on the 1050 Ti; the model is much smaller and should be able to handle a few turns of conversation. But as @juntao mentioned, you will still be limited in how many questions you can ask before the context outgrows the available memory.
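
For example, the interactive command from the top of this thread could point at a quantized TinyLlama build instead. This is only a sketch; the GGUF filename below is a placeholder for whichever quantized TinyLlama file you download:

# The model filename here is a placeholder; substitute your quantized TinyLlama GGUF.
wasmedge --dir .:. \
  --env ctx_size=512 \
  --env n_predict=512 \
  --env n_gpu_layers=20 \
  --nn-preload default:GGML:AUTO:tinyllama-1.1b-chat.Q5_K_M.gguf \
  wasmedge-ggml-llama-interactive.wasm default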

hydai commented 6 months ago

> @hydai Can you look into the "reloading after each question" problem? I thought we addressed it on our test machines? Thanks.

Hi @eramax Due to a workaround for the current WASI-NN spec (the interface this plugin uses), we load the model twice: the first load happens without ngl set, and the second reloads the model once ngl is set. Loading the model twice takes time.

We know this is expected behavior for now, but it's annoying, so we are creating an extension to the current WASI-NN spec to prevent it from happening. This feature will ship in our next release.

Making full use of the hardware (e.g., CPU+GPU co-working) is also important to us, and we will keep working to make this happen.

hydai commented 5 months ago

Hi @eramax We've updated the plugin. You can get the latest version by re-running the installer. The new version should solve the reloading issue. However, we are still working on CPU+GPU co-working. Thanks.

hydai commented 2 months ago

Since there are no more updates, closing this.