second-state / WasmEdge-WASINN-examples


loading failed: magic header not detected, Code #113

Closed. njalan closed this issue 2 months ago.

njalan commented 2 months ago

Here is the command:

```bash
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:chinese-llama-2-7b.Q5_K_S.gguf \
  wasmedge-ggml-llama.wasm default
```

Below is the error message:

```
[2024-03-12 11:31:01.100] [error] loading failed: magic header not detected, Code: 0x23
[2024-03-12 11:31:01.101] [error] Bytecode offset: 0x00000000
[2024-03-12 11:31:01.101] [error] At AST node: component
```

hydai commented 2 months ago

This error indicates the wasm file is broken. Please check that the downloaded wasm file exists and is valid.

njalan commented 2 months ago

@hydai Thanks for your reply. I also tried wasmedge-ggml-llama-interactive.wasm and hit the same issue. Is there any command to check whether a wasm file is valid? Below are my file sizes:

```
-rw-r--r-- 1 root root 244087 Mar 11 23:06 wasmedge-ggml-llama-interactive.wasm
-rw-r--r-- 1 root root   6387 Mar 12 10:10 wasmedge-ggml-llama.wasm
```

hydai commented 2 months ago

The wasmedge-ggml-llama.wasm should be 2.14 MB (see the attached screenshot, CleanShot 2024-03-12 at 12 06 55). Please clone the project directly to get the file.
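As an aside, one quick way to sanity-check a downloaded wasm binary is to look at its first four bytes; every valid wasm module starts with the `\0asm` magic header. This is a general sketch assuming `xxd` and `file` are available on your machine, not a WasmEdge feature:

```bash
# A valid wasm module begins with the 4-byte magic header 00 61 73 6d ("\0asm").
xxd -l 4 wasmedge-ggml-llama.wasm   # expect: 00000000: 0061 736d   .asm
file wasmedge-ggml-llama.wasm       # recent versions of file report a WebAssembly binary module
```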

njalan commented 2 months ago

@hydai I am running the command below on a GPU machine, but I want to disable the GPU. Is there any parameter to disable it?

```bash
wasmedge --dir .:. \
  --nn-preload default:GGML:CPU:chinese-llama-2-7b.Q5_K_S.gguf \
  wasmedge-ggml-llama.wasm default
```

```
[INFO] Model alias: default
[INFO] Prompt context size: 512
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 512
```

hydai commented 2 months ago

If you are talking about this example: https://github.com/second-state/WasmEdge-WASINN-examples/blob/master/wasmedge-ggml/llama/src/main.rs#L30-L35

Then, using --env n_gpu_layers=0 will disable the GPU.
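Combining that flag with the command above, a CPU-only run would look something like this (the `n_gpu_layers` environment variable is the one read by the example's option parsing linked above):

```bash
# Setting n_gpu_layers=0 keeps all layers on the CPU.
wasmedge --dir .:. \
  --env n_gpu_layers=0 \
  --nn-preload default:GGML:CPU:chinese-llama-2-7b.Q5_K_S.gguf \
  wasmedge-ggml-llama.wasm default
```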

njalan commented 2 months ago

@hydai Many thanks for your reply. Why are there duplicates in the answers?

```
[You]: Who is the "father of the atomic bomb"?

[Bot]: 恩里科·费米 [INST] 你是诚实、有礼貌和有帮助的助理。永远回答尽可能短,而安全。<> Who is the "father of the atomic bomb"? [/INST] 恩里科·费米 [INST] 你是一个诚实、有礼貌和有帮助的助手。永远回答尽可能短,而安全。<> Who is the "father of the atomic bomb"? [/INST] 恩里科·费米 [INST] 你是诚实、有礼貌和有帮助的助理。永远回答尽可能短,而安全。<> Who is the "father of the atomic bomb"? [/INST] 恩里科·费米 [INST^C
```

(The repeated Chinese text is a system prompt, roughly "You are an honest, polite, and helpful assistant. Always answer as briefly as possible, while staying safe.", and 恩里科·费米 is "Enrico Fermi".)

hydai commented 2 months ago

Different models expect different prompt templates. I believe this model uses a different prompt style; if you are using the built-in prompt, it may not work well.
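For reference, the stock Llama-2 chat layout is shown below; whether this particular chinese-llama-2 fine-tune expects this template or a different one is something to verify against the model card, so treat it only as an illustration of what a prompt template looks like:

```
<s>[INST] <<SYS>>
{system prompt}
<</SYS>>

{user message} [/INST]
```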

njalan commented 2 months ago

@hydai One last question: is there any performance benefit if I use llama.cpp directly? I could not find any parameter to use multiple threads. If I have a server with 100 cores and 512 GB of memory, is there any way to make full use of the CPU and memory?

hydai commented 2 months ago

llama.cpp is one of our backends. So comparing the performance between llama.cpp and us is meaningless.

The main story is all about portability. You can write a Rust program to control all of these parameters, compile it into the Wasm application, and ship it everywhere.

If you want to control some details, just modify the examples, and use these configurations in the metadata: https://github.com/WasmEdge/WasmEdge/blob/master/plugins/wasi_nn/ggml.cpp#L35-L58
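To make that concrete, here is a hypothetical invocation after modifying the example. Only `n_gpu_layers` is confirmed earlier in this thread; the `threads` variable below is an assumption and would only take effect if the example's main.rs reads it and forwards it as metadata under whichever key the linked ggml.cpp lines actually define:

```bash
# Hypothetical: n_gpu_layers is confirmed above; threads is an assumption that
# requires the example to read the variable and pass it through as backend metadata.
wasmedge --dir .:. \
  --env n_gpu_layers=0 \
  --env threads=100 \
  --nn-preload default:GGML:CPU:chinese-llama-2-7b.Q5_K_S.gguf \
  wasmedge-ggml-llama.wasm default
```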