ngxson / wllama

WebAssembly binding for llama.cpp - Enabling in-browser LLM inference
https://ngxson.github.io/wllama/examples/basic/

Wllama doesn't load the provided chunks #44

Closed · flatsiedatsie closed this issue 1 month ago

flatsiedatsie commented 1 month ago

I was doing a little experiment, trying to see what would happen if I loaded the WebLLM chunks instead of manually chunking the model.

Why? Well, it could be a fun way of re-using the same model data for both WebLLM and Wllama, making it possible to switch between WebGPU and non-WebGPU on the fly without having to re-download the model.

So I downloaded the WebLLM model to my local models folder, and then added the list of chunks as I had done before:

"download_url":[
            "/models/phi3_webllm/params_shard_1.bin",
            "/models/phi3_webllm/params_shard_2.bin",
            "/models/phi3_webllm/params_shard_3.bin",
            "/models/phi3_webllm/params_shard_4.bin",
                         ....
            "/models/phi3_webllm/params_shard_78.bin",
            "/models/phi3_webllm/params_shard_79.bin",
            "/models/phi3_webllm/params_shard_80.bin",
            "/models/phi3_webllm/params_shard_81.bin",
            "/models/phi3_webllm/params_shard_82.bin"
        ],

But this results in an error:

[Screenshot of the error, 2024-05-16 at 12:03]

It does seem to understand that there are 82 shards.

I guess the model shards must be in the exact split-GGUF format, with filenames ending in -00001-of-00010.gguf?
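(For context: llama.cpp's gguf-split tool produces parts named with exactly that pattern. A quick sketch of what the expected shard URL list would look like, where the base name and part count are made up for illustration:)

```ts
// Hypothetical example: building the URL list for a model split with
// llama.cpp's gguf-split tool, whose parts are named *-00001-of-000NN.gguf.
// The base path and part count here are invented for illustration.
const totalParts = 10;
const pad = (n: number): string => String(n).padStart(5, "0");
const shardUrls = Array.from(
  { length: totalParts },
  (_, i) => `/models/phi3/model-${pad(i + 1)}-of-${pad(totalParts)}.gguf`
);
```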

ngxson commented 1 month ago

The GGUF format has magic bytes in its header. If you try to load these files with native llama.cpp, it will fail with the same error.
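(For context: a GGUF file starts with the four ASCII bytes "GGUF". A minimal check like the sketch below, which is not part of wllama's API, would fail for WebLLM's params_shard_*.bin files.)

```ts
// Minimal sketch: check whether a remote file begins with the GGUF magic
// bytes ("GGUF" = 0x47 0x47 0x55 0x46). WebLLM's params_shard_*.bin files
// do not, which is why the loader rejects them.
async function hasGgufMagic(url: string): Promise<boolean> {
  // Ask for just the first 4 bytes; servers that ignore Range return the whole file.
  const res = await fetch(url, { headers: { Range: "bytes=0-3" } });
  const bytes = new Uint8Array(await res.arrayBuffer());
  return (
    bytes.length >= 4 &&
    bytes[0] === 0x47 && // 'G'
    bytes[1] === 0x47 && // 'G'
    bytes[2] === 0x55 && // 'U'
    bytes[3] === 0x46    // 'F'
  );
}
```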

Even if you manage to add the magic bytes, the file layout is not the same. Think of it like PNG vs JPEG: they are both picture formats, but not the same data structure.

flatsiedatsie commented 1 month ago

Ah, so WebLLM is not using .gguf? Interesting.