mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Model Request] Stable-LM 1.6b #1642

Closed federicoparra closed 7 months ago

federicoparra commented 7 months ago

⚙️ Request New Models

https://stability.ai/news/introducing-stable-lm-2

Model weights: https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b

Just came out! A new version of StableLM with only 1.6B parameters and MULTILINGUAL capability, which reportedly outperforms Phi-1.5 (and approaches Mistral 7B in other languages)!

federicoparra commented 7 months ago

The model is IMPRESSIVE. https://huggingface.co/spaces/stabilityai/stablelm-2-1_6b-zephyr

It is VERY WELL trained to follow instructions, like summarization, and because it was trained in several languages it's BRILLIANT at translation (English to Spanish and to French, the ones I tried because I speak them).

So here we have a model almost as small as Phi-1.5, but one that is fine-tuned out of the box and multilingual.

We need this :)

federicoparra commented 7 months ago

@jeethu @Sing-Li I know you already support larger StableLM models. I wonder what it would entail to support this one. Do let me know if I can help with anything; I do need this model running on the Orange Pi!

jeethu commented 7 months ago

I took a brief look at it yesterday. The architecture is almost identical to the larger Stable LM 3B model, with only one difference: They've added bias terms to Q, K and V projections in attention. A bigger concern IMO is the tokenizer. They're using the QWen tokenizer (tiktoken based) and tokenizers-cpp currently only supports Huggingface and sentencepiece based tokenizers.
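
For illustration, here is a minimal PyTorch-style sketch of the difference described above. This is not MLC's or Stability's actual model code; the class shape, the `qkv_bias` flag, and the dimensions are assumptions made for clarity:

```python
import torch.nn as nn

class Attention(nn.Module):
    """Sketch of the attention projections only; rotary/GQA details omitted."""

    def __init__(self, hidden_size: int, num_heads: int, qkv_bias: bool):
        super().__init__()
        self.head_dim = hidden_size // num_heads
        # Per the comment above: the 3B StableLM model uses bias-free Q/K/V
        # projections, while the 1.6B checkpoint adds bias terms to them.
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=qkv_bias)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=qkv_bias)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=qkv_bias)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

# Hidden sizes below are illustrative assumptions, not read from config.json.
attn_3b   = Attention(hidden_size=2560, num_heads=32, qkv_bias=False)  # StableLM 3B style
attn_1_6b = Attention(hidden_size=2048, num_heads=32, qkv_bias=True)   # StableLM 2 1.6B style
```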

federicoparra commented 7 months ago

So supporting this model would entail adding support for a new tokenizer (QWen) to tokenizers-cpp, correct? Is that trivial or complex? I'm not proficient in C++, but I can still write some.

junrushao commented 7 months ago

The good news is that QWen’s tokenizer is already supported in MLC ;))

federicoparra commented 7 months ago

OK, so that means it's just a matter of adding the bias terms to Q, K and V in attention?

federicoparra commented 7 months ago

@junrushao Is there any way I can help bring this model into the fold? Let me know!

federicoparra commented 7 months ago

@junrushao @jeethu Hi guys, sorry for bothering you; I'm assuming you must be busy. Is there anything I can do to make it work on my own? I really need this for a project I'm working on. Thanks a lot!

DavidGOrtega commented 7 months ago

> The good news is that QWen’s tokenizer is already supported in MLC ;))

@junrushao Are you sure it's the same? I don't know the QWen model deeply, but StableLM seems to be using tiktoken:

```python
# Excerpt from the StableLM 2 tokenizer implementation: it builds a
# tiktoken Encoding directly instead of using a Hugging Face tokenizer.json.
super().__init__(errors=errors, **kwargs)
self._tiktoken_config = _arcade100k(vocab_file)
self.tokenizer = tiktoken.Encoding(**self._tiktoken_config)
```
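
For context, here is a minimal tiktoken round trip. The model's own `arcade100k` encoding is not bundled with tiktoken, so this sketch uses the built-in `cl100k_base` encoding purely to show the API shape the C++ side would need to mirror:

```python
import tiktoken

# Stand-in only: cl100k_base is a built-in encoding, not the arcade100k
# encoding that StableLM 2 actually ships.
enc = tiktoken.get_encoding("cl100k_base")
text = "Stable LM 2 1.6B uses a tiktoken-based tokenizer."
ids = enc.encode(text)
assert enc.decode(ids) == text
print(f"{len(ids)} tokens")
```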
federicoparra commented 7 months ago

Hi guys, any news on this by any chance? @jeethu @junrushao

federicoparra commented 7 months ago

> I took a brief look at it yesterday. The architecture is almost identical to the larger Stable LM 3B model, with only one difference: They've added bias terms to Q, K and V projections in attention. A bigger concern IMO is the tokenizer. They're using the QWen tokenizer (tiktoken based) and tokenizers-cpp currently only supports Huggingface and sentencepiece based tokenizers.

I managed to enable the bias terms in the model (file attached: stablelm16_model.zip), but since `mlc_chat convert_weight` is not working even with StableLM 3B, I don't know how to proceed any further.

DavidGOrtega commented 7 months ago

Hey @federicoparra, let's try! Why don't you create a fork and a PR? It would be easier to work with.

federicoparra commented 7 months ago

I see that https://github.com/mlc-ai/mlc-llm/issues/1616 has not been resolved yet.

I'm first trying to make StableLM 3B work using the instructions in https://github.com/mlc-ai/mlc-llm/pull/1008:

```
python build.py --model path/to/stablelm-3b-4e1t --quantization q4f16_1 --target metal --use-safetensors
```

So far I'm hitting an error: `safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge`

I'm trying to figure it out.

Once/if I'm able to get compiled weights and a model using this approach (which I had never used before, and which doesn't seem to be in the MLC documentation), then I'll try to do the same with the 1.6B model (with the model file altered, as I shared earlier, to have bias for the Q, K and V projections in attention).

I'll post here with results...

federicoparra commented 7 months ago

It was a silly LFS error (the safetensors file was only the LFS pointer, not the raw file). It should work now; I'll report back a little later.
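
In case anyone else hits the same `HeaderTooLarge` error: Git LFS pointer files start with a fixed version line, so a quick sanity check (the local path here is an assumption) looks like this:

```python
# Sketch: detect whether model.safetensors is a Git LFS pointer rather than
# the real weights file. LFS pointer files begin with this version line.
from pathlib import Path

path = Path("stablelm-2-zephyr-1_6b/model.safetensors")  # hypothetical local path
head = path.read_bytes()[:64]
if head.startswith(b"version https://git-lfs.github.com/spec/v1"):
    print("This is an LFS pointer; run `git lfs pull` inside the checkout.")
else:
    print(f"Looks like a real safetensors file ({path.stat().st_size} bytes)")
```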

federicoparra commented 7 months ago

The build process is ending abruptly on Google Colab, configured with:

```
%pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly-cu122
%pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu122 mlc-ai-nightly-cu122
```

Using path "/content/stablelm-3b-4e1t" for model "stablelm-3b-4e1t" Target configured: cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32 Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32 Get old param: 0% 0/260 [00:00<?, ?tensors/s] Set new param: 0% 0/390 [00:00<?, ?tensors/s]Start computing and quantizing weights... This may take a while. Get old param: 0% 1/260 [00:02<11:09, 2.58s/tensors]^C

federicoparra commented 7 months ago

> I took a brief look at it yesterday. The architecture is almost identical to the larger Stable LM 3B model, with only one difference: They've added bias terms to Q, K and V projections in attention. A bigger concern IMO is the tokenizer. They're using the QWen tokenizer (tiktoken based) and tokenizers-cpp currently only supports Huggingface and sentencepiece based tokenizers.

I also don't understand this: if the only difference between a 3B-parameter model and a 1.6B-parameter model were the addition of three bias terms (for Q, K and V), wouldn't the resulting variant be (a bit) larger, not almost half the size (in terms of parameters)?

jeethu commented 7 months ago

> I also don't understand this: if the only difference between a 3B-parameter model and a 1.6B-parameter model were the addition of three bias terms (for Q, K and V), wouldn't the resulting variant be (a bit) larger, not almost half the size (in terms of parameters)?

You can have the same, or nearly the same, model architecture but a different number of transformer layers and different widths for the hidden and intermediate layers in the MLP. This is also the case with the Llama 2 models. Please look at the num_hidden_layers, hidden_size and intermediate_size fields in the config.json files of both StableLM models; they're different (the smaller model has smaller values). I found the difference between the two by loading the PyTorch models and comparing the layers. Also, unfortunately, I've been way too tied up with other things to take a look at this.
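
To make that concrete, here is a small sketch (the local checkout paths are assumptions) that prints the fields mentioned above from both config.json files:

```python
import json
from pathlib import Path

def shape_fields(config_path: str) -> dict:
    """Pick out the fields that determine model depth and width."""
    cfg = json.loads(Path(config_path).read_text())
    return {k: cfg[k] for k in ("num_hidden_layers", "hidden_size", "intermediate_size")}

# Assumed local clones of the two Hugging Face repos.
print("3B  :", shape_fields("stablelm-3b-4e1t/config.json"))
print("1.6B:", shape_fields("stablelm-2-zephyr-1_6b/config.json"))
```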

federicoparra commented 7 months ago

OK, I have sort of good news and bad news, folks (@jeethu @DavidGOrtega @junrushao):

First, the good news: if you open a Google Colab (with the high-RAM option; you need premium for that) and run this:

```
!git clone --recursive https://github.com/mlc-ai/mlc-llm.git
%pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly-cu122
%pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu122 mlc-ai-nightly-cu122
!git lfs install
!git clone https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b
!python /content/mlc-llm/build.py --model /content/stablelm-2-zephyr-1_6b/ --quantization q4f16_1 --use-safetensors --target llvm
```

This, strangely, works without any modification, at least in the sense that it produces the quantized weights and the .so binary file.

Bad news: unfortunately, the same procedure on my own Orange Pi 5, targeting Mali, does not work:

```
$ python build.py --model ../models/stablelm-2-zephyr-1_6b --quantization q4f16_1 --use-safetensors --target mali
Using path "../models/stablelm-2-zephyr-1_6b" for model "stablelm-2-zephyr-1_6b"
Target configured: opencl -keys=mali,opencl,gpu -device=mali -max_function_args=128 -max_num_threads=256 -max_shared_memory_per_block=16384 -max_threads_per_block=256 -texture_spatial_limit=16384 -thread_warp_size=1
arm_release_ver: g13p0-01eac0, rk_so_ver: 3
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
Automatically using target for weight quantization: opencl -keys=opencl,gpu -max_function_args=128 -max_num_threads=256 -max_shared_memory_per_block=16384 -max_threads_per_block=256 -texture_spatial_limit=16384 -thread_warp_size=1
Get old param:   0%|          | 0/196 [00:00<?, ?tensors/s]
Set new param:   0%|          | 0/294 [00:00<?, ?tensors/s]
Traceback (most recent call last):
  File "/home/federico/Documents/code/mlc-llm/build.py", line 4, in <module>
    main()
  File "/home/federico/Documents/code/mlc-llm/mlc_llm/build.py", line 43, in main
    core.build_model_from_args(parsed_args)
  File "/home/federico/Documents/code/mlc-llm/mlc_llm/core.py", line 902, in build_model_from_args
    params = utils.convert_weights(mod_transform, param_manager, params, args)
  File "/home/federico/Documents/code/mlc-llm/mlc_llm/utils.py", line 294, in convert_weights
    ex = relax.build(mod_transform, target=target)
  File "/home/federico/Documents/code/tvm-unity/python/tvm/relax/vm_build.py", line 336, in build
    return _vmlink(
  File "/home/federico/Documents/code/tvm-unity/python/tvm/relax/vm_build.py", line 247, in _vmlink
    lib = tvm.build(
  File "/home/federico/Documents/code/tvm-unity/python/tvm/driver/build_module.py", line 294, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
  File "/home/federico/Documents/code/tvm-unity/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/home/federico/Documents/code/tvm-unity/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/home/federico/Documents/code/tvm-unity/src/driver/driver_api.cc", line 527, in operator()
    return TIRToRuntime(inputs_arg, host_target);
  File "/home/federico/Documents/code/tvm-unity/src/driver/driver_api.cc", line 488, in tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
    auto pair = SplitMixedModule(ir_module, target, target_host);
  File "/home/federico/Documents/code/tvm-unity/src/driver/driver_api.cc", line 415, in tvm::SplitMixedModule(tvm::IRModule, tvm::Target const&, tvm::Target const&)
    mod_mixed = ApplyPasses(mod_mixed, MixedModulePassManager(mod_mixed, target));
  File "/home/federico/Documents/code/tvm-unity/src/driver/driver_api.cc", line 286, in tvm::ApplyPasses(tvm::IRModule, tvm::transform::Sequential)
    mod = seq(std::move(mod));
  File "/home/federico/Documents/code/tvm-unity/src/tir/analysis/verify_memory.cc", line 205, in operator()
    LOG(FATAL) << "RuntimeError: Memory verification failed with the following errors:\n"
tvm._ffi.base.TVMError: Traceback (most recent call last):
  4: operator()
        at /home/federico/Documents/code/tvm-unity/src/driver/driver_api.cc:527
  3: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
        at /home/federico/Documents/code/tvm-unity/src/driver/driver_api.cc:488
  2: tvm::SplitMixedModule(tvm::IRModule, tvm::Target const&, tvm::Target const&)
        at /home/federico/Documents/code/tvm-unity/src/driver/driver_api.cc:415
  1: tvm::ApplyPasses(tvm::IRModule, tvm::transform::Sequential)
        at /home/federico/Documents/code/tvm-unity/src/driver/driver_api.cc:286
  0: operator()
        at /home/federico/Documents/code/tvm-unity/src/tir/analysis/verify_memory.cc:205
  Did you forget to bind?
    Variable `scale` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments).
    Variable `A` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments).
    Variable `w_gathered` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments).
    Variable `w_gathered` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments).
    Variable `w_gathered` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments).
    Variable `scale` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments).
    Variable `A` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments).
  File "/home/federico/Documents/code/tvm-unity/src/tir/analysis/verify_memory.cc", line 205
RuntimeError: Memory verification failed with the following errors:
```

```python
from tvm.script import tir as T

@T.prim_func
def encode4(A: T.Buffer((2048, 5632), "float16"),
            w_gathered: T.Buffer((2048, 704), "uint32"),
            scale: T.Buffer((2048, 176), "float16")):
    T.func_attr({"target": T.target({"host": {"keys": ["cpu"], "kind": "llvm", "tag": ""},
                                     "keys": ["opencl", "gpu"], "kind": "opencl",
                                     "max_function_args": 128, "max_num_threads": 256,
                                     "max_shared_memory_per_block": 16384,
                                     "max_threads_per_block": 256, "tag": "",
                                     "texture_spatial_limit": 16384, "thread_warp_size": 1}),
                 "tir.is_scheduled": T.bool(True), "tir.noalias": T.bool(True)})
    max_abs_value = T.allocate([360448], "float16", "global")
    max_abs_value_1 = T.Buffer((360448,), "float16", data=max_abs_value)
    A_1 = T.Buffer((11534336,), "float16", data=A.data)
    for i, j, k in T.grid(2048, 176, 32):
        cse_var_1: T.int32 = i * 176 + j
        if k == 0:
            max_abs_value_1[cse_var_1] = T.float16(-65504)
        max_abs_value_1[cse_var_1] = T.max(max_abs_value_1[cse_var_1], T.fabs(A_1[i * 5632 + j * 32 + k]))
    scale_1 = T.Buffer((360448,), "float16", data=scale.data)
    for i, j in T.grid(2048, 176):
        cse_var_2: T.int32 = i * 176 + j
        scale_1[cse_var_2] = T.max(max_abs_value_1[cse_var_2], T.float16(0.0001)) * T.float16(0.14285714285714285)
    for i, j, k in T.grid(2048, 704, 8):
        cse_var_3: T.int32 = i * 704 + j
        w_gathered_1 = T.Buffer((1441792,), "uint32", data=w_gathered.data)
        if k == 0:
            w_gathered_1[cse_var_3] = T.uint32(0)
        w_gathered_1[cse_var_3] = T.bitwise_or(
            w_gathered_1[cse_var_3],
            T.shift_left(
                T.Cast("uint32", T.min(T.max(T.round(A_1[i * 5632 + j * 8 + k] / scale_1[i * 176 + j // 4] + T.float16(7)), T.float16(0)), T.float16(14))),
                T.Cast("uint32", k) * T.uint32(4)))
```

federicoparra commented 7 months ago

If I repeat the above procedure in Colab, using cuda as the target when building, and then, in Colab, cmake mlc-llm and run mlc_chat_cli...

```
Use MLC config: "/content/mlc-llm/build/dist/stablelm-2-zephyr-1_6b-q4f16_1/params/mlc-chat-config.json"
Use model weights: "/content/mlc-llm/build/dist/stablelm-2-zephyr-1_6b-q4f16_1/params/ndarray-cache.json"
Use model library: "/content/mlc-llm/build/dist/stablelm-2-zephyr-1_6b-q4f16_1/stablelm-2-zephyr-1_6b-q4f16_1-cuda.so"
You can use the following special commands:
  /help            print the special commands
  /exit            quit the cli
  /stats           print out the latest stats (token/sec)
  /reset           restart a fresh chat
  /reload [model]  reload model `model` from disk, or reload the current model if `model` is not specified
```

```
Loading model...
[21:37:57] /content/mlc-llm/cpp/tokenizers.cc:81: Cannot find any tokenizer under: /content/mlc-llm/build/dist/stablelm-2-zephyr-1_6b-q4f16_1/params
Stack trace:
  [bt] (0) /content/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x2c) [0x7b2b47212f1c]
  [bt] (1) /content/mlc-llm/build/mlc_chat_cli(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x3b) [0x55fa53c9fe0b]
  [bt] (2) /content/mlc-llm/build/libmlc_llm.so(+0x2bc2d6) [0x7b2b4764b2d6]
  [bt] (3) /content/mlc-llm/build/libmlc_llm.so(mlc::llm::Tokenizer::FromPath(tvm::runtime::String const&)+0x8c1) [0x7b2b4764bba1]
  [bt] (4) /content/mlc-llm/build/libmlc_llm.so(mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)+0x598) [0x7b2b475c3ad8]
  [bt] (5) /content/mlc-llm/build/libmlc_llm.so(+0x23783c) [0x7b2b475c683c]
  [bt] (6) /content/mlc-llm/build/mlc_chat_cli(+0x141cc) [0x55fa53ca31cc]
  [bt] (7) /content/mlc-llm/build/mlc_chat_cli(+0xdddf) [0x55fa53c9cddf]
  [bt] (8) /content/mlc-llm/build/mlc_chat_cli(+0x99f9) [0x55fa53c989f9]
```
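
A hedged guess at a workaround, not verified against this exact build: the error comes from `mlc::llm::Tokenizer::FromPath` not finding any tokenizer files in the params directory, so copying whatever tokenizer files the original checkpoint ships into that directory may get past it. The file names below are the usual Hugging Face ones and are assumptions here:

```python
import shutil
from pathlib import Path

# Assumed paths taken from the log above; adjust to your own layout.
src = Path("/content/stablelm-2-zephyr-1_6b")
dst = Path("/content/mlc-llm/build/dist/stablelm-2-zephyr-1_6b-q4f16_1/params")

for name in ("tokenizer.json", "tokenizer_config.json", "vocab.json", "merges.txt", "tokenizer.model"):
    f = src / name
    if f.is_file():
        shutil.copy2(f, dst / name)
        print("copied", name)
```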

jeethu commented 7 months ago

Looks like there's a PR for it now. 🎉

federicoparra commented 7 months ago

AMAZING! I'll try

federicoparra commented 7 months ago

Works great! I compiled the branch and tried it, and had a really nice conversation in Spanish! Go MLC! 👏👏👏