microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table

Any plans to merge the latest code of llama.cpp? #24

Open peytoncai opened 3 weeks ago

peytoncai commented 3 weeks ago

### Qwen2

```
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2854 (70c312d)
main: built with clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18) for x86_64-unknown-linux-gnu
main: seed = 1724130565
[13:09:25] /aaaa/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init
llama_model_loader: loaded meta data with 20 key-value pairs and 386 tensors from /aaaa/Qwen1.5-0.5B-Chat-GPTQ-Int4/ggml-model.in.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = qwen2
llama_model_loader: - kv  1: general.name str = Qwen1.5-0.5B-Chat-GPTQ-Int4
llama_model_loader: - kv  2: qwen2.block_count u32 = 24
llama_model_loader: - kv  3: qwen2.context_length u32 = 32768
llama_model_loader: - kv  4: qwen2.embedding_length u32 = 1024
llama_model_loader: - kv  5: qwen2.feed_forward_length u32 = 2816
llama_model_loader: - kv  6: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv  7: qwen2.attention.head_count_kv u32 = 16
llama_model_loader: - kv  8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv  9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 32
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - type f32: 217 tensors
llama_model_loader: - type f16:   1 tensors
llama_model_loader: - type  i4: 168 tensors
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model ' /aaaa/Qwen1.5-0.5B-Chat-GPTQ-Int4/ggml-model.in.gguf'
main: error: unable to load model
```
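Editorial note: the failure happens at vocabulary load time. The GGUF carries `tokenizer.ggml.pre = qwen2`, but the llama.cpp revision vendored in T-MAC (build 2854) does not recognize that pre-tokenizer name and aborts. The snippet below is only a Python illustration of that dispatch-and-fail pattern; the real check is C++ inside llama.cpp, and the list of known names here is an assumption, not the actual set in build 2854.

```python
# Illustrative Python model of how a GGUF loader validates tokenizer.ggml.pre.
# Not the actual llama.cpp code (which is C++); the name list is an assumption.
KNOWN_PRE_TOKENIZERS = {"default", "llama3", "falcon", "gpt-2", "deepseek-llm"}

def resolve_pre_tokenizer(tokenizer_pre: str) -> str:
    """Accept only pre-tokenizer names this build knows how to handle."""
    if tokenizer_pre in KNOWN_PRE_TOKENIZERS:
        return tokenizer_pre
    # A build that predates Qwen2 support ends up here:
    raise ValueError(f"unknown pre-tokenizer type: '{tokenizer_pre}'")

try:
    resolve_pre_tokenizer("qwen2")
except ValueError as err:
    print(err)  # unknown pre-tokenizer type: 'qwen2'
```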

### gemma2

```
Running STEP.0: Compile kernels
Running command in /aaaa/T-MAC/deploy: python compile.py -o tuned -da -nt 4 -tb -gc -gs 128 -ags 64 -t -m gptq-auto -md /aaaa/gemma-2-9b-it-gptq-4bit
Running STEP.1: Build T-MAC C++ CMakeFiles
Running command in /aaaa/T-MAC/build: cmake -DCMAKE_INSTALL_PREFIX=/aaaa/T-MAC/install ..
Running STEP.2: Install T-MAC C++
Running command in /aaaa/T-MAC/build: cmake --build . --target install --config Release
Running STEP.3: Convert HF to GGUF
Running command in /aaaa/T-MAC/3rdparty/llama.cpp: python convert-hf-to-gguf-t-mac.py /aaaa/gemma-2-9b-it-gptq-4bit --outtype in --outfile /aaaa/gemma-2-9b-it-gptq-4bit/ggml-model.in.gguf --kcfg /aaaa/T-MAC/install/lib/kcfg.ini
Please check logs/2024-08-20-15-29-20.log for what's wrong

(tmac) root@4c5e2a287200:/aaaa/T-MAC# cat logs/2024-08-20-15-29-20.log
INFO:hf-to-gguf:Loading model: gemma-2-9b-it-gptq-4bit
Traceback (most recent call last):
  File "convert-hf-to-gguf-t-mac.py", line 3421, in <module>
    main()
  File "convert-hf-to-gguf-t-mac.py", line 3399, in main
    model_class = Model.from_model_architecture(hparams["architectures"][0])
  File "convert-hf-to-gguf-t-mac.py", line 318, in from_model_architecture
    raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'Gemma2ForCausalLM' not supported!
```
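Editorial note: the converter picks a handler class from the `architectures` field in the model's `config.json`, and only architectures registered in the vendored script are supported; `Gemma2ForCausalLM` is not among them. Below is a minimal Python sketch of the registry pattern implied by the traceback; the exact class layout in `convert-hf-to-gguf-t-mac.py` is an assumption.

```python
# Minimal sketch of the architecture registry implied by the traceback above.
# The real convert-hf-to-gguf-t-mac.py is larger; treat details as assumptions.
class Model:
    _model_classes: dict[str, type] = {}

    @classmethod
    def register(cls, *names: str):
        def wrapper(model_cls: type) -> type:
            for name in names:
                cls._model_classes[name] = model_cls
            return model_cls
        return wrapper

    @classmethod
    def from_model_architecture(cls, arch: str) -> type:
        try:
            return cls._model_classes[arch]
        except KeyError:
            raise NotImplementedError(f"Architecture {arch!r} not supported!") from None

@Model.register("GemmaForCausalLM")  # Gemma 1 is registered ...
class GemmaModel(Model):
    pass

try:
    Model.from_model_architecture("Gemma2ForCausalLM")  # ... Gemma 2 is not
except NotImplementedError as err:
    print(err)  # Architecture 'Gemma2ForCausalLM' not supported!
```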

### Tasks
- [ ] Update llama.cpp
kaleid-liner commented 3 weeks ago

We are working on it. llama.cpp is evolving very fast, with a lot of refactoring here and there, so the update won't be quick.