mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] how to run Llama-3.1-Minitron-4B-Width-Base #2820

Closed huanglizhuo closed 2 weeks ago

huanglizhuo commented 3 weeks ago

❓ General Questions

I am trying to run Llama-3.1-Minitron-4B-Width-Base. In its README they mention:

Pull requests to support this model in Hugging Face Transformers are currently under review (#32495 and #32502) and are expected to be merged soon. In the meantime, please follow the installation instructions below:

Fetch PR 32502

$ git clone -b suhara/llama-kv-channels --single-branch https://github.com/suhara/transformers.git && cd transformers

Fetch changes from PR 32495

$ git fetch https://github.com/suiyoubi/transformers.git aot/head_dim_rope && git cherry-pick FETCH_HEAD --strategy-option theirs

Install transformers

$ pip install -e .
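
After installing the patched transformers, a quick sanity check along these lines (a rough sketch; it assumes the model card's repo id and that the PRs expose a `head_dim` config field) shows whether the custom head dimension is picked up:

    from transformers import AutoConfig

    # Load the model's config and see how head_dim relates to hidden_size.
    cfg = AutoConfig.from_pretrained("nvidia/Llama-3.1-Minitron-4B-Width-Base")
    print(cfg.hidden_size, cfg.num_attention_heads, getattr(cfg, "head_dim", None))
    # With a custom head_dim, head_dim * num_attention_heads need not equal hidden_size.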



After checking the PRs mentioned above, I found that `head_dim` is already supported by mlc-llm, and it looks like the `assert self.head_dim * self.num_attention_heads == self.hidden_size` in [llama_model.py](https://github.com/mlc-ai/mlc-llm/blob/0c0c7a60b452c708e1f9f95f5d46d07a17dd4296/python/mlc_llm/model/llama/llama_model.py#L87) is not required. So I did the following steps:

1. removed `assert self.head_dim * self.num_attention_heads == self.hidden_size` from `llama_model.py`
2. built mlc_llm from source and followed the steps here: https://llm.mlc.ai/docs/compilation/compile_models.html#compile-model-libraries to convert the weights and compile the model library
3. verified the output and chatted with the model, but the chat responses are nonsense, sometimes mixing different languages with nonsensical content.

I think there must be some misunderstanding on my side. Can anyone give me a hint about which direction I should check in order to run the `Llama-3.1-Minitron-4B-Width-Base` model?

suhara commented 3 weeks ago

@huanglizhuo

I think you are referring to this block: https://github.com/suhara/transformers/blob/e0af55227f022c535b5e71ebc89257956cede8bf/src/transformers/models/llama/modeling_llama.py#L350-L354

        if self.head_dim is None and (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )

Because of L343, which defines self.head_dim, this block will never be triggered. I'll remove it from the branch, but it should work even with this block in place.
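
For reference, the pattern the PR uses to resolve `head_dim` can be sketched roughly as follows (simplified, not the exact transformers code; the numeric values are only illustrative):

    from types import SimpleNamespace

    # Simplified sketch of the head_dim resolution pattern from the PR:
    # fall back to hidden_size // num_attention_heads only when the config
    # does not define a custom head_dim.
    def resolve_head_dim(config):
        head_dim = getattr(config, "head_dim", None)
        if head_dim is None:
            head_dim = config.hidden_size // config.num_attention_heads
        return head_dim

    # Illustrative values: with an explicit head_dim, the old divisibility
    # check (head_dim * num_heads == hidden_size) no longer has to hold.
    cfg = SimpleNamespace(hidden_size=3072, num_attention_heads=32, head_dim=128)
    print(resolve_head_dim(cfg))  # -> 128, even though 128 * 32 != 3072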

FYI, PR #32495 has now been merged.

  1. verified the output and chatted with the model, but the chat responses are nonsense, sometimes mixing different languages with nonsensical content.

The model is a Base model, not an instruct model; it may still have only minimal conversational ability. Please post a question/request on the HF model hub page: https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base

huanglizhuo commented 3 weeks ago

@suhara thank you for checking, but the mlc-llm codebase has similar code here: https://github.com/mlc-ai/mlc-llm/blob/0c0c7a60b452c708e1f9f95f5d46d07a17dd4296/python/mlc_llm/model/llama/llama_model.py#L87

Based on the PR you mention at https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base, `self.head_dim * self.num_attention_heads` will not equal `self.hidden_size`. Is my understanding correct, or did I miss something?

suhara commented 3 weeks ago

@huanglizhuo Your understanding is correct. The custom head_dim should be supported on the MLC side as well. Each inference engine (e.g., HF, Llama.cpp, MLC) should support the architecture.

I'm not familiar with MLC at all, but do you think you can make the necessary changes?

FYI, you can refer to the PR for HF. https://github.com/huggingface/transformers/pull/32502/files

huanglizhuo commented 3 weeks ago

@suhara Thank you for confirming. I checked your HF PR, and `head_dim` is already supported by MLC; the only blocker seems to be the line below: https://github.com/mlc-ai/mlc-llm/blob/0c0c7a60b452c708e1f9f95f5d46d07a17dd4296/python/mlc_llm/model/llama/llama_model.py#L87

I removed it as I mentioned above, but the chat output is nonsense. Let me read the MLC code more carefully and see if I can find what I missed.
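
For reference, a hypothetical sketch of the kind of change the MLC config would need, mirroring the HF PR (the actual `LlamaConfig` in `llama_model.py` has more fields and may differ):

    import dataclasses

    # Hypothetical sketch of an MLC-style Llama config that accepts a custom
    # head_dim: keep the classic fallback, but drop the strict
    # `head_dim * num_attention_heads == hidden_size` assert, which
    # width-pruned models like Minitron-4B-Width do not satisfy.
    @dataclasses.dataclass
    class LlamaConfigSketch:
        hidden_size: int
        num_attention_heads: int
        head_dim: int = 0  # 0 means "not specified in config.json"

        def __post_init__(self):
            if self.head_dim == 0:
                # Derive head_dim only when the config does not provide one.
                self.head_dim = self.hidden_size // self.num_attention_heads

    # Illustrative values only:
    print(LlamaConfigSketch(hidden_size=3072, num_attention_heads=32, head_dim=128))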

huanglizhuo commented 3 weeks ago

--quantization QUANTIZATION_MODE The quantization mode we use to compile. See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq. We encourage you to use 4-bit quantization, as the text generated by 3-bit quantized models may have bad quality depending on the model.

By the way, when I do the weight conversion I use `--quantization q4f16_1`. Is it possible that the quantization causes the model's output text to be nonsense?

huanglizhuo commented 3 weeks ago

I will try converting the weights without `--quantization q4f16_1` and see if there is any difference.

huanglizhuo commented 3 weeks ago

Tried with `--quantization q4f32_1`, and it still gives nonsense responses 😢
There must be something missing. I will try to read more of the MLC source code.

YiyanZhai commented 2 weeks ago

Hi @huanglizhuo, thank you for bringing this issue to our attention.

The removal of `assert self.head_dim * self.num_attention_heads == self.hidden_size` from `llama_model.py` is now covered in PR #2848.

However, we've encountered a similar observation when running inference on the Llama-3.1-Minitron-4B-Width-Base model using Hugging Face's transformers library directly:

[Screenshot 2024-08-23 at 21:34:14: output from running inference with transformers directly]

The model appears to produce reasonable output initially, but after generating a certain number of tokens the output begins to lose coherence. This behavior suggests that while removing the assertion allows the model to run, it is not optimized for, or possibly not intended for, open-ended text generation or conversational tasks.
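
For anyone who wants to run the same HF-side check, a rough sketch (not necessarily the exact setup used here; the prompt and generation settings are arbitrary):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Rough sketch of running the base model directly with transformers to
    # compare against the MLC output.
    model_id = "nvidia/Llama-3.1-Minitron-4B-Width-Base"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))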

huanglizhuo commented 2 weeks ago

@YiyanZhai thank you for the update. Then the issue is actually due to:

The model is a Base model, not an instruct model; it may still have only minimal conversational ability. Please post a question/request on the HF model hub page.