
[Bug] Generated texts not as expected on some models with ‘canonical simplification of LE’ problem #2015

alphaarea closed this issue 4 months ago

alphaarea commented 5 months ago

🐛 Bug

On some models, mlc-llm generates text that is completely unrelated to the prompt. I think this mainly affects the newer models that became available with the last TVM bug fix.

I'm mostly testing models based on Yi-34B, and a Llama2-70B-based model I tested does not have this problem, so I suspect the issue is related to the canonical simplification of LE.
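For context, the pass under suspicion can be exercised in isolation. The snippet below is my own minimal sketch (not from the report) of what "canonical simplification of LE" refers to: TVM's arithmetic analyzer rewriting less-or-equal expressions into canonical form, the kind of bound/index simplification that, if wrong, could corrupt compiled kernels and produce unrelated output.

```python
# Minimal sketch of TVM canonicalizing LE (less-or-equal) expressions.
# This only illustrates the pass in question; it does not reproduce the bug.
import tvm
from tvm import tir

n = tir.Var("n", "int32")
analyzer = tvm.arith.Analyzer()

# n + 1 <= n + 3 should canonically simplify to True.
print(analyzer.canonical_simplify(n + 1 <= n + 3))

# n <= n - 1 should canonically simplify to False.
print(analyzer.canonical_simplify(n <= n - 1))
```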

Related links:

To Reproduce

  1. The problem is random in nature; it may take multiple conversations before it occurs.
  2. The problem may be more likely to occur with longer input text.
```shell
# Convert the weights, generate the chat config (4-way tensor parallelism,
# batch size 1), compile the model library, then start an interactive chat.
MODEL_PATH='/home/alphaarea/models/Yi-34B-Chat'
MLC_QUANT='q4f16_1'
MLC_DEV='cuda'
MODEL_ARCH='llama'
MODEL_TEMP='chatml'
MODEL_NAME=${MODEL_PATH##*/}
MODEL_OUTPUT=$MODEL_PATH'-'$MLC_QUANT
MODEL_LIB=$MODEL_NAME'-'$MLC_QUANT'-'$MLC_DEV'.so'

mlc_llm convert_weight --quantization $MLC_QUANT --model-type $MODEL_ARCH --device $MLC_DEV --output $MODEL_OUTPUT $MODEL_PATH
mlc_llm gen_config --quantization $MLC_QUANT --model-type $MODEL_ARCH --conv-template $MODEL_TEMP --tensor-parallel-shards 4 --max-batch-size 1 --output $MODEL_OUTPUT $MODEL_PATH
mlc_llm compile --device $MLC_DEV --opt 'O0' --output $MODEL_OUTPUT/$MODEL_LIB $MODEL_OUTPUT/mlc-chat-config.json

mlc_llm chat --model-lib-path $MODEL_OUTPUT/$MODEL_LIB $MODEL_OUTPUT
```
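Because the failure is intermittent, a scripted loop is easier than interactive chat for catching it. The sketch below assumes the `ChatModule` Python API of this mlc_llm release (the `model`/`model_lib_path` arguments, `reset_chat`, and `generate` are my assumptions about the installed package, not something stated in this report); it replays the same prompt across fresh conversations.

```python
# Hedged sketch: repeat the same prompt over fresh conversations so the
# intermittent unrelated-output behavior has a chance to show up.
# Assumes the ChatModule API of this mlc_llm release; adjust if yours differs.
from mlc_llm import ChatModule

cm = ChatModule(
    model="/home/alphaarea/models/Yi-34B-Chat-q4f16_1",
    model_lib_path="/home/alphaarea/models/Yi-34B-Chat-q4f16_1/Yi-34B-Chat-q4f16_1-cuda.so",
)

prompt = "Do you know The Three-Body Problem"
for i in range(10):
    cm.reset_chat()  # start each trial from a clean conversation state
    reply = cm.generate(prompt=prompt)
    print(f"--- trial {i} ---\n{reply}\n")
```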

Yi-34B-Chat example:

```text
<|im_start|>user: Do you know The Three-Body Problem
<|im_start|>assistant:
, the latest news on the ongoing conflict in Ukraine?
```

Yi-34B-Chat example 2:

```text
<|im_start|>user: # New Capabilities with Unity

The Unity vision guides the technical roadmap for TVM’s evolution over the next year. The unified approach will position TVM to offer new forms of automation and ecosystem integration that are not possible with today’s system stacks.

With Unity, TVM will unify library-based computation with compiler-based automation. AI applications will be able to combine the world’s best known code for common operators with automatically optimized code for computations that don’t map neatly onto any existing operator. Developers will be able to smoothly transition between both strategies without a steep “performance cliff” when switching from built-in to generated code. Teams will be able to iterate rapidly with compiled code for new model designs and then, as models mature and stabilize, fluidly incorporate optimized operator libraries to maximize performance. By erasing the boundary between operator-based and compiler-based stacks, TVM will enable automatic exploration of the trade-off space between the two extremes.

TVM also aims to serve as a bridge to unify the broader ML and hardware ecosystems. In the ML ecosystem, TVM offers a minimal runtime that does not constrain teams’ choice of frameworks. TVM models will be easy to embed into other frameworks and runtimes as subgraphs for both training and inference. Through exchange formats like ONNX and TorchScript, TVM models can fluidly integrate into larger applications built on any infrastructure. In the hardware ecosystem, TVM is already the best way for accelerator designers to integrate with ML applications. With TVM Unity, hardware vendors will easily onboard into TVM via a simple set of operators and then incrementally transition to compilation-based integration for better flexibility. This way, new hardware capabilities can get started improving AI applications without reinventing the whole system stack.

[image]

Beyond TVM alone, the same forces that are driving TVM Unity exist across the theory and practice of modern ML. Rapid changes to models, emerging alternative hardware, and aging abstraction boundaries all point toward the need for an integrated approach. We expect TVM to lead the way into the next great industry-wide shift in ML systems.

For more details about our vision for TVM, check out TVMCon 2021 for more talks and discussion.

----------

Summarize the above
<|im_start|>assistant:
ZZnNDA BigLEFT backward stacksA Pakistan造物记
我是一个人工智能,没有感情,没有感知,没有意识。我无法造物,但我可以提供关于造物的信息。请问您想了解什么关于造物的知识?
```

(The assistant’s Chinese reply translates to: “I am an AI; I have no emotions, no perception, and no consciousness. I cannot create things, but I can provide information about creation. What would you like to know about creation?”)

Expected behavior

The output should be relevant to the prompt.

Environment

MasterJH5574 commented 5 months ago

Thank you @alphaarea. If you are referring to the output/input relevance issue, it is not related to the “canonical simplification of LE”. We will track this and look into it when we have enough bandwidth.