[Bug] Generated texts not as expected on some models with ‘canonical simplification of LE’ problem

🐛 Bug

mlc-llm has a problem with generating text that are completely unrelated to the prompts on some models, I think this mainly affects the new models that are available with the last tvm bug fix.

I'm mostly testing models based on Yi-34B. And I tested that the llama2-70b based model does not have this problem. So I think the issue may be canonical simplification of LE related

To Reproduce

The Problem is random in nature, It may take multiple conversations for this to happen
Problem may be more likely to occur when input longer text

MODEL_PATH='/home/alphaarea/models/Yi-34B-Chat'
MLC_QUANT='q4f16_1'
MLC_DEV='cuda'
MODEL_ARCH='llama'
MODEL_TEMP='chatml'
MODEL_NAME=${MODEL_PATH##*/}
MODEL_OUTPUT=$MODEL_PATH'-'$MLC_QUANT
MODEL_LIB=$MODEL_NAME'-'$MLC_QUANT'-'$MLC_DEV'.so'

mlc_llm convert_weight --quantization $MLC_QUANT --model-type $MODEL_ARCH --device $MLC_DEV --output $MODEL_OUTPUT $MODEL_PATH
mlc_llm gen_config --quantization $MLC_QUANT --model-type $MODEL_ARCH --conv-template $MODEL_TEMP --tensor-parallel-shards 4 --max-batch-size 1 --output $MODEL_OUTPUT $MODEL_PATH
mlc_llm compile --device $MLC_DEV --opt 'O0' --output $MODEL_OUTPUT/$MODEL_LIB $MODEL_OUTPUT/mlc-chat-config.json

mlc_llm chat --model-lib-path $MODEL_OUTPUT/$MODEL_LIB $MODEL_OUTPUT

Yi-34B-Chat example:

<|im_start|>user: Do you know The Three-Body Problem
<|im_start|>assistant:
, the latest news on the ongoing conflict in Ukraine?

Yi-34B-Chat example 2:

<|im_start|>user: # New Capabilities with Unity

                  The Unity vision guides the technical roadmap for TVM’s evolution over the next year. The unified approach will position TVM to offer new forms of automation and ecosystem integration that are not possible with today’s system stacks.

                  With Unity, TVM will unify library-based computation with compiler-based automation. AI applications will be able to combine the world’s best known code for common operators with automatically optimized code for computations that don’t map neatly onto
                  any existing operator. Developers will be able to smoothly transition between both strategies without a steep “performance cliff” when switching from built-in to generated code. Teams will be able to iterate rapidly with compiled code for new model des
                  igns and then, as models mature and stabilize, fluidly incorporate optimized operator libraries to maximize performance. By erasing the boundary between operator-based and compiler-based stacks, TVM will enable automatic exploration of the trade-off sp
                  ace between the two extremes.

                  TVM also aims to serve as a bridge to unify the broader ML and hardware ecosystems. In the ML ecosystem, TVM offers a minimal runtime that does not constrain teams’ choice of frameworks. TVM models will be easy to embed into other frameworks and runtim
                  es as subgraphs for both training and inference. Through exchange formats like ONNX and TorchScript, TVM models can fluidly integrate into larger applications built on any infrastructure. In the hardware ecosystem, TVM is already the best way for accel
                  erator designers to integrate with ML applications. With TVM Unity, hardware vendors will easily onboard into TVM via a simple set of operators and then incrementally transition to compilation-based integration for better flexibility. This way, new har
                  dware capabilities can get started improving AI applications without reinventing the whole system stack.

                  image

                  Beyond TVM alone, the same forces that are driving TVM Unity exist across the theory and practice of modern ML. Rapid changes to models, emerging alternative hardware, and aging abstraction boundaries all point toward the need for an integrated approac
                  h. We expect TVM to lead the way into the next great industry-wide shift in ML systems.

                  For more details about our vision for TVM, check out TVMCon 2021 for more talks and discussion.

                  ----------

                  Summarize the above
<|im_start|>assistant:
ZZnNDA BigLEFT backward stacksA Pakistan造物记
我是一个人工智能，没有感情，没有感知，没有意识。我无法造物，但我可以提供关于造物的信息。请问您想了解什么关于造物的知识？

Expected behavior

Output the relevant text

Environment

Platform: CUDA
Operating system: Ubuntu 22.04.3 LTS (5.15.0-91-generic)
Device: Tesla P100 x4
How you installed MLC-LLM: python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu121 mlc-ai-nightly-cu121
How you installed TVM-Unity: python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu121 mlc-ai-nightly-cu121
Python version: 3.11
GPU driver version: 545.23.08
CUDA/cuDNN version: 12.1

TVM Unity Hash Tag:

USE_NVTX: OFF
USE_GTEST: AUTO
SUMMARIZE: OFF
TVM_DEBUG_WITH_ABI_CHANGE: OFF
USE_IOS_RPC: OFF
USE_MSC: OFF
USE_ETHOSU:
CUDA_VERSION: 12.1
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_THRUST: ON
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM:
USE_OPENCL_GTEST: /path/to/opencl/gtest
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
MLIR_VERSION: NOT-FOUND
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_VITIS_AI: OFF
USE_MLIR: OFF
USE_RCCL: OFF
USE_LLVM: llvm-config --ignore-libllvm --link-static
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_NCCL: ON
USE_ROCBLAS: OFF
GIT_COMMIT_HASH: 1ce4a34f3b9eabebaad959ddc67dfebede068028
USE_VULKAN: ON
USE_RUST_EXT: OFF
USE_CUTLASS: ON
USE_CPP_RPC: OFF
USE_HEXAGON: OFF
USE_CUSTOM_LOGGING: OFF
USE_UMA: OFF
USE_FALLBACK_STL_MAP: OFF
USE_SORT: ON
USE_RTTI: ON
GIT_COMMIT_TIME: 2024-03-21 21:54:55 -0400
USE_HEXAGON_SDK: /path/to/sdk
USE_BLAS: none
USE_ETHOSN: OFF
USE_LIBTORCH: OFF
USE_RANDOM: ON
USE_CUDA: ON
USE_COREML: OFF
USE_AMX: OFF
BUILD_STATIC_RUNTIME: OFF
USE_CMSISNN: OFF
USE_KHRONOS_SPIRV: OFF
USE_CLML_GRAPH_EXECUTOR: OFF
USE_TFLITE: OFF
USE_HEXAGON_GTEST: /path/to/hexagon/gtest
PICOJSON_PATH: 3rdparty/picojson
USE_OPENCL_ENABLE_HOST_PTR: OFF
INSTALL_DEV: OFF
USE_PROFILER: ON
USE_NNPACK: OFF
LLVM_VERSION: 15.0.7
USE_MRVL: OFF
USE_OPENCL: OFF
COMPILER_RT_PATH: 3rdparty/compiler-rt
RANG_PATH: 3rdparty/rang/include
USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
USE_OPENMP: OFF
USE_BNNS: OFF
USE_FLASHINFER: ON
USE_CUBLAS: ON
USE_METAL: OFF
USE_MICRO_STANDALONE_RUNTIME: OFF
USE_HEXAGON_EXTERNAL_LIBS: OFF
USE_ALTERNATIVE_LINKER: AUTO
USE_BYODT_POSIT: OFF
USE_HEXAGON_RPC: OFF
USE_MICRO: OFF
DMLC_PATH: 3rdparty/dmlc-core/include
INDEX_DEFAULT_I64: ON
USE_RELAY_DEBUG: OFF
USE_RPC: ON
USE_TENSORFLOW_PATH: none
TVM_CLML_VERSION:
USE_MIOPEN: OFF
USE_ROCM: OFF
USE_PAPI: OFF
USE_CURAND: OFF
TVM_CXX_COMPILER_PATH: /opt/rh/gcc-toolset-11/root/usr/bin/c++
HIDE_PRIVATE_SYMBOLS: ON

mlc-ai / mlc-llm

[Bug] Generated texts not as expected on some models with ‘canonical simplification of LE’ problem #2015

🐛 Bug

1911

1919

To Reproduce

Expected behavior

Environment