mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] mlc_llm chat not working: ValueError: Cannot find global var "multinomial_from_uniform1" in the Module #2418

Closed · kalradivyanshu closed this issue 5 months ago

kalradivyanshu commented 5 months ago

🐛 Bug

I ran `mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC` and it failed with `ValueError: Cannot find global var "multinomial_from_uniform1" in the Module`.

To Reproduce

Steps to reproduce the behavior:

  1. Server with an NVIDIA L4 GPU
    
    $ nvidia-smi
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA L4                      On  | 00000000:00:03.0 Off |                    0 |
    | N/A   45C    P8              16W /  72W |      4MiB / 23034MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                             |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+

cuda 12.1:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0


2. Create new conda env: `conda create --name mlc python=3.11`
3. `conda activate mlc`
4. `conda install -c conda-forge gcc libvulkan-loader`
5. `conda install -c conda-forge libgcc-ng`
6. `python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121`
7. run `mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC` (the full sequence is collected as a single shell session below)
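
For convenience, the same setup as one copy-pasteable shell session (commands taken directly from the steps above):

    conda create --name mlc python=3.11
    conda activate mlc
    conda install -c conda-forge gcc libvulkan-loader
    conda install -c conda-forge libgcc-ng
    python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121
    mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC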

<details>
  <summary>Trace</summary>
  <pre>
   $ mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
    [2024-05-26 02:59:55] INFO auto_device.py:79: Found device: cuda:0
    [2024-05-26 02:59:57] INFO auto_device.py:88: Not found device: rocm:0
    [2024-05-26 02:59:59] INFO auto_device.py:88: Not found device: metal:0
    [2024-05-26 03:00:00] INFO auto_device.py:79: Found device: vulkan:0
    [2024-05-26 03:00:00] INFO auto_device.py:79: Found device: vulkan:1
    [2024-05-26 03:00:02] INFO auto_device.py:88: Not found device: opencl:0
    [2024-05-26 03:00:02] INFO auto_device.py:35: Using device: cuda:0
    [2024-05-26 03:00:02] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
    [2024-05-26 03:00:02] INFO download.py:133: Weights already downloaded: /home/divya/.cache/mlc_llm/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
    [2024-05-26 03:00:02] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
    [2024-05-26 03:00:02] INFO jit.py:120: Compiling using commands below:
    [2024-05-26 03:00:02] INFO jit.py:121: /opt/conda/envs/mlc/bin/python -m mlc_llm compile /home/divya/.cache/mlc_llm/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=1;cudagraph=0;cutlass=1;ipc_allreduce_strategy=NONE' --overrides 'context_window_size=8192;prefill_chunk_size=1024;tensor_parallel_shards=1' --device cuda:0 --output /var/tmp/tmpmynzzd8f/lib.so
    [2024-05-26 03:00:04] INFO auto_config.py:69: Found model configuration: /home/divya/.cache/mlc_llm/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json
    [2024-05-26 03:00:04] INFO auto_target.py:84: Detecting target device: cuda:0
    [2024-05-26 03:00:04] INFO auto_target.py:86: Found target: {"thread_warp_size": 32, "arch": "sm_89", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
    [2024-05-26 03:00:04] INFO auto_target.py:103: Found host LLVM triple: x86_64-redhat-linux-gnu
    [2024-05-26 03:00:04] INFO auto_target.py:104: Found host LLVM CPU: cascadelake
    [2024-05-26 03:00:04] INFO auto_target.py:317: Generating code for CUDA architecture: sm_89
    [2024-05-26 03:00:04] INFO auto_target.py:318: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90a
    [2024-05-26 03:00:04] INFO auto_config.py:153: Found model type: llama. Use `--model-type` to override.
    Compiling with arguments:
      --config          LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=1024, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
      --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
      --model-type      llama
      --target          {"thread_warp_size": 32, "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "cascadelake", "keys": ["cpu"]}, "arch": "sm_89", "max_threads_per_block": 1024, "libs": ["thrust"], "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
      --opt             flashinfer=1;cublas_gemm=0;faster_transformer=1;cudagraph=0;cutlass=1;ipc_allreduce_strategy=NONE
      --system-lib-prefix ""
      --output          /var/tmp/tmpmynzzd8f/lib.so
      --overrides       context_window_size=8192;sliding_window_size=None;prefill_chunk_size=1024;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
    [2024-05-26 03:00:04] INFO config.py:106: Overriding context_window_size from 8192 to 8192
    [2024-05-26 03:00:04] INFO config.py:106: Overriding prefill_chunk_size from 1024 to 1024
    [2024-05-26 03:00:04] INFO config.py:106: Overriding tensor_parallel_shards from 1 to 1
    [2024-05-26 03:00:04] INFO compile.py:146: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=1024, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
    [2024-05-26 03:00:04] INFO compile.py:165: Exporting the model to TVM Unity compiler
    [2024-05-26 03:00:08] INFO compile.py:171: Running optimizations using TVM Unity
    [2024-05-26 03:00:08] INFO compile.py:191: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 1024, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0, 'kv_state_kind': 'kv_cache'}
    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/__main__.py", line 52, in <module>
        main()
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/__main__.py", line 25, in main
        cli.main(sys.argv[2:])
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/cli/compile.py", line 129, in main
        compile(
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/interface/compile.py", line 249, in compile
        _compile(args, model_config)
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/interface/compile.py", line 194, in _compile
        args.build_func(
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/support/auto_target.py", line 284, in build
        relax.build(
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/tvm/relax/vm_build.py", line 335, in build
        mod = pipeline(mod)
              ^^^^^^^^^^^^^
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/tvm/ir/transform.py", line 238, in __call__
        return _ffi_transform_api.RunPass(self, mod)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
      File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
      File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
      File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
        raise py_err
      File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/compiler_pass/pipeline.py", line 181, in _pipeline
        mod = seq(mod)
              ^^^^^^^^
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/tvm/ir/transform.py", line 238, in __call__
        return _ffi_transform_api.RunPass(self, mod)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
      File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
      File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
      File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
      File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/tvm/ir/transform.py", line 307, in _pass_func
        return inst.transform_module(mod, ctx)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/compiler_pass/attach_sampler.py", line 51, in transform_module
        mod[gv_name]
        ~~~^^^^^^^^^
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/tvm/ir/module.py", line 139, in __getitem__
        return _ffi_api.Module_Lookup_str(self, var)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
      File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
      File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
      File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
        raise py_err
    ValueError: Traceback (most recent call last):
      3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::BaseFunc (tvm::IRModule, tvm::runtime::String)>::AssignTypedLambda<tvm::__mk_TVM15::{lambda(tvm::IRModule, tvm::runtime::String)#1}>(tvm::__mk_TVM15::{lambda(tvm::IRModule, tvm::runtime::String)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
      2: tvm::IRModuleNode::Lookup(tvm::runtime::String const&) const
      1: tvm::IRModuleNode::GetGlobalVar(tvm::runtime::String const&) const
      0: _ZN3tvm7runtime6deta
      File "/workspace/tvm/src/ir/module.cc", line 177
    ValueError: Cannot find global var "multinomial_from_uniform1" in the Module
    candidates are: ["full", "apply_penalty_inplace", "batch_prefill", "sample_with_top_p", "fused_rope", "argsort_probs", "tir_kv_cache_debug_get_kv", "apply_bitmask_inplace", "renormalize_by_top_p", "prefill", "take_sorted_probs", "get_logits", "get_renorm_prob", "softmax_with_temperature", "batch_decode_to_last_hidden_states", "batch_decode_paged_kv_sliding_window", "dequantize4", "apply_logit_bias_inplace", "batch_prefill_to_last_hidden_states", "dequantize2", "softmax_with_chunked_sum", "batch_decode", "tir_kv_cache_transpose_append", "alloc_embedding_tensor", "embed", "sampler_take_probs", "top_p_pivot_cutoff", "batch_prefill_paged_kv", "decode", "batch_verify_to_last_hidden_states", "chunk_lse", "decode_to_last_hidden_states", "create_flashinfer_paged_kv_cache", "multinomial_from_uniform", "batch_verify_on_gpu_single_kernel", "merge_state_inplace", "batch_select_last_hidden_states", "batch_prefill_paged_kv_sliding_window", "sampler_take_probs_tir", "dequantize", "create_tir_paged_kv_cache", "copy_single_page", "batch_prefill_ragged_kv", "top_p_renorm_after_cutoff", "sampler_verify_draft_tokens", "batch_verify", "dequantize3", "dequantize1", "prefill_to_last_hidden_states", "index", "batch_decode_paged_kv", "get_index_from_sorted"]
    Traceback (most recent call last):
      File "/opt/conda/envs/mlc/bin/mlc_llm", line 8, in <module>
        sys.exit(main())
                ^^^^^^
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/__main__.py", line 37, in main
        cli.main(sys.argv[2:])
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/cli/chat.py", line 30, in main
        chat(
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/interface/chat.py", line 120, in chat
        engine = JSONFFIEngine(model, device, model_lib=model_lib, mode="interactive")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/json_ffi/engine.py", line 217, in __init__
        model_args = _process_model_args(models, device)[0]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/serve/engine_base.py", line 153, in _process_model_args
        model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/serve/engine_base.py", line 153, in <listcomp>
        model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/serve/engine_base.py", line 146, in _convert_model_info
        model_lib = jit.jit(
                    ^^^^^^^^
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/interface/jit.py", line 166, in jit
        _run_jit(
      File "/opt/conda/envs/mlc/lib/python3.11/site-packages/mlc_llm/interface/jit.py", line 126, in _run_jit
        raise RuntimeError("Cannot find compilation output, compilation failed")
    RuntimeError: Cannot find compilation output, compilation failed
  </pre>
</details>

## Expected behavior

Chat should run.

## Environment

 - Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA 12.1
 - Operating system (e.g. Ubuntu/Windows/MacOS/...): Debian GNU/Linux 11 (bullseye)
 - Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): Server with L4 GPU
 - How you installed MLC-LLM (`conda`, source): pip in conda env
 - How you installed TVM-Unity (`pip`, source): I am guessing it came with the pip wheel
 - Python version (e.g. 3.10): 3.11
 - GPU driver version (if applicable): cuda 12.1
 - CUDA/cuDNN version (if applicable): cuda 12.1
 - TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models): none
 - Any other relevant information:

## Additional context

Hzfengsy commented 5 months ago

Could you please try upgrading or reinstalling your wheel to dev1287?

I'm not sure why the cu121 wheel was not upgraded to the latest version the way the cu122 one was, but it should work now.
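
A minimal sketch of what that reinstall might look like, pinning the cu121 nightly to that build (the `==0.1.dev1287` pin and `--force-reinstall` are assumptions for illustration; a plain `-U` upgrade also works once the wheel index lists the new build):

    # Force-reinstall the nightly wheels at the dev1287 build from the MLC wheel index.
    python -m pip install --pre -U --force-reinstall -f https://mlc.ai/wheels \
        "mlc-llm-nightly-cu121==0.1.dev1287" mlc-ai-nightly-cu121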

0xDEADFED5 commented 5 months ago

I had the same thing with dev1287.

I had to recompile my model libs with `mlc_llm compile`, and then it works fine.

kalradivyanshu commented 5 months ago

@0xDEADFED5 can you tell me what exact steps you took to resolve it?

kalradivyanshu commented 5 months ago

@Hzfengsy same issue as @0xDEADFED5 reported. I installed https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_llm_nightly_cu121-0.1.dev1287-cp311-cp311-manylinux_2_28_x86_64.whl directly.

0xDEADFED5 commented 5 months ago

> @0xDEADFED5 can you tell me what exact steps you took to resolve it?

`mlc_llm compile -h` will show you everything you need to know. This assumes you already have a quantized model with a config file. Here's the syntax from my batch file (make sure to activate your venv first); for Linux/macOS, change backslashes to forward slashes and `.dll` to `.so` (a shell version is sketched after the serve example below):

    set src=C:\LLM\SFR-Iterative-DPO-LLaMA-3-8B-R
    set quant=q4f16_1
    set dst=%src%-MLC-%quant%
    set model=auto
    set device=auto
    mlc_llm compile --quantization %quant% --model-type %model% --device %device% -o %dst%\None-vulkan.dll %dst%\mlc-chat-config.json

Then when you run `mlc_llm chat` or `mlc_llm serve`, you have to specify the lib you just compiled. Here's my command line as an example:

    mlc_llm serve --model-lib C:\LLM\SFR-Iterative-DPO-LLaMA-3-8B-R-MLC-q4f16_1\None-vulkan.dll --mode server --speculative-mode disable C:\LLM\SFR-Iterative-DPO-LLaMA-3-8B-R-MLC-q4f16_1
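
For Linux/macOS readers, a rough shell translation of the batch file above (the paths and the output library name `model-lib.so` are illustrative placeholders, not from the original commands):

    # Rough shell equivalent of the batch file above; paths and the output
    # library name are hypothetical examples.
    src=/path/to/SFR-Iterative-DPO-LLaMA-3-8B-R
    quant=q4f16_1
    dst="${src}-MLC-${quant}"

    # Compile the model library (a .so instead of a .dll on Linux/macOS).
    mlc_llm compile --quantization "$quant" --model-type auto --device auto \
        -o "$dst/model-lib.so" "$dst/mlc-chat-config.json"

    # Serve with the freshly compiled library.
    mlc_llm serve --model-lib "$dst/model-lib.so" --mode server \
        --speculative-mode disable "$dst"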

kalradivyanshu commented 5 months ago

I can confirm what @0xDEADFED5 said: compiling the model library from scratch works and the error doesn't appear.
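
For anyone else hitting this, roughly what that fix looks like for the model from the original report (a sketch assuming the weights are already in the cache path shown in the trace, and that `mlc_llm chat` accepts the same `--model-lib` option as `mlc_llm serve`):

    # Compile a model library from the cached MLC weights (cache path taken from the trace above;
    # the output path /tmp/... is an arbitrary choice).
    mlc_llm compile --quantization q4f16_1 --model-type llama --device auto \
        -o /tmp/Llama-3-8B-Instruct-q4f16_1-cuda.so \
        ~/.cache/mlc_llm/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json

    # Point chat at the freshly built library instead of relying on the JIT cache.
    mlc_llm chat --model-lib /tmp/Llama-3-8B-Instruct-q4f16_1-cuda.so \
        ~/.cache/mlc_llm/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC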

MasterJH5574 commented 5 months ago

@kalradivyanshu thanks for confirming! Glad that works :-)