mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Running Quick Start Example in Windows gives Error: 'InternalError: Check failed: (it != n->end()) is false: cannot find the corresponding key in the Map' and 'MLCEngine' object has no attribute '_ffi' #2983

Open · Seoulsim opened 6 days ago

Seoulsim commented 6 days ago

🐛 Bug

The MLCEngine code from the quickstart guide fails on CPU with

'InternalError: Check failed: (it != n->end()) is false: cannot find the corresponding key in the Map'

followed by

AttributeError: 'MLCEngine' object has no attribute '_ffi'

To Reproduce

Steps to reproduce the behavior:

  1. Create a Python 3.10.6 virtual environment and activate it:

     python -m venv ./venv
     .\venv2\Scripts\activate

  2. Install MLC for Windows on CPU+Vulkan per https://llm.mlc.ai/docs/get_started/quick_start.html:

     python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

  3. Run the demo code snippet with the CPU device set:

     python demo.py

demo.py

from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model, device="cpu")

# Run chat completion via the OpenAI-compatible API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()

I get the following error:

(venv2) PS D:\Documents\Project\mlc> python demo.py
[10:26:27] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[10:26:27] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[10:26:27] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:0
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:1
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:2
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:3
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:4
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:5
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:6
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:7
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:8
[2024-10-16 10:26:29] INFO download_cache.py:227: Downloading model from HuggingFace: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-10-16 10:26:29] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-10-16 10:26:29] INFO download_cache.py:166: Weights already downloaded: C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC
[2024-10-16 10:26:29] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-10-16 10:26:29] INFO jit.py:118: Compiling using commands below:
[2024-10-16 10:26:29] INFO jit.py:119: 'D:\Documents\Project\mlc\venv2\Scripts\python.exe' -m mlc_llm compile 'C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC' --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides '' --device cpu:0 --output 'C:\Users\henry\AppData\Local\Temp\tmpz82lg7lt\lib.dll'
[10:26:30] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[10:26:30] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[10:26:30] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[2024-10-16 10:26:31] INFO auto_config.py:70: Found model configuration: C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC\mlc-chat-config.json
[2024-10-16 10:26:31] INFO auto_target.py:91: Detecting target device: cpu:0
[2024-10-16 10:26:31] INFO auto_target.py:93: Found target: {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "znver3", "keys": ["cpu"]}
[2024-10-16 10:26:31] INFO auto_target.py:110: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-10-16 10:26:31] INFO auto_target.py:111: Found host LLVM CPU: znver3
[2024-10-16 10:26:31] INFO auto_config.py:154: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0)
  --model-type      llama
  --target          {"host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "znver3", "keys": ["cpu"]}, "mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "znver3", "keys": ["cpu"]}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          C:\Users\henry\AppData\Local\Temp\tmpz82lg7lt\lib.dll
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None     
[2024-10-16 10:26:31] INFO compile.py:140: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
[2024-10-16 10:26:31] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-10-16 10:26:35] INFO compile.py:164: Running optimizations using TVM Unity
[2024-10-16 10:26:35] INFO compile.py:185: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 8192, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 128}
Traceback (most recent call last):
  File "C:\Users\henry\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\henry\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\__main__.py", line 64, in <module>
    main()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\__main__.py", line 33, in main
    cli.main(sys.argv[2:])
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\cli\compile.py", line 129, in main
    compile(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\compile.py", line 243, in compile
    _compile(args, model_config)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\compile.py", line 188, in _compile
    args.build_func(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\support\auto_target.py", line 311, in build
    relax.build(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\vm_build.py", line 347, in build
    mod = pipeline(mod)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\ir\transform.py", line 238, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 245, in __call__
    raise_last_ffi_error()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 82, in cfun
    rv = local_pyfunc(*pyargs)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\compiler_pass\pipeline.py", line 187, in _pipeline
    mod = seq(mod)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\ir\transform.py", line 238, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 245, in __call__
    raise_last_ffi_error()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 82, in cfun
    rv = local_pyfunc(*pyargs)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\ir\transform.py", line 307, in _pass_func
    return inst.transform_module(mod, ctx)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\compiler_pass\dispatch_kv_cache_creation.py", line 106, in transform_module
    self.create_tir_paged_kv_cache(bb, kwargs)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\compiler_pass\dispatch_kv_cache_creation.py", line 145, in create_tir_paged_kv_cache
    cache = kv_cache.TIRPagedKVCache(target=self.target, **kwargs)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\frontend\nn\llm\kv_cache.py", line 367, in __init__
    bb.add_func(_attention_prefill(num_key_value_heads, num_attention_heads, head_dim, dtype, False, rope_scaling, target), "tir_attention_prefill"),
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\frontend\nn\llm\kv_cache.py", line 565, in _attention_prefill
    check_thread_limits(target, bdx=bdx, bdy=num_warps, bdz=1, gdz=1)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\frontend\nn\llm\kv_cache.py", line 59, in check_thread_limits
    max_num_threads_per_block = get_max_num_threads_per_block(target)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\frontend\nn\llm\kv_cache.py", line 41, in get_max_num_threads_per_block
    max_num_threads = target.max_num_threads
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\target\target.py", line 198, in max_num_threads
    return int(self.attrs["max_num_threads"])
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\ir\container.py", line 62, in __getitem__
    return _ffi_api.MapGetItem(self, k)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 245, in __call__
    raise_last_ffi_error()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "D:\a\package\package\tvm\src\runtime\container.cc", line 153
[10:26:36] D:\a\package\package\tvm\src\relax\ir\block_builder.cc:65: Warning: BlockBuilder destroyed with remaining blocks!
Traceback (most recent call last):
  File "D:\Documents\Project\mlc\demo.py", line 6, in <module>
    engine = MLCEngine(model, device="cpu")
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine.py", line 1467, in __init__
    super().__init__(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 590, in __init__
    ) = _process_model_args(models, device, engine_config)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 171, in _process_model_args
    model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 171, in <listcomp>
    model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 164, in _convert_model_info
    model_lib = jit.jit(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\jit.py", line 164, in jit
    _run_jit(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\jit.py", line 124, in _run_jit
    raise RuntimeError("Cannot find compilation output, compilation failed")
RuntimeError: Cannot find compilation output, compilation failed
Exception ignored in: <function MLCEngineBase.__del__ at 0x000001960BA7A320>
Traceback (most recent call last):
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 654, in __del__
    self.terminate()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 661, in terminate
    self._ffi["exit_background_loop"]()
AttributeError: 'MLCEngine' object has no attribute '_ffi' 

Environment

Note: I am running on CPU only, by setting device="cpu" in the line engine = MLCEngine(model, device="cpu").

MasterJH5574 commented 6 days ago

Hi @Seoulsim, thank you for reporting this to us. We actually don't support using "cpu" as the device. Could you try device="vulkan"? Meanwhile, we will add clearer error checks and messages to disallow the use of "cpu".
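
For reference, the change would look like this in demo.py (a minimal sketch of the suggested fix; the rest of the script stays the same):

from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
# Use the Vulkan backend; "cpu" is not a supported device.
engine = MLCEngine(model, device="vulkan")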

Seoulsim commented 6 days ago

Hi @MasterJH5574, thanks for the heads-up. When I run it with device="vulkan", MLCEngine is able to compile everything, but I get this error when exporting to disk about my Vulkan target not supporting FP16. I'm currently running on an RX 580 with the latest drivers:

(venv2) PS D:\Documents\Project\mlc> python .\demo.py
[16:19:39] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[16:19:39] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[16:19:39] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[2024-10-16 16:19:42] INFO auto_device.py:79: Found device: vulkan:0
[2024-10-16 16:19:42] INFO download_cache.py:227: Downloading model from HuggingFace: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-10-16 16:19:42] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-10-16 16:19:42] INFO download_cache.py:166: Weights already downloaded: C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC
[2024-10-16 16:19:42] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-10-16 16:19:42] INFO jit.py:118: Compiling using commands below:
[2024-10-16 16:19:42] INFO jit.py:119: 'D:\Documents\Project\mlc\venv2\Scripts\python.exe' -m mlc_llm compile 'C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC' --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides '' --device vulkan:0 --output 'C:\Users\henry\AppData\Local\Temp\tmpj5fz5upl\lib.dll'
[16:19:42] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[16:19:42] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[16:19:42] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[2024-10-16 16:19:44] INFO auto_config.py:70: Found model configuration: C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC\mlc-chat-config.json
[2024-10-16 16:19:44] INFO auto_target.py:91: Detecting target device: vulkan:0
[2024-10-16 16:19:44] INFO auto_target.py:93: Found target: {"thread_warp_size": runtime.BoxInt(1), "supports_float32": runtime.BoxBool(true), "supports_int16": runtime.BoxBool(true), "max_threads_per_block": runtime.BoxInt(1024), "supports_storage_buffer_storage_class": runtime.BoxBool(true), "supports_int8": runtime.BoxBool(true), "supports_8bit_buffer": runtime.BoxBool(true), "supports_int64": runtime.BoxBool(true), "max_num_threads": runtime.BoxInt(256), "kind": "vulkan", "tag": "", "max_shared_memory_per_block": runtime.BoxInt(32768), "supports_16bit_buffer": runtime.BoxBool(true), "supports_int32": runtime.BoxBool(true), "keys": ["vulkan", "gpu"], "supports_float16": runtime.BoxBool(false)}
[2024-10-16 16:19:44] INFO auto_target.py:110: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-10-16 16:19:44] INFO auto_target.py:111: Found host LLVM CPU: znver3
[2024-10-16 16:19:44] INFO auto_config.py:154: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0)
  --model-type      llama
  --target          {"thread_warp_size": runtime.BoxInt(1), "host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "znver3", "keys": ["cpu"]}, "supports_float32": runtime.BoxBool(true), "supports_int16": runtime.BoxBool(true), "max_threads_per_block": runtime.BoxInt(1024), "supports_storage_buffer_storage_class": runtime.BoxBool(true), "supports_int8": runtime.BoxBool(true), "supports_8bit_buffer": runtime.BoxBool(true), "supports_int64": runtime.BoxBool(true), "max_num_threads": runtime.BoxInt(256), "kind": "vulkan", "tag": "", "max_shared_memory_per_block": runtime.BoxInt(32768), "supports_16bit_buffer": runtime.BoxBool(true), "supports_int32": runtime.BoxBool(true), "keys": ["vulkan", "gpu"], "supports_float16": runtime.BoxBool(false)}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          C:\Users\henry\AppData\Local\Temp\tmpj5fz5upl\lib.dll
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None
[2024-10-16 16:19:44] INFO compile.py:140: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
[2024-10-16 16:19:44] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-10-16 16:19:48] INFO compile.py:164: Running optimizations using TVM Unity
[2024-10-16 16:19:48] INFO compile.py:185: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 8192, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 128}
[2024-10-16 16:19:50] INFO pipeline.py:54: Running TVM Relax graph-level optimizations
[2024-10-16 16:19:55] INFO pipeline.py:54: Lowering to TVM TIR kernels
[2024-10-16 16:20:03] INFO pipeline.py:54: Running TVM TIR-level optimizations
[2024-10-16 16:20:14] INFO pipeline.py:54: Running TVM Dlight low-level optimizations
[2024-10-16 16:20:15] INFO pipeline.py:54: Lowering to VM bytecode
[2024-10-16 16:20:18] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 64.00 MB
[2024-10-16 16:20:18] INFO estimate_memory_usage.py:58: [Memory usage] Function `argsort_probs`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 18.50 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode_to_last_hidden_states`: 19.50 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 1185.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 1248.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_select_last_hidden_states`: 1.00 MB       
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 1184.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify_to_last_hidden_states`: 1248.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.14 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode_to_last_hidden_states`: 0.15 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 64.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `gather_hidden_states`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `get_logits`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `multinomial_from_uniform`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 1184.01 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill_to_last_hidden_states`: 1248.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `renormalize_by_top_p`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `sample_with_top_p`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `sampler_take_probs`: 0.01 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `sampler_verify_draft_tokens`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `scatter_hidden_states`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-10-16 16:20:20] INFO pipeline.py:54: Compiling external modules
[2024-10-16 16:20:20] INFO pipeline.py:54: Compilation complete! Exporting to disk
Traceback (most recent call last):
  File "C:\Users\henry\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\henry\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\__main__.py", line 64, in <module>
    main()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\__main__.py", line 33, in main
    cli.main(sys.argv[2:])
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\cli\compile.py", line 129, in main
    compile(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\compile.py", line 243, in compile
    _compile(args, model_config)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\compile.py", line 188, in _compile
    args.build_func(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\support\auto_target.py", line 311, in build
    relax.build(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\vm_build.py", line 353, in build
    return _vmlink(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\vm_build.py", line 249, in _vmlink
    lib = tvm.build(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\driver\build_module.py", line 297, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 245, in __call__
    raise_last_ffi_error()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "D:\a\package\package\tvm\src\target\spirv\ir_builder.cc", line 566
InternalError: Check failed: (spirv_support_.supports_float16) is false: Vulkan target does not support Float16 capability.  If your device supports 16-bit float operations, please either add -supports_float16=1 to the target, or query all device parameters by adding -from_device=0.
Traceback (most recent call last):
  File "D:\Documents\Project\mlc\demo.py", line 5, in <module>
    engine = MLCEngine(model, device="vulkan")
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine.py", line 1467, in __init__
    super().__init__(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 590, in __init__
    ) = _process_model_args(models, device, engine_config)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 171, in _process_model_args
    model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 171, in <listcomp>
    model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 164, in _convert_model_info
    model_lib = jit.jit(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\jit.py", line 164, in jit
    _run_jit(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\jit.py", line 124, in _run_jit
    raise RuntimeError("Cannot find compilation output, compilation failed")
RuntimeError: Cannot find compilation output, compilation failed
Exception ignored in: <function MLCEngineBase.__del__ at 0x000002D6C6D7E320>
Traceback (most recent call last):
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 654, in __del__
    self.terminate()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 661, in terminate
    self._ffi["exit_background_loop"]()
AttributeError: 'MLCEngine' object has no attribute '_ffi'
Seoulsim commented 6 days ago

Accidentally closed.

vinx13 commented 6 days ago
InternalError: Check failed: (spirv_support_.supports_float16) is false: Vulkan target does not support Float16 capability.  If your device supports 16-bit float operations, please either add -supports_float16=1 to the target, or query all device parameters by adding -from_device=0.

There is another error in the middle of the log, quoted above — that is the actual failure.
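
As a sketch of what that error message suggests — hedged, since mlc_llm's JIT flow normally builds the target automatically from --device, so where exactly to inject these flags is an assumption — the two TVM target-string forms it names would look like:

import tvm

# Assumption: force the float16 capability on, per the error message.
# Whether the RX 580 Vulkan driver really supports 16-bit float ops is
# exactly what is in question here.
target = tvm.target.Target("vulkan -supports_float16=1")

# Or query all device parameters from the attached device instead:
target = tvm.target.Target("vulkan -from_device=0")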

MasterJH5574 commented 5 days ago

@Seoulsim Thank you. I may have missed this in the beginning, but may I ask if your device comes with a GPU? MLC currently requires a GPU.

Seoulsim commented 5 days ago

@MasterJH5574 My device is currently running an RX 580 with the latest drivers. Could it be that the GPU is too old for FP16 support?

MasterJH5574 commented 1 day ago

@Seoulsim Yes, it might be. Could you try replacing q4f16_1 in demo.py with q4f32_1 and see how things go?
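
That is, in demo.py (a minimal sketch; this assumes the q4f32_1 variant of the weights is published under mlc-ai on HuggingFace):

from mlc_llm import MLCEngine

# q4f32_1 keeps compute in float32, so it avoids the FP16 capability
# that the Vulkan target on this GPU reports as unsupported.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f32_1-MLC"
engine = MLCEngine(model, device="vulkan")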