mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Running Quick Start Example on Windows gives errors: 'InternalError: Check failed: (it != n->end()) is false: cannot find the corresponding key in the Map' and ''MLCEngine' object has no attribute '_ffi'' #2983

Open Seoulsim opened 1 month ago

Seoulsim commented 1 month ago

🐛 Bug

The MLCEngine code from the quick start guide fails on CPU with

'InternalError: Check failed: (it != n->end()) is false: cannot find the corresponding key in the Map'

followed by

'MLCEngine' object has no attribute '_ffi'

To Reproduce

Steps to reproduce the behavior:

  1. Create a Python 3.10.6 virtual environment and activate it: `python -m venv ./venv2`, then `.\venv2\Scripts\activate`

  2. Install MLC LLM for Windows on CPU + Vulkan according to https://llm.mlc.ai/docs/get_started/quick_start.html: `python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu`

  3. Run the demo code snippet below with the CPU device: `python demo.py`

demo.py

from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model, device="cpu")

# Run chat completion with the OpenAI-compatible API, streaming the response.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()

I get the following error:

(venv2) PS D:\Documents\Project\mlc> python demo.py
[10:26:27] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[10:26:27] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[10:26:27] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:0
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:1
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:2
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:3
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:4
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:5
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:6
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:7
[2024-10-16 10:26:29] INFO auto_device.py:79: Found device: cpu:8
[2024-10-16 10:26:29] INFO download_cache.py:227: Downloading model from HuggingFace: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-10-16 10:26:29] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-10-16 10:26:29] INFO download_cache.py:166: Weights already downloaded: C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC
[2024-10-16 10:26:29] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-10-16 10:26:29] INFO jit.py:118: Compiling using commands below:
[2024-10-16 10:26:29] INFO jit.py:119: 'D:\Documents\Project\mlc\venv2\Scripts\python.exe' -m mlc_llm compile 'C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC' --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides '' --device cpu:0 --output 'C:\Users\henry\AppData\Local\Temp\tmpz82lg7lt\lib.dll'
[10:26:30] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[10:26:30] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[10:26:30] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[2024-10-16 10:26:31] INFO auto_config.py:70: Found model configuration: C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC\mlc-chat-config.json
[2024-10-16 10:26:31] INFO auto_target.py:91: Detecting target device: cpu:0
[2024-10-16 10:26:31] INFO auto_target.py:93: Found target: {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "znver3", "keys": ["cpu"]}
[2024-10-16 10:26:31] INFO auto_target.py:110: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-10-16 10:26:31] INFO auto_target.py:111: Found host LLVM CPU: znver3
[2024-10-16 10:26:31] INFO auto_config.py:154: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, 
kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0)
  --model-type      llama
  --target          {"host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "znver3", "keys": ["cpu"]}, "mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "znver3", "keys": ["cpu"]}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          C:\Users\henry\AppData\Local\Temp\tmpz82lg7lt\lib.dll
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None     
[2024-10-16 10:26:31] INFO compile.py:140: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
[2024-10-16 10:26:31] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-10-16 10:26:35] INFO compile.py:164: Running optimizations using TVM Unity
[2024-10-16 10:26:35] INFO compile.py:185: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 8192, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 128}
Traceback (most recent call last):
  File "C:\Users\henry\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\henry\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\__main__.py", line 64, in <module>
    main()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\__main__.py", line 33, in main
    cli.main(sys.argv[2:])
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\cli\compile.py", line 129, in main
    compile(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\compile.py", line 243, in compile
    _compile(args, model_config)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\compile.py", line 188, in _compile
    args.build_func(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\support\auto_target.py", line 311, in build
    relax.build(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\vm_build.py", line 347, in build
    mod = pipeline(mod)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\ir\transform.py", line 238, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 245, in __call__
    raise_last_ffi_error()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 82, in cfun
    rv = local_pyfunc(*pyargs)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\compiler_pass\pipeline.py", line 187, in _pipeline
    mod = seq(mod)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\ir\transform.py", line 238, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 245, in __call__
    raise_last_ffi_error()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 82, in cfun
    rv = local_pyfunc(*pyargs)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\ir\transform.py", line 307, in _pass_func
    return inst.transform_module(mod, ctx)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\compiler_pass\dispatch_kv_cache_creation.py", line 106, in transform_module
    self.create_tir_paged_kv_cache(bb, kwargs)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\compiler_pass\dispatch_kv_cache_creation.py", line 145, in create_tir_paged_kv_cache
    cache = kv_cache.TIRPagedKVCache(target=self.target, **kwargs)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\frontend\nn\llm\kv_cache.py", line 367, in __init__
    bb.add_func(_attention_prefill(num_key_value_heads, num_attention_heads, head_dim, dtype, False, rope_scaling, target), "tir_attention_prefill"),
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\frontend\nn\llm\kv_cache.py", line 565, in _attention_prefill
    check_thread_limits(target, bdx=bdx, bdy=num_warps, bdz=1, gdz=1)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\frontend\nn\llm\kv_cache.py", line 59, in check_thread_limits
    max_num_threads_per_block = get_max_num_threads_per_block(target)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\frontend\nn\llm\kv_cache.py", line 41, in get_max_num_threads_per_block
    max_num_threads = target.max_num_threads
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\target\target.py", line 198, in max_num_threads
    return int(self.attrs["max_num_threads"])
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\ir\container.py", line 62, in __getitem__
    return _ffi_api.MapGetItem(self, k)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 245, in __call__
    raise_last_ffi_error()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "D:\a\package\package\tvm\src\runtime\container.cc", line 153
[10:26:36] D:\a\package\package\tvm\src\relax\ir\block_builder.cc:65: Warning: BlockBuilder destroyed with remaining blocks!
Traceback (most recent call last):
  File "D:\Documents\Project\mlc\demo.py", line 6, in <module>
    engine = MLCEngine(model, device="cpu")
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine.py", line 1467, in __init__
    super().__init__(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 590, in __init__
    ) = _process_model_args(models, device, engine_config)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 171, in _process_model_args
    model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 171, in <listcomp>
    model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 164, in _convert_model_info
    model_lib = jit.jit(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\jit.py", line 164, in jit
    _run_jit(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\jit.py", line 124, in _run_jit
    raise RuntimeError("Cannot find compilation output, compilation failed")
RuntimeError: Cannot find compilation output, compilation failed
Exception ignored in: <function MLCEngineBase.__del__ at 0x000001960BA7A320>
Traceback (most recent call last):
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 654, in __del__
    self.terminate()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 661, in terminate
    self._ffi["exit_background_loop"]()
AttributeError: 'MLCEngine' object has no attribute '_ffi' 

Environment

Note: I am running on CPU only by setting `device="cpu"` in the line `engine = MLCEngine(model, device="cpu")`.

MasterJH5574 commented 1 month ago

Hi @Seoulsim, thank you for reporting this to us. We actually don't support using "cpu" as the device. Could you try device="vulkan"? Meanwhile, we will add clearer error checks and messages to disallow the use of "cpu".
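
For reference, the only change needed in demo.py is the device argument (a minimal sketch; everything else from the quick start snippet stays the same):

from mlc_llm import MLCEngine

# Same model as in the quick start; only the device changes from "cpu" to "vulkan".
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model, device="vulkan")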

Seoulsim commented 1 month ago

Hi @MasterJH5574, thanks for the heads-up. When I run it with device="vulkan", MLCEngine is able to compile everything, but I get the following error when exporting to disk, saying my Vulkan target does not support FP16. I'm currently running on an RX 580 with the latest drivers:

(venv2) PS D:\Documents\Project\mlc> python .\demo.py
[16:19:39] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[16:19:39] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[16:19:39] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[2024-10-16 16:19:42] INFO auto_device.py:79: Found device: vulkan:0
[2024-10-16 16:19:42] INFO download_cache.py:227: Downloading model from HuggingFace: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-10-16 16:19:42] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-10-16 16:19:42] INFO download_cache.py:166: Weights already downloaded: C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC
[2024-10-16 16:19:42] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-10-16 16:19:42] INFO jit.py:118: Compiling using commands below:
[2024-10-16 16:19:42] INFO jit.py:119: 'D:\Documents\Project\mlc\venv2\Scripts\python.exe' -m mlc_llm compile 'C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC' --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides '' --device vulkan:0 --output 'C:\Users\henry\AppData\Local\Temp\tmpj5fz5upl\lib.dll'
[16:19:42] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[16:19:42] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[16:19:42] D:\a\package\package\tvm\src\target\llvm\llvm_instance.cc:226: Error: Using LLVM 19.1.2 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` 
[2024-10-16 16:19:44] INFO auto_config.py:70: Found model configuration: C:\Users\henry\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC\mlc-chat-config.json
[2024-10-16 16:19:44] INFO auto_target.py:91: Detecting target device: vulkan:0
[2024-10-16 16:19:44] INFO auto_target.py:93: Found target: {"thread_warp_size": runtime.BoxInt(1), "supports_float32": runtime.BoxBool(true), "supports_int16": runtime.BoxBool(true), "max_threads_per_block": runtime.BoxInt(1024), "supports_storage_buffer_storage_class": runtime.BoxBool(true), "supports_int8": runtime.BoxBool(true), "supports_8bit_buffer": runtime.BoxBool(true), "supports_int64": runtime.BoxBool(true), "max_num_threads": runtime.BoxInt(256), "kind": "vulkan", "tag": "", "max_shared_memory_per_block": runtime.BoxInt(32768), "supports_16bit_buffer": runtime.BoxBool(true), "supports_int32": runtime.BoxBool(true), "keys": ["vulkan", "gpu"], "supports_float16": runtime.BoxBool(false)}
[2024-10-16 16:19:44] INFO auto_target.py:110: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-10-16 16:19:44] INFO auto_target.py:111: Found host LLVM CPU: znver3
[2024-10-16 16:19:44] INFO auto_config.py:154: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0)
  --model-type      llama
  --target          {"thread_warp_size": runtime.BoxInt(1), "host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "znver3", "keys": ["cpu"]}, "supports_float32": runtime.BoxBool(true), "supports_int16": runtime.BoxBool(true), "max_threads_per_block": runtime.BoxInt(1024), "supports_storage_buffer_storage_class": runtime.BoxBool(true), "supports_int8": runtime.BoxBool(true), "supports_8bit_buffer": runtime.BoxBool(true), "supports_int64": runtime.BoxBool(true), "max_num_threads": runtime.BoxInt(256), "kind": "vulkan", "tag": "", "max_shared_memory_per_block": runtime.BoxInt(32768), "supports_16bit_buffer": runtime.BoxBool(true), "supports_int32": runtime.BoxBool(true), "keys": ["vulkan", "gpu"], "supports_float16": runtime.BoxBool(false)}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          C:\Users\henry\AppData\Local\Temp\tmpj5fz5upl\lib.dll
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None
[2024-10-16 16:19:44] INFO compile.py:140: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
[2024-10-16 16:19:44] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-10-16 16:19:48] INFO compile.py:164: Running optimizations using TVM Unity
[2024-10-16 16:19:48] INFO compile.py:185: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 8192, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 128}
[2024-10-16 16:19:50] INFO pipeline.py:54: Running TVM Relax graph-level optimizations
[2024-10-16 16:19:55] INFO pipeline.py:54: Lowering to TVM TIR kernels
[2024-10-16 16:20:03] INFO pipeline.py:54: Running TVM TIR-level optimizations
[2024-10-16 16:20:14] INFO pipeline.py:54: Running TVM Dlight low-level optimizations
[2024-10-16 16:20:15] INFO pipeline.py:54: Lowering to VM bytecode
[2024-10-16 16:20:18] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 64.00 MB
[2024-10-16 16:20:18] INFO estimate_memory_usage.py:58: [Memory usage] Function `argsort_probs`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 18.50 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode_to_last_hidden_states`: 19.50 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 1185.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 1248.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_select_last_hidden_states`: 1.00 MB       
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 1184.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify_to_last_hidden_states`: 1248.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.14 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode_to_last_hidden_states`: 0.15 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 64.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `gather_hidden_states`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `get_logits`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `multinomial_from_uniform`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 1184.01 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill_to_last_hidden_states`: 1248.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `renormalize_by_top_p`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `sample_with_top_p`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `sampler_take_probs`: 0.01 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `sampler_verify_draft_tokens`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `scatter_hidden_states`: 0.00 MB
[2024-10-16 16:20:19] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-10-16 16:20:20] INFO pipeline.py:54: Compiling external modules
[2024-10-16 16:20:20] INFO pipeline.py:54: Compilation complete! Exporting to disk
Traceback (most recent call last):
  File "C:\Users\henry\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\henry\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\__main__.py", line 64, in <module>
    main()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\__main__.py", line 33, in main
    cli.main(sys.argv[2:])
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\cli\compile.py", line 129, in main
    compile(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\compile.py", line 243, in compile
    _compile(args, model_config)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\compile.py", line 188, in _compile
    args.build_func(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\support\auto_target.py", line 311, in build
    relax.build(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\vm_build.py", line 353, in build
    return _vmlink(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\relax\vm_build.py", line 249, in _vmlink
    lib = tvm.build(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\driver\build_module.py", line 297, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 245, in __call__
    raise_last_ffi_error()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "D:\a\package\package\tvm\src\target\spirv\ir_builder.cc", line 566
InternalError: Check failed: (spirv_support_.supports_float16) is false: Vulkan target does not support Float16 capability.  If your device supports 16-bit float operations, please either add -supports_float16=1 to the target, or query all device parameters by adding -from_device=0.
Traceback (most recent call last):
  File "D:\Documents\Project\mlc\demo.py", line 5, in <module>
    engine = MLCEngine(model, device="vulkan")
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine.py", line 1467, in __init__
    super().__init__(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 590, in __init__
    ) = _process_model_args(models, device, engine_config)
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 171, in _process_model_args
    model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 171, in <listcomp>
    model_args: List[Tuple[str, str]] = [_convert_model_info(model) for model in models]
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 164, in _convert_model_info
    model_lib = jit.jit(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\jit.py", line 164, in jit
    _run_jit(
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\interface\jit.py", line 124, in _run_jit
    raise RuntimeError("Cannot find compilation output, compilation failed")
RuntimeError: Cannot find compilation output, compilation failed
Exception ignored in: <function MLCEngineBase.__del__ at 0x000002D6C6D7E320>
Traceback (most recent call last):
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 654, in __del__
    self.terminate()
  File "D:\Documents\Project\mlc\venv2\lib\site-packages\mlc_llm\serve\engine_base.py", line 661, in terminate
    self._ffi["exit_background_loop"]()
AttributeError: 'MLCEngine' object has no attribute '_ffi'

Seoulsim commented 1 month ago

accidentally closed

vinx13 commented 1 month ago

InternalError: Check failed: (spirv_support_.supports_float16) is false: Vulkan target does not support Float16 capability.  If your device supports 16-bit float operations, please either add -supports_float16=1 to the target, or query all device parameters by adding -from_device=0.

There is another error in the middle.

MasterJH5574 commented 1 month ago

@Seoulsim Thank you. I may have missed this in the beginning, but may I ask if your device comes with a GPU? MLC currently requires a GPU.

Seoulsim commented 1 month ago

@MasterJH5574 My device is currently running an RX 580 with the latest drivers. Could it be that the GPU is too old for FP16 support?

MasterJH5574 commented 1 month ago

@Seoulsim Yes, it might be. Could you try replacing q4f16_1 in demo.py with q4f32_1 and see how things go?
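
For example, the relevant lines in demo.py would become (a sketch, assuming a q4f32_1 build of the model is published under the same naming scheme on HuggingFace):

# q4f32_1 keeps activations in float32, avoiding the Vulkan FP16 requirement.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f32_1-MLC"
engine = MLCEngine(model, device="vulkan")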

kripper commented 18 hours ago

We actually don't support using "cpu" as device

Please support CPU, so we can use those big machines with 256 GB of RAM for big models. Besides, since shared GPU memory is slow, it would be useful to combine GPU (VRAM) + CPU (RAM) inference, as llama.cpp does. In https://github.com/mlc-ai/mlc-llm/pull/2903 you seem to be working on CPU support, and TVM also supports CPU targets.