mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] any reason why Vulkan (Windows) prebuilt is not provided? #2127

Closed: hpxiong closed this issue 6 months ago

hpxiong commented 7 months ago

❓ General Questions

I'm trying to see if MLC can run on an Intel iGPU on Windows, but no Vulkan (Windows) prebuilt is available. I'm curious why this is not provided. Is there a plan to provide it, or is it not possible?

Sing-Li commented 7 months ago

@hpxiong can you please describe the exact problem you are having? As far as I can determine, a Windows prebuilt is included in the build:

image

Docs:

image

hpxiong commented 7 months ago

@Sing-Li Sorry for not being clear; I was referring to the prebuilt models, not the library itself. Here is the table I'm referring to:

image

Sing-Li commented 7 months ago

@hpxiong the new SLM JIT flow will automatically convert the weights and compile a library specific to your system, so the table above is outdated. Please test it on Windows with Vulkan and see if it works for you.

Docs: https://llm.mlc.ai/docs/deploy/cli.html
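
For illustration, a minimal sketch of that flow from Python, assuming the `ChatModule` API from the docs above and converted weights already on disk (the model path below is just an example): when no model library is found for the chosen device, the JIT step builds one automatically.

```python
# Minimal sketch of the SLM JIT flow, assuming converted weights already sit
# under dist/models/ (path is illustrative). If no model library exists for the
# chosen device, the JIT step compiles one automatically on first use.
from mlc_llm import ChatModule

cm = ChatModule(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1-MLC",
    device="vulkan",  # or "vulkan:0"; "auto" lets MLC pick a device
)
print(cm.generate(prompt="Hello from Vulkan on Windows"))
```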

hpxiong commented 7 months ago

> @hpxiong the new SLM JIT flow will automatically convert the weights and compile a library specific to your system, so the table above is outdated. Please test it on Windows with Vulkan and see if it works for you.
>
> Docs: https://llm.mlc.ai/docs/deploy/cli.html

Thanks for the suggestions.
This probably works, but it isn't an ideal solution for plug-and-play use of already-converted models.

Compiling models for Windows Vulkan by following the instructions works fine, but I still think it would be a good idea to provide prebuilt models.

Sing-Li commented 7 months ago

But it is ideal. For any specific model, the converted weights are already available on Hugging Face at https://huggingface.co/mlc-ai, while the platform-specific library is automatically generated as described above. All you need to do is generate the lib once, and then you can "plug and play": copy and re-use it on any similarly configured system.
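
To make the plug-and-play part concrete, here is a hedged sketch; the Hugging Face repo id, folder layout, and library file name are illustrative assumptions, not something verified in this thread. Pull the already-converted weights once, compile (or let JIT compile) the library once, and then reuse the copied `.dll` via `model_lib_path`:

```python
# Hedged sketch: download already-converted weights from the mlc-ai Hugging Face
# org, then reuse a model library that was compiled once and copied to a stable
# location. Repo id, paths, and file names are assumptions for illustration.
from huggingface_hub import snapshot_download
from mlc_llm import ChatModule

weights_dir = snapshot_download(
    repo_id="mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC",  # assumed repo name under huggingface.co/mlc-ai
    local_dir="dist/models/Llama-2-7b-chat-hf-q4f16_1-MLC",
)

cm = ChatModule(
    model=weights_dir,
    device="vulkan",
    model_lib_path="dist/libs/Llama-2-7b-chat-hf-q4f16_1-vulkan.dll",  # the lib built once, copied here
)
print(cm.generate(prompt="Plug and play on a similarly configured system"))
```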

tqchen commented 7 months ago

Thanks for the suggestions. Indeed, we have recently been moving towards encouraging JIT compilation to simplify our flow. Please check out some of the latest tutorials: https://llm.mlc.ai/docs/get_started/introduction.html

hpxiong commented 7 months ago

> But it is ideal. For any specific model, the converted weights are already available on Hugging Face at https://huggingface.co/mlc-ai, while the platform-specific library is automatically generated as described above. All you need to do is generate the lib once, and then you can "plug and play": copy and re-use it on any similarly configured system.

@Sing-Li Thanks for the additional information. I actually didn't realize the converted models are available for download; I followed the model compilation instructions, pulled the original models, and did the conversion myself. That was not fun.

I tried your suggestion and ran into the compile error attached below.

Besides the error, I have two questions:

1. The process waited over a minute during the step below, which is not ideal. Is there any way to make this faster?
   [2024-04-18 10:15:18] INFO pipeline.py:50: Running TVM Relax graph-level optimizations
   [2024-04-18 10:16:44] INFO pipeline.py:50: Lowering to TVM TIR kernels
2. The compiled lib is stored in a temp location. Why not save it into the model folder so it doesn't need to be compiled again? (A small sketch of a possible workaround follows below.)
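
On question 2, a hedged sketch of a possible workaround: run the compile command the JIT printed (see the log below) once explicitly, but write the `.dll` into the model folder so it only has to be built once. The device and overrides flags mirror the JIT log; the output path and the closing notes about reuse are my assumptions.

```python
# Hedged sketch for question 2: run the compile step once explicitly and write
# the .dll into the model folder instead of a temp directory. The device and
# overrides mirror the JIT command in the log below; the output path is my choice.
import subprocess
import sys

model_dir = r"dist\models\Llama-2-7b-chat-hf-q4f16_1-MLC"
subprocess.run(
    [
        sys.executable, "-m", "mlc_llm", "compile", model_dir,
        "--device", "vulkan:0",
        "--overrides", "context_window_size=4096;prefill_chunk_size=4096;tensor_parallel_shards=1",
        "--output", model_dir + r"\vulkan-lib.dll",
    ],
    check=True,
)
# The saved vulkan-lib.dll can then be passed to ChatModule via model_lib_path so
# the JIT step (and its temp-dir output) is skipped. The MLC_JIT_POLICY variable
# in the log below (ON / OFF / REDO / READONLY) also appears to control when the
# JIT recompiles, though its exact semantics aren't spelled out here.
```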

I can contribute the compiled libs to the community if it helps.

Thanks!

Environment

Error log

(llm) C:\llm\models\mlc>mlc_llm chat dist\\models\\Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-04-18 10:15:03] INFO auto_device.py:85: Not found device: cuda:0
[2024-04-18 10:15:05] INFO auto_device.py:85: Not found device: rocm:0
[2024-04-18 10:15:06] INFO auto_device.py:85: Not found device: metal:0
[2024-04-18 10:15:09] INFO auto_device.py:76: Found device: vulkan:0
[2024-04-18 10:15:09] INFO auto_device.py:76: Found device: vulkan:1
[2024-04-18 10:15:11] INFO auto_device.py:85: Not found device: opencl:0
[2024-04-18 10:15:11] INFO auto_device.py:33: Using device: vulkan:0
[2024-04-18 10:15:11] INFO chat_module.py:379: Using model folder: C:\llm\models\mlc\dist\models\Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-04-18 10:15:11] INFO chat_module.py:380: Using mlc chat config: C:\llm\models\mlc\dist\models\Llama-2-7b-chat-hf-q4f16_1-MLC\mlc-chat-config.json
[2024-04-18 10:15:11] INFO chat_module.py:781: Model lib not found. Now compiling model lib on device...
[2024-04-18 10:15:11] INFO jit.py:35: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-04-18 10:15:11] INFO jit.py:94: Compiling using commands below:
[2024-04-18 10:15:11] INFO jit.py:95: 'C:\Users\me\.conda\envs\llm\python.exe' -m mlc_llm compile 'dist\models\Llama-2-7b-chat-hf-q4f16_1-MLC' --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=1;cudagraph=0;cutlass=1;ipc_allreduce_strategy=NONE' --overrides 'context_window_size=4096;prefill_chunk_size=4096;tensor_parallel_shards=1' --device vulkan:0 --output 'C:\Users\me\AppData\Local\Temp\tmpv9s4rytg\lib.dll'
[2024-04-18 10:15:12] INFO auto_config.py:69: Found model configuration: dist\models\Llama-2-7b-chat-hf-q4f16_1-MLC\mlc-chat-config.json
[2024-04-18 10:15:12] INFO auto_target.py:84: Detecting target device: vulkan:0
[2024-04-18 10:15:12] INFO auto_target.py:86: Found target: {"thread_warp_size": 1, "supports_float32": T.bool(True), "supports_int16": 1, "supports_int32": T.bool(True), "max_threads_per_block": 1024, "supports_int8": 1, "supports_int64": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 32768, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 1}
[2024-04-18 10:15:12] INFO auto_target.py:103: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-04-18 10:15:12] INFO auto_target.py:104: Found host LLVM CPU: alderlake
[2024-04-18 10:15:12] INFO auto_config.py:153: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-06, vocab_size=32000, position_embedding_base=10000, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
  --model-type      llama
  --target          {"thread_warp_size": 1, "host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "alderlake", "keys": ["cpu"]}, "supports_int16": 1, "supports_float32": T.bool(True), "supports_int32": T.bool(True), "max_threads_per_block": 1024, "supports_int8": 1, "supports_int64": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 32768, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 1}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          C:\Users\me\AppData\Local\Temp\tmpv9s4rytg\lib.dll
  --overrides       context_window_size=4096;sliding_window_size=None;prefill_chunk_size=4096;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-04-18 10:15:12] INFO config.py:106: Overriding context_window_size from 4096 to 4096
[2024-04-18 10:15:12] INFO config.py:106: Overriding prefill_chunk_size from 4096 to 4096
[2024-04-18 10:15:12] INFO config.py:106: Overriding tensor_parallel_shards from 1 to 1
[2024-04-18 10:15:12] INFO compile.py:137: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-06, vocab_size=32000, position_embedding_base=10000, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-04-18 10:15:12] INFO compile.py:156: Exporting the model to TVM Unity compiler
[2024-04-18 10:15:17] INFO compile.py:162: Running optimizations using TVM Unity
[2024-04-18 10:15:17] INFO compile.py:176: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 4096, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 4096, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0}
[2024-04-18 10:15:18] INFO pipeline.py:50: Running TVM Relax graph-level optimizations
[2024-04-18 10:16:44] INFO pipeline.py:50: Lowering to TVM TIR kernels
[2024-04-18 10:16:50] INFO pipeline.py:50: Running TVM TIR-level optimizations
[2024-04-18 10:17:09] INFO pipeline.py:50: Running TVM Dlight low-level optimizations
[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[2024-04-18 10:17:10] INFO pipeline.py:50: Lowering to VM bytecode
[2024-04-18 10:17:14] INFO estimate_memory_usage.py:57: [Memory usage] Function `alloc_embedding_tensor`: 32.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode`: 9.02 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode_to_last_hidden_states`: 9.65 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_get_logits`: 0.62 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill`: 462.62 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 494.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_select_last_hidden_states`: 0.62 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify`: 462.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify_to_last_hidden_states`: 494.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode`: 0.11 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode_to_last_hidden_states`: 0.12 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `embed`: 32.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `get_logits`: 0.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill`: 462.01 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill_to_last_hidden_states`: 494.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-04-18 10:17:17] INFO pipeline.py:50: Compiling external modules
[2024-04-18 10:17:17] INFO pipeline.py:50: Compilation complete! Exporting to disk
[2024-04-18 10:17:23] INFO model_metadata.py:96: Total memory usage: 4109.13 MB (Parameters: 3615.13 MB. KVCache: 0.00 MB. Temporary buffer: 494.00 MB)
[2024-04-18 10:17:23] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-04-18 10:17:23] INFO compile.py:198: Generated: C:\Users\me\AppData\Local\Temp\tmpv9s4rytg\lib.dll
Traceback (most recent call last):
  File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\chat_module.py", line 772, in __init__
    self.model_lib_path = _get_lib_module_path(
                          ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\chat_module.py", line 591, in _get_lib_module_path
    raise FileNotFoundError(err_msg)
FileNotFoundError: Cannot find the model library that corresponds to `None`.
`None` is either provided in the `chat_config` you passed in, or specified in dist\\models\\Llama-2-7b-chat-hf-q4f16_1-MLC\mlc-chat-config.json.
We searched over the following possible paths:
- None-vulkan.dll
- dist/prebuilt/lib/None-vulkan.dll
- dist/dist\\models\\Llama-2-7b-chat-hf-q4f16_1-MLC/None-vulkan.dll
- dist\\models\\Llama-2-7b-chat-hf-q4f16_1-MLC\None-vulkan.dll
- C:\llm\models\mlc\dist\models\None-vulkan.dll
If you would like to directly specify the model library path, you may consider passing in the `ChatModule.model_lib_path` parameter.
Please checkout https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb for an example on how to load a model.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\me\.conda\envs\llm\Scripts\mlc_llm.exe\__main__.py", line 7, in <module>
  File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\__main__.py", line 37, in main
    cli.main(sys.argv[2:])
  File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\cli\chat.py", line 41, in main
    chat(
  File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\interface\chat.py", line 133, in chat
    cm = ChatModule(model, device, chat_config=config, model_lib_path=model_lib_path)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\chat_module.py", line 785, in __init__
    jit.jit(
  File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\interface\jit.py", line 123, in jit
    _run_jit(
  File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\interface\jit.py", line 96, in _run_jit
    subprocess.run(cmd, check=True)
  File "C:\Users\me\.conda\envs\llm\Lib\subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\me\\.conda\\envs\\llm\\python.exe', '-m', 'mlc_llm', 'compile', 'dist\\models\\Llama-2-7b-chat-hf-q4f16_1-MLC', '--opt', 'flashinfer=1;cublas_gemm=1;faster_transformer=1;cudagraph=0;cutlass=1;ipc_allreduce_strategy=NONE', '--overrides', 'context_window_size=4096;prefill_chunk_size=4096;tensor_parallel_shards=1', '--device', 'vulkan:0', '--output', 'C:\\Users\\me\\AppData\\Local\\Temp\\tmpv9s4rytg\\lib.dll']' returned non-zero exit status 3221226505.