❓ General Questions

I'm trying to see if MLC can run on an Intel iGPU on Windows, but no prebuilt Vulkan model library for Windows is available. I'm curious why this is not provided. Is there a plan to provide it, or is it not possible?

Closed · hpxiong closed this 6 months ago
@hpxiong Can you please describe the exact problem you are having? As far as one can determine, the Windows prebuilt is included in the build:
Docs:
@Sing-Li Sorry for not being clear; I was referring to the prebuilt models, not the library itself. Here is the table I'm referring to:
@hpxiong The new SLM JIT flow will actually convert the weights and compile the library specifically for your system automatically, so the table above is outdated. Please test it on Windows with Vulkan and see if it works for you.
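For example, here is a minimal sketch of the JIT flow from Python (the model path is illustrative and assumes the converted q4f16_1 weights are already on disk):

```python
from mlc_llm import ChatModule

# No model_lib_path is supplied, so the JIT flow compiles a Vulkan
# model lib for this specific machine on first use.
cm = ChatModule(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1-MLC",
    device="vulkan",
)
print(cm.generate(prompt="Hello!"))
```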
Thanks for the suggestions.
This probably works, but it isn't an ideal solution for plug-and-play use of already-converted models.
The compilation instructions for building Windows Vulkan models work fine, but I still think it would be a good idea to provide prebuilt models.
But it is ideal. For any specific model, the converted weights are already available via Hugging Face at https://huggingface.co/mlc-ai, while the platform-specific library is automatically generated as detailed above. All you need to do is generate the lib once, and you can "plug and play": copy and re-use it on any similarly configured system.
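For instance, a sketch of that reuse (the .dll path below is hypothetical; point `model_lib_path` at wherever you copied the generated lib):

```python
from mlc_llm import ChatModule

# Reuse a model lib that was JIT-compiled once on a similarly configured
# Windows/Vulkan machine; passing model_lib_path skips recompilation.
cm = ChatModule(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1-MLC",
    device="vulkan",
    model_lib_path="dist/libs/Llama-2-7b-chat-hf-q4f16_1-vulkan.dll",  # hypothetical path
)
```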
Thanks for the suggestions. Indeed, we have recently been moving towards encouraging JIT compilation to simplify our flow. Please check out some of the latest tutorials: https://llm.mlc.ai/docs/get_started/introduction.html
@Sing-Li Thanks for providing the additional information. I actually didn't realize the converted models were available for download; I followed the model compilation instructions, pulled the original models, and did the conversion myself. That was not fun.
I tried your suggestions and ran into the compile error attached below.
Besides the error, I have two questions:
I can contribute the compiled libs to the community if it helps.
Thanks!
(llm) C:\llm\models\mlc>mlc_llm chat dist\\models\\Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-04-18 10:15:03] INFO auto_device.py:85: Not found device: cuda:0
[2024-04-18 10:15:05] INFO auto_device.py:85: Not found device: rocm:0
[2024-04-18 10:15:06] INFO auto_device.py:85: Not found device: metal:0
[2024-04-18 10:15:09] INFO auto_device.py:76: Found device: vulkan:0
[2024-04-18 10:15:09] INFO auto_device.py:76: Found device: vulkan:1
[2024-04-18 10:15:11] INFO auto_device.py:85: Not found device: opencl:0
[2024-04-18 10:15:11] INFO auto_device.py:33: Using device: vulkan:0
[2024-04-18 10:15:11] INFO chat_module.py:379: Using model folder: C:\llm\models\mlc\dist\models\Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-04-18 10:15:11] INFO chat_module.py:380: Using mlc chat config: C:\llm\models\mlc\dist\models\Llama-2-7b-chat-hf-q4f16_1-MLC\mlc-chat-config.json
[2024-04-18 10:15:11] INFO chat_module.py:781: Model lib not found. Now compiling model lib on device...
[2024-04-18 10:15:11] INFO jit.py:35: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-04-18 10:15:11] INFO jit.py:94: Compiling using commands below:
[2024-04-18 10:15:11] INFO jit.py:95: 'C:\Users\me\.conda\envs\llm\python.exe' -m mlc_llm compile 'dist\models\Llama-2-7b-chat-hf-q4f16_1-MLC' --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=1;cudagraph=0;cutlass=1;ipc_allreduce_strategy=NONE' --overrides 'context_window_size=4096;prefill_chunk_size=4096;tensor_parallel_shards=1' --device vulkan:0 --output 'C:\Users\me\AppData\Local\Temp\tmpv9s4rytg\lib.dll'
[2024-04-18 10:15:12] INFO auto_config.py:69: Found model configuration: dist\models\Llama-2-7b-chat-hf-q4f16_1-MLC\mlc-chat-config.json
[2024-04-18 10:15:12] INFO auto_target.py:84: Detecting target device: vulkan:0
[2024-04-18 10:15:12] INFO auto_target.py:86: Found target: {"thread_warp_size": 1, "supports_float32": T.bool(True), "supports_int16": 1, "supports_int32": T.bool(True), "max_threads_per_block": 1024, "supports_int8": 1, "supports_int64": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 32768, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 1}
[2024-04-18 10:15:12] INFO auto_target.py:103: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-04-18 10:15:12] INFO auto_target.py:104: Found host LLVM CPU: alderlake
[2024-04-18 10:15:12] INFO auto_config.py:153: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
--config LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-06, vocab_size=32000, position_embedding_base=10000, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
--quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
--model-type llama
--target {"thread_warp_size": 1, "host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "alderlake", "keys": ["cpu"]}, "supports_int16": 1, "supports_float32": T.bool(True), "supports_int32": T.bool(True), "max_threads_per_block": 1024, "supports_int8": 1, "supports_int64": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 32768, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 1}
--opt flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
--system-lib-prefix ""
--output C:\Users\me\AppData\Local\Temp\tmpv9s4rytg\lib.dll
--overrides context_window_size=4096;sliding_window_size=None;prefill_chunk_size=4096;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-04-18 10:15:12] INFO config.py:106: Overriding context_window_size from 4096 to 4096
[2024-04-18 10:15:12] INFO config.py:106: Overriding prefill_chunk_size from 4096 to 4096
[2024-04-18 10:15:12] INFO config.py:106: Overriding tensor_parallel_shards from 1 to 1
[2024-04-18 10:15:12] INFO compile.py:137: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-06, vocab_size=32000, position_embedding_base=10000, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-04-18 10:15:12] INFO compile.py:156: Exporting the model to TVM Unity compiler
[2024-04-18 10:15:17] INFO compile.py:162: Running optimizations using TVM Unity
[2024-04-18 10:15:17] INFO compile.py:176: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 4096, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 4096, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0}
[2024-04-18 10:15:18] INFO pipeline.py:50: Running TVM Relax graph-level optimizations
[2024-04-18 10:16:44] INFO pipeline.py:50: Lowering to TVM TIR kernels
[2024-04-18 10:16:50] INFO pipeline.py:50: Running TVM TIR-level optimizations
[2024-04-18 10:17:09] INFO pipeline.py:50: Running TVM Dlight low-level optimizations
[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:09] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[10:17:10] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
[2024-04-18 10:17:10] INFO pipeline.py:50: Lowering to VM bytecode
[2024-04-18 10:17:14] INFO estimate_memory_usage.py:57: [Memory usage] Function `alloc_embedding_tensor`: 32.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode`: 9.02 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode_to_last_hidden_states`: 9.65 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_get_logits`: 0.62 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill`: 462.62 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 494.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_select_last_hidden_states`: 0.62 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify`: 462.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify_to_last_hidden_states`: 494.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode`: 0.11 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode_to_last_hidden_states`: 0.12 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `embed`: 32.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `get_logits`: 0.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill`: 462.01 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill_to_last_hidden_states`: 494.00 MB
[2024-04-18 10:17:15] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-04-18 10:17:17] INFO pipeline.py:50: Compiling external modules
[2024-04-18 10:17:17] INFO pipeline.py:50: Compilation complete! Exporting to disk
[2024-04-18 10:17:23] INFO model_metadata.py:96: Total memory usage: 4109.13 MB (Parameters: 3615.13 MB. KVCache: 0.00 MB. Temporary buffer: 494.00 MB)
[2024-04-18 10:17:23] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-04-18 10:17:23] INFO compile.py:198: Generated: C:\Users\me\AppData\Local\Temp\tmpv9s4rytg\lib.dll
Traceback (most recent call last):
File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\chat_module.py", line 772, in __init__
self.model_lib_path = _get_lib_module_path(
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\chat_module.py", line 591, in _get_lib_module_path
raise FileNotFoundError(err_msg)
FileNotFoundError: Cannot find the model library that corresponds to `None`.
`None` is either provided in the `chat_config` you passed in, or specified in dist\\models\\Llama-2-7b-chat-hf-q4f16_1-MLC\mlc-chat-config.json.
We searched over the following possible paths:
- None-vulkan.dll
- dist/prebuilt/lib/None-vulkan.dll
- dist/dist\\models\\Llama-2-7b-chat-hf-q4f16_1-MLC/None-vulkan.dll
- dist\\models\\Llama-2-7b-chat-hf-q4f16_1-MLC\None-vulkan.dll
- C:\llm\models\mlc\dist\models\None-vulkan.dll
If you would like to directly specify the model library path, you may consider passing in the `ChatModule.model_lib_path` parameter.
Please checkout https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb for an example on how to load a model.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\me\.conda\envs\llm\Scripts\mlc_llm.exe\__main__.py", line 7, in <module>
File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\__main__.py", line 37, in main
cli.main(sys.argv[2:])
File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\cli\chat.py", line 41, in main
chat(
File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\interface\chat.py", line 133, in chat
cm = ChatModule(model, device, chat_config=config, model_lib_path=model_lib_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\chat_module.py", line 785, in __init__
jit.jit(
File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\interface\jit.py", line 123, in jit
_run_jit(
File "C:\Users\me\.conda\envs\llm\Lib\site-packages\mlc_llm\interface\jit.py", line 96, in _run_jit
subprocess.run(cmd, check=True)
File "C:\Users\me\.conda\envs\llm\Lib\subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\me\\.conda\\envs\\llm\\python.exe', '-m', 'mlc_llm', 'compile', 'dist\\models\\Llama-2-7b-chat-hf-q4f16_1-MLC', '--opt', 'flashinfer=1;cublas_gemm=1;faster_transformer=1;cudagraph=0;cutlass=1;ipc_allreduce_strategy=NONE', '--overrides', 'context_window_size=4096;prefill_chunk_size=4096;tensor_parallel_shards=1', '--device', 'vulkan:0', '--output', 'C:\\Users\\me\\AppData\\Local\\Temp\\tmpv9s4rytg\\lib.dll']' returned non-zero exit status 3221226505.
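For what it's worth, the failing JIT step can be reproduced in isolation by re-running the compile command from the log above by hand, e.g. `python -m mlc_llm compile dist\models\Llama-2-7b-chat-hf-q4f16_1-MLC --device vulkan:0 --output dist\libs\Llama-2-7b-chat-hf-q4f16_1-vulkan.dll` (output path illustrative). This writes the lib to a persistent location rather than a temp directory; since the log shows the lib was generated before the process crashed on exit, the manually produced .dll may still be usable via `model_lib_path`.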