"Phi-2" conversation template was added in #1469, so please use the up-to-date mlc-chat python package.
BTW, there's no need to use prebuilt package as the latest MLC provides on-device JIT compilation. You may run the follow commands to reproduce:
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "HF://junrushao/phi-2-q4f16_1-MLC"


def main():
    cm = ChatModule(
        MODEL,
        device="cuda:0",
        chat_config=ChatConfig(context_window_size=1024),
    )
    cm.generate(
        "What is the meaning of life?",
        progress_callback=callback.StreamToStdout(callback_interval=2),
    )


if __name__ == "__main__":
    main()
Thanks for the info, but your commands failed to run. The error log follows.
$ python3 test.py
Traceback (most recent call last):
File "/home/taeyeonlee/mlc-llm/test.py", line 20, in
Please do upgrade the mlc-chat and mlc-ai Python packages.
After upgrading the mlc-chat and mlc-ai Python packages with
python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly mlc-ai-nightly --upgrade
python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly --upgrade
(is that the right way to upgrade the mlc-chat and mlc-ai Python packages?), the error is still as follows.
taeyeonlee@taeyeonlee-15U50Q-SP7PL:~/mlc-llm$ python3 test.py
[2024-01-08 15:40:15] INFO auto_device.py:76: Found device: vulkan:0
[2024-01-08 15:40:15] INFO auto_device.py:76: Found device: vulkan:1
[2024-01-08 15:40:15] INFO chat_module.py:366: Using model folder: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC
[2024-01-08 15:40:15] INFO chat_module.py:367: Using mlc chat config: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
[2024-01-08 15:40:15] INFO chat_module.py:756: Model lib not found. Now compiling model lib on device...
[2024-01-08 15:40:15] INFO jit.py:106: Using cached model lib: /home/taeyeonlee/.cache/mlc_chat/model_lib/e7ecd6b7224f29540450080d7628b413.so
[2024-01-08 15:40:15] INFO model_metadata.py:55: Total memory usage: 2121.16 MB (Parameters: 1492.45 MB. KVCache: 320.00 MB. Temporary buffer: 308.71 MB)
[2024-01-08 15:40:15] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
Traceback (most recent call last):
File "/home/taeyeonlee/mlc-llm/test.py", line 20, in
Looks like you are not using CUDA actually - could you use a CUDA package instead?
After installing the CUDA package, the error says CUDA: out of memory. Is there a way to run Phi-2 on this laptop (Ubuntu, 16 GB RAM, NVIDIA GeForce MX570 2 GB)? The other model (RedPajama-INCITE-Instruct-3B-v1-q4f16_1-MLC) does run on this laptop.
taeyeonlee@taeyeonlee-15U50Q-SP7PL:~/.cache/mlc_chat/model_lib$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
taeyeonlee@taeyeonlee-15U50Q-SP7PL:~/mlc-llm$ python3 test.py
[2024-01-08 21:48:56] INFO auto_device.py:76: Found device: cuda:0
[2024-01-08 21:48:56] INFO chat_module.py:366: Using model folder: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC
[2024-01-08 21:48:56] INFO chat_module.py:367: Using mlc chat config: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
[2024-01-08 21:48:56] INFO chat_module.py:756: Model lib not found. Now compiling model lib on device...
[2024-01-08 21:48:56] INFO jit.py:106: Using cached model lib: /home/taeyeonlee/.cache/mlc_chat/model_lib/cb0702472eeffb8f3d2c633728960213.so
[2024-01-08 21:48:56] INFO model_metadata.py:55: Total memory usage: 3043.16 MB (Parameters: 1492.45 MB. KVCache: 640.00 MB. Temporary buffer: 910.71 MB)
[2024-01-08 21:48:56] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
Traceback (most recent call last):
File "/home/taeyeonlee/mlc-llm/test.py", line 20, in
I think these two lines in the logging could be helpful:
[2024-01-08 15:40:15] INFO model_metadata.py:55: Total memory usage: 3043.16 MB (Parameters: 1492.45 MB. KVCache: 640.00 MB. Temporary buffer: 910.71 MB)
[2024-01-08 21:48:56] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
Can you lower prefill_chunk_size and context_window_size to something smaller (e.g. 512)?
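For reference, a minimal sketch of passing smaller values through ChatConfig (the 512 values are just the example sizes suggested above, not a verified working configuration):

from mlc_chat import ChatConfig, ChatModule

# Smaller prefill chunk and context window shrink the KV cache and temporary buffers.
cm = ChatModule(
    "HF://junrushao/phi-2-q4f16_1-MLC",
    device="cuda:0",
    chat_config=ChatConfig(context_window_size=512, prefill_chunk_size=512),
)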
When using a lower prefill_chunk_size and context_window_size (chat_config=ChatConfig(prefill_chunk_size=128, context_window_size=128)), it still runs out of memory. I'll try Phi-2 on another PC, which has more memory (32 GB).
taeyeonlee@taeyeonlee-15U50Q-SP7PL:~/mlc-llm$ python3 test.py
[2024-01-09 13:56:43] INFO auto_device.py:76: Found device: cuda:0
[2024-01-09 13:56:43] INFO auto_device.py:85: Not found device: rocm:0
[2024-01-09 13:56:43] INFO auto_device.py:85: Not found device: metal:0
[2024-01-09 13:56:44] INFO auto_device.py:76: Found device: vulkan:0
[2024-01-09 13:56:44] INFO auto_device.py:76: Found device: vulkan:1
[2024-01-09 13:56:44] INFO auto_device.py:76: Found device: vulkan:2
[2024-01-09 13:56:44] INFO auto_device.py:85: Not found device: opencl:0
[2024-01-09 13:56:44] INFO auto_device.py:33: Using device: cuda:0
[2024-01-09 13:56:44] INFO chat_module.py:366: Using model folder: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC
[2024-01-09 13:56:44] INFO chat_module.py:367: Using mlc chat config: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
[2024-01-09 13:56:44] INFO chat_module.py:756: Model lib not found. Now compiling model lib on device...
[2024-01-09 13:56:44] INFO jit.py:83: Compiling using commands below:
[2024-01-09 13:56:44] INFO jit.py:84: /usr/bin/python3 -m mlc_chat compile dist/phi-2-q4f16_1-MLC --opt 'flashinfer=1;cublas_gemm=1;cudagraph=0' --overrides 'context_window_size=128;prefill_chunk_size=128;tensor_parallel_shards=1' --device cuda:0 --output /tmp/tmpwqfogw4y/lib.so
[2024-01-09 13:56:44] INFO auto_config.py:69: Found model configuration: dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
[2024-01-09 13:56:44] INFO auto_target.py:75: Detecting target device: cuda:0
[2024-01-09 13:56:44] INFO auto_target.py:77: Found target: {"thread_warp_size": 32, "arch": "sm_86", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
[2024-01-09 13:56:44] INFO auto_target.py:94: Found host LLVM triple: x86_64-redhat-linux-gnu
[2024-01-09 13:56:44] INFO auto_target.py:95: Found host LLVM CPU: alderlake
[2024-01-09 13:56:44] INFO auto_target.py:242: Generating code for CUDA architecture: sm_86
[2024-01-09 13:56:44] INFO auto_target.py:243: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90
[2024-01-09 13:56:44] INFO auto_config.py:151: Found model type: phi-msft. Use --model-type to override.
Compiling with arguments:
--config PhiConfig(vocab_size=51200, n_positions=2048, n_embd=2560, n_layer=32, n_inner=10240, n_head=32, rotary_dim=32, position_embedding_base=10000, layer_norm_epsilon=1e-05, context_window_size=2048, prefill_chunk_size=2048, n_head_kv=32, head_dim=80, tensor_parallel_shards=1, kwargs={})
--quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
--model-type phi-msft
--target {"thread_warp_size": 32, "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "alderlake", "keys": ["cpu"]}, "arch": "sm_86", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
--opt flashinfer=1;cublas_gemm=0;cudagraph=0
--system-lib-prefix ""
--output /tmp/tmpwqfogw4y/lib.so
--overrides context_window_size=128;sliding_window_size=None;prefill_chunk_size=128;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-01-09 13:56:44] INFO compiler_flags.py:118: Overriding context_window_size from 2048 to 128
[2024-01-09 13:56:44] INFO compiler_flags.py:118: Overriding prefill_chunk_size from 2048 to 128
[2024-01-09 13:56:44] INFO compiler_flags.py:118: Overriding tensor_parallel_shards from 1 to 1
[2024-01-09 13:56:44] INFO compile.py:131: Creating model from: PhiConfig(vocab_size=51200, n_positions=2048, n_embd=2560, n_layer=32, n_inner=10240, n_head=32, rotary_dim=32, position_embedding_base=10000, layer_norm_epsilon=1e-05, context_window_size=2048, prefill_chunk_size=2048, n_head_kv=32, head_dim=80, tensor_parallel_shards=1, kwargs={})
[2024-01-09 13:56:45] INFO compile.py:141: Exporting the model to TVM Unity compiler
[2024-01-09 13:56:45] WARNING attention.py:108: FlashInfer only head_dim in [128], but got 80
[2024-01-09 13:56:45] INFO compile.py:147: Running optimizations using TVM Unity
[2024-01-09 13:56:45] INFO compile.py:160: Registering metadata: {'model_type': 'phi-msft', 'quantization': 'q4f16_1', 'context_window_size': 128, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 128, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 41943040}
[2024-01-09 13:56:45] INFO pipeline.py:35: Running TVM Relax graph-level optimizations
[2024-01-09 13:56:46] INFO pipeline.py:35: Lowering to TVM TIR kernels
[2024-01-09 13:56:47] INFO pipeline.py:35: Running TVM TIR-level optimizations
[2024-01-09 13:56:52] INFO pipeline.py:35: Running TVM Dlight low-level optimizations
[2024-01-09 13:56:56] INFO pipeline.py:35: Lowering to VM bytecode
[2024-01-09 13:56:57] INFO estimate_memory_usage.py:55: [Memory usage] Function _initialize_effect: 0.00 MB
[2024-01-09 13:56:57] INFO estimate_memory_usage.py:55: [Memory usage] Function decode: 13.39 MB
[2024-01-09 13:56:57] INFO estimate_memory_usage.py:55: [Memory usage] Function prefill: 21.61 MB
[2024-01-09 13:56:57] INFO estimate_memory_usage.py:55: [Memory usage] Function softmax_with_temperature: 0.00 MB
[2024-01-09 13:56:57] INFO pipeline.py:35: Compiling external modules
[2024-01-09 13:56:57] INFO pipeline.py:35: Compilation complete! Exporting to disk
[2024-01-09 13:57:01] INFO compile.py:175: Generated: /tmp/tmpwqfogw4y/lib.so
[2024-01-09 13:57:01] INFO jit.py:87: Using compiled model lib: /home/taeyeonlee/.cache/mlc_chat/model_lib/50d8c79ac3552d1cd17020682c4c9164.so
[2024-01-09 13:57:02] INFO model_metadata.py:55: Total memory usage: 1554.06 MB (Parameters: 1492.45 MB. KVCache: 40.00 MB. Temporary buffer: 21.61 MB)
[2024-01-09 13:57:02] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
Traceback (most recent call last):
File "/home/taeyeonlee/mlc-llm/test.py", line 20, in
When using Phi-1.5 on this laptop (16 GB RAM) with the lower prefill_chunk_size and context_window_size (chat_config=ChatConfig(prefill_chunk_size=128, context_window_size=128)), it works, even though the decoded answer is not satisfactory.
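A sketch of the ChatModule setup for this Phi-1.5 run (the local model path is taken from the log below; the prompt passed to cm.generate is not shown in this thread, so it is omitted here):

from mlc_chat import ChatConfig, ChatModule

cm = ChatModule(
    "dist/phi-1_5-q4f16_1-MLC",  # local Phi-1.5 weights folder, as shown in the log below
    device="cuda:0",
    chat_config=ChatConfig(prefill_chunk_size=128, context_window_size=128),
)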
taeyeonlee@taeyeonlee-15U50Q-SP7PL:~/mlc-llm$ python3 test.py
[2024-01-10 11:33:48] INFO auto_device.py:76: Found device: cuda:0
[2024-01-10 11:33:48] INFO auto_device.py:85: Not found device: rocm:0
[2024-01-10 11:33:48] INFO auto_device.py:85: Not found device: metal:0
[2024-01-10 11:33:49] INFO auto_device.py:76: Found device: vulkan:0
[2024-01-10 11:33:49] INFO auto_device.py:76: Found device: vulkan:1
[2024-01-10 11:33:49] INFO auto_device.py:76: Found device: vulkan:2
[2024-01-10 11:33:49] INFO auto_device.py:85: Not found device: opencl:0
[2024-01-10 11:33:49] INFO auto_device.py:33: Using device: cuda:0
[2024-01-10 11:33:49] INFO chat_module.py:366: Using model folder: /home/taeyeonlee/mlc-llm/dist/phi-1_5-q4f16_1-MLC
[2024-01-10 11:33:49] INFO chat_module.py:367: Using mlc chat config: /home/taeyeonlee/mlc-llm/dist/phi-1_5-q4f16_1-MLC/mlc-chat-config.json
[2024-01-10 11:33:49] INFO chat_module.py:756: Model lib not found. Now compiling model lib on device...
[2024-01-10 11:33:49] INFO jit.py:83: Compiling using commands below:
[2024-01-10 11:33:49] INFO jit.py:84: /usr/bin/python3 -m mlc_chat compile dist/phi-1_5-q4f16_1-MLC --opt 'flashinfer=1;cublas_gemm=1;cudagraph=0' --overrides 'context_window_size=128;prefill_chunk_size=128;tensor_parallel_shards=1' --device cuda:0 --output /tmp/tmpu9mfl9ps/lib.so
[2024-01-10 11:33:49] INFO auto_config.py:69: Found model configuration: dist/phi-1_5-q4f16_1-MLC/mlc-chat-config.json
[2024-01-10 11:33:49] INFO auto_target.py:75: Detecting target device: cuda:0
[2024-01-10 11:33:49] INFO auto_target.py:77: Found target: {"thread_warp_size": 32, "arch": "sm_86", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
[2024-01-10 11:33:49] INFO auto_target.py:94: Found host LLVM triple: x86_64-redhat-linux-gnu
[2024-01-10 11:33:49] INFO auto_target.py:95: Found host LLVM CPU: alderlake
[2024-01-10 11:33:49] INFO auto_target.py:242: Generating code for CUDA architecture: sm_86
[2024-01-10 11:33:49] INFO auto_target.py:243: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90
[2024-01-10 11:33:49] INFO auto_config.py:151: Found model type: phi-msft. Use --model-type to override.
Compiling with arguments:
--config PhiConfig(vocab_size=51200, n_positions=2048, n_embd=2048, n_layer=24, n_inner=8192, n_head=32, rotary_dim=32, position_embedding_base=10000, layer_norm_epsilon=1e-05, context_window_size=2048, prefill_chunk_size=2048, n_head_kv=32, head_dim=64, tensor_parallel_shards=1, kwargs={})
--quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
--model-type phi-msft
--target {"thread_warp_size": 32, "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "alderlake", "keys": ["cpu"]}, "arch": "sm_86", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
--opt flashinfer=1;cublas_gemm=0;cudagraph=0
--system-lib-prefix ""
--output /tmp/tmpu9mfl9ps/lib.so
--overrides context_window_size=128;sliding_window_size=None;prefill_chunk_size=128;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-01-10 11:33:49] INFO compiler_flags.py:118: Overriding context_window_size from 2048 to 128
[2024-01-10 11:33:49] INFO compiler_flags.py:118: Overriding prefill_chunk_size from 2048 to 128
[2024-01-10 11:33:49] INFO compiler_flags.py:118: Overriding tensor_parallel_shards from 1 to 1
[2024-01-10 11:33:49] INFO compile.py:131: Creating model from: PhiConfig(vocab_size=51200, n_positions=2048, n_embd=2048, n_layer=24, n_inner=8192, n_head=32, rotary_dim=32, position_embedding_base=10000, layer_norm_epsilon=1e-05, context_window_size=2048, prefill_chunk_size=2048, n_head_kv=32, head_dim=64, tensor_parallel_shards=1, kwargs={})
[2024-01-10 11:33:49] INFO compile.py:141: Exporting the model to TVM Unity compiler
[2024-01-10 11:33:49] WARNING attention.py:108: FlashInfer only head_dim in [128], but got 64
[2024-01-10 11:33:50] INFO compile.py:147: Running optimizations using TVM Unity
[2024-01-10 11:33:50] INFO compile.py:160: Registering metadata: {'model_type': 'phi-msft', 'quantization': 'q4f16_1', 'context_window_size': 128, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 128, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 25165824}
[2024-01-10 11:33:50] INFO pipeline.py:35: Running TVM Relax graph-level optimizations
[2024-01-10 11:33:51] INFO pipeline.py:35: Lowering to TVM TIR kernels
[2024-01-10 11:33:51] INFO pipeline.py:35: Running TVM TIR-level optimizations
[2024-01-10 11:33:54] INFO pipeline.py:35: Running TVM Dlight low-level optimizations
[2024-01-10 11:33:59] INFO pipeline.py:35: Lowering to VM bytecode
[2024-01-10 11:33:59] INFO estimate_memory_usage.py:55: [Memory usage] Function _initialize_effect: 0.00 MB
[2024-01-10 11:33:59] INFO estimate_memory_usage.py:55: [Memory usage] Function decode: 8.77 MB
[2024-01-10 11:33:59] INFO estimate_memory_usage.py:55: [Memory usage] Function prefill: 15.73 MB
[2024-01-10 11:33:59] INFO estimate_memory_usage.py:55: [Memory usage] Function softmax_with_temperature: 0.00 MB
[2024-01-10 11:33:59] INFO pipeline.py:35: Compiling external modules
[2024-01-10 11:33:59] INFO pipeline.py:35: Compilation complete! Exporting to disk
[2024-01-10 11:34:03] INFO compile.py:175: Generated: /tmp/tmpu9mfl9ps/lib.so
[2024-01-10 11:34:03] INFO jit.py:87: Using compiled model lib: /home/taeyeonlee/.cache/mlc_chat/model_lib/c7e0024fece289d69d5426677e94dfbb.so
[2024-01-10 11:34:04] INFO model_metadata.py:55: Total memory usage: 801.37 MB (Parameters: 761.64 MB. KVCache: 24.00 MB. Temporary buffer: 15.73 MB)
[2024-01-10 11:34:04] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
[11:34:04] /workspace/mlc-llm/cpp/llm_chat.cc:705: Warning: The prompt tokens are too long and the generated text may be incomplete, due to limited max_window_size.
A:
You can use cv2.inRange with an array of gray levels.
import cv2
import numpy as np
image = cv2.imread('my_image.png', cv2.IMREAD_GRAYSCALE)
gray_levels = [0, 10, 20]
mask = cv2.inRange(image, np.min(gray_levels), np.max(gray_levels))
This will give you a
Thanks for your support, @junrushao. When I try Phi-2 on the PC (Ubuntu + RTX 2060 12 GB + 32 GB RAM), it works well. The model needs 3043.16 MB of GPU memory, according to the log below.
taeyeon@taeyeon-ubuntu-pc:~/mlc-llm$ python3 test.py
[2024-01-16 23:00:01] INFO auto_device.py:76: Found device: vulkan:0
[2024-01-16 23:00:01] INFO auto_device.py:76: Found device: vulkan:1
[2024-01-16 23:00:01] INFO chat_module.py:370: Using model folder: /home/taeyeon/mlc-llm/dist/phi-2-MLC
[2024-01-16 23:00:01] INFO chat_module.py:371: Using mlc chat config: /home/taeyeon/mlc-llm/dist/phi-2-MLC/mlc-chat-config.json
[2024-01-16 23:00:01] INFO chat_module.py:760: Model lib not found. Now compiling model lib on device...
/home/taeyeon/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/home/taeyeon/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
return self._float_to_str(self.smallest_subnormal)
/home/taeyeon/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/home/taeyeon/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
return self._float_to_str(self.smallest_subnormal)
[2024-01-16 23:00:02] INFO jit.py:106: Using cached model lib: /home/taeyeon/.cache/mlc_chat/model_lib/5f9614a7f67e3d57981ec0c3b3e17ce5.so
[2024-01-16 23:00:02] INFO model_metadata.py:55: Total memory usage: 3043.16 MB (Parameters: 1492.45 MB. KVCache: 640.00 MB. Temporary buffer: 910.71 MB)
[2024-01-16 23:00:02] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
Tue Jan 16 23:00:03 2024
Phi-2 is a Transformer-based model that can be used to generate human-like text. It has been trained on a mixture of Synthetic and Web datasets for NLP and programming tasks.
Example 4: Language Modeling with GPT-2
Tue Jan 16 23:00:05 2024
taeyeon@taeyeon-ubuntu-pc:~/mlc-llm$
🐛 Bug
To Reproduce
Steps to reproduce the behavior: Hi, when using the precompiled binary and weights for Phi-2, the error below occurs. Could you share how to use them?
Precompiled binary file: https://github.com/mlc-ai/binary-mlc-llm-libs/blob/main/phi-2/phi-2-q4f16_1-vulkan.so
Precompiled weights: https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC
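For reference, a rough sketch of how a prebuilt library could be passed to ChatModule (assuming the model_lib_path parameter; the local paths below are placeholders, not the exact paths used):

from mlc_chat import ChatModule

cm = ChatModule(
    "dist/phi-2-q4f16_1-MLC",  # placeholder: folder holding the weights from the HF repo above
    device="vulkan",
    model_lib_path="dist/prebuilt/phi-2-q4f16_1-vulkan.so",  # placeholder: the prebuilt library above
)
print(cm.generate("What is the meaning of life?"))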
The error log:
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Environment