mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] mlc-llm can't use the model (Phi-2: https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC) on Ubuntu 22.04? #1556

Closed taeyeonlee closed 8 months ago

taeyeonlee commented 8 months ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior: Hi, when using the precompiled binary and weights for Phi-2, I get the error below. Could you share how to use them?

Precompiled binary file: https://github.com/mlc-ai/binary-mlc-llm-libs/blob/main/phi-2/phi-2-q4f16_1-vulkan.so
Precompiled weights: https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC

The error log:

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mlc_chat import ChatModule
>>> cm = ChatModule(model="dist/phi-2-q4f16_1-MLC", model_lib_path="dist/libs/phi-2-q4f16_1-vulkan.so")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 713, in __init__
    self._reload(self.model_lib_path, self.model_path, user_chat_config_json_str)
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 927, in _reload
    self._reload_func(lib, model_path, app_config_json)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1535, in mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) const
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 549, in mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 529, in mlc::llm::LLMChat::LoadJSONOverride(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, bool)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 501, in mlc::llm::LLMChat::LoadJSONOverride(picojson::value const&, bool)
  File "/workspace/mlc-llm/cpp/conv_templates.cc", line 702, in mlc::llm::Conversation::FromTemplate(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
tvm._ffi.base.TVMError: Traceback (most recent call last):
  4: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) const
        at /workspace/mlc-llm/cpp/llm_chat.cc:1535
  3: mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
        at /workspace/mlc-llm/cpp/llm_chat.cc:549
  2: mlc::llm::LLMChat::LoadJSONOverride(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, bool)
        at /workspace/mlc-llm/cpp/llm_chat.cc:529
  1: mlc::llm::LLMChat::LoadJSONOverride(picojson::value const&, bool)
        at /workspace/mlc-llm/cpp/llm_chat.cc:501
  0: mlc::llm::Conversation::FromTemplate(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
        at /workspace/mlc-llm/cpp/conv_templates.cc:702
  File "/workspace/mlc-llm/cpp/conv_templates.cc", line 702
TVMError: Unknown conversation template: phi-2

Environment

junrushao commented 8 months ago

"Phi-2" conversation template was added in #1469, so please use the up-to-date mlc-chat python package.

BTW, there's no need to use the prebuilt package, as the latest MLC provides on-device JIT compilation. You may run the following script to reproduce:

from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging
logging.enable_logging()

MODEL = "HF://junrushao/phi-2-q4f16_1-MLC"

def main():
    cm = ChatModule(
        MODEL,
        device="cuda:0",
        chat_config=ChatConfig(context_window_size=1024),
    )
    cm.generate(
        "What is the meaning of life?",
        progress_callback=callback.StreamToStdout(callback_interval=2),
    )

if __name__ == "__main__":
    main()
taeyeonlee commented 8 months ago

Thanks for the info, but your commands failed to run. The error log follows.

$ python3 test.py
Traceback (most recent call last):
  File "/home/taeyeonlee/mlc-llm/test.py", line 20, in <module>
    main()
  File "/home/taeyeonlee/mlc-llm/test.py", line 11, in main
    chat_config=ChatConfig(context_window_size=1024),
TypeError: ChatConfig.__init__() got an unexpected keyword argument 'context_window_size'

junrushao commented 8 months ago

Please do upgrade the mlc-chat and mlc-ai Python packages.

taeyeonlee commented 8 months ago

After upgrading the mlc-chat and mlc-ai Python packages with:

python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly mlc-ai-nightly --upgrade
python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly --upgrade

(is this the right way to upgrade the mlc-chat and mlc-ai Python packages?)

the error still follows:

taeyeonlee@taeyeonlee-15U50Q-SP7PL:~/mlc-llm$ python3 test.py
[2024-01-08 15:40:15] INFO auto_device.py:76: Found device: vulkan:0
[2024-01-08 15:40:15] INFO auto_device.py:76: Found device: vulkan:1
[2024-01-08 15:40:15] INFO chat_module.py:366: Using model folder: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC
[2024-01-08 15:40:15] INFO chat_module.py:367: Using mlc chat config: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
[2024-01-08 15:40:15] INFO chat_module.py:756: Model lib not found. Now compiling model lib on device...
[2024-01-08 15:40:15] INFO jit.py:106: Using cached model lib: /home/taeyeonlee/.cache/mlc_chat/model_lib/e7ecd6b7224f29540450080d7628b413.so
[2024-01-08 15:40:15] INFO model_metadata.py:55: Total memory usage: 2121.16 MB (Parameters: 1492.45 MB. KVCache: 320.00 MB. Temporary buffer: 308.71 MB)
[2024-01-08 15:40:15] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
Traceback (most recent call last):
  File "/home/taeyeonlee/mlc-llm/test.py", line 20, in <module>
    main()
  File "/home/taeyeonlee/mlc-llm/test.py", line 13, in main
    cm.generate(
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 841, in generate
    self._prefill(prompt, generation_config=generation_config)
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 1058, in _prefill
    self._prefill_func(
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1576, in mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#5}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) const
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 916, in mlc::llm::LLMChat::PrefillStep(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, bool, mlc::llm::PlaceInPrompt, tvm::runtime::String)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1167, in mlc::llm::LLMChat::SampleTokenFromLogits(tvm::runtime::NDArray, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, picojson::value, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, picojson::value> > >)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1422, in mlc::llm::LLMChat::SampleFromProbOnCPU(float)
tvm._ffi.base.TVMError: Traceback (most recent call last):
  7: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#5}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) const
        at /workspace/mlc-llm/cpp/llm_chat.cc:1576
  6: mlc::llm::LLMChat::PrefillStep(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, bool, mlc::llm::PlaceInPrompt, tvm::runtime::String)
        at /workspace/mlc-llm/cpp/llm_chat.cc:916
  5: mlc::llm::LLMChat::SampleTokenFromLogits(tvm::runtime::NDArray, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, picojson::value, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, picojson::value> > >)
        at /workspace/mlc-llm/cpp/llm_chat.cc:1167
  4: mlc::llm::LLMChat::SampleFromProbOnCPU(float)
        at /workspace/mlc-llm/cpp/llm_chat.cc:1422
  3: _ZN3tvm7runtime13PackedFun
  2: tvm::runtime::TypedPackedFunc<int (tvm::runtime::NDArray, double, double)>::AssignTypedLambda<int ()(tvm::runtime::NDArray, double, double)>(int ()(tvm::runtime::NDArray, double, double), std::__cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue) const
  1: tvm::runtime::relax_vm::SampleTopPFromProb(tvm::runtime::NDArray, double, double)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/relax_vm/lm_support.cc", line 487
TVMError: The output probabilities are all NaNs, can not sample from it

junrushao commented 8 months ago

Looks like you are not using CUDA actually - could you use a CUDA package instead?
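To confirm which runtime the installed wheel can actually see, the TVM runtime bundled with the Python package can be queried directly. This is a minimal sketch using TVM's generic device API, not an mlc_chat-specific check:

import tvm

# If the installed wheel was built without CUDA support, cuda(0).exist will be False
# even though nvcc and the NVIDIA driver are present on the machine.
print("CUDA visible:  ", tvm.cuda(0).exist)
print("Vulkan visible:", tvm.vulkan(0).exist)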

taeyeonlee commented 8 months ago

After installing the CUDA package, the error says CUDA: out of memory. Is there a way to run Phi-2 on this laptop (Ubuntu, 16 GB RAM, NVIDIA GeForce MX570 2 GB)? The other model (RedPajama-INCITE-Instruct-3B-v1-q4f16_1-MLC) runs on this laptop.

taeyeonlee@taeyeonlee-15U50Q-SP7PL:~/.cache/mlc_chat/model_lib$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

taeyeonlee@taeyeonlee-15U50Q-SP7PL:~/mlc-llm$ python3 test.py
[2024-01-08 21:48:56] INFO auto_device.py:76: Found device: cuda:0
[2024-01-08 21:48:56] INFO chat_module.py:366: Using model folder: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC
[2024-01-08 21:48:56] INFO chat_module.py:367: Using mlc chat config: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
[2024-01-08 21:48:56] INFO chat_module.py:756: Model lib not found. Now compiling model lib on device...
[2024-01-08 21:48:56] INFO jit.py:106: Using cached model lib: /home/taeyeonlee/.cache/mlc_chat/model_lib/cb0702472eeffb8f3d2c633728960213.so
[2024-01-08 21:48:56] INFO model_metadata.py:55: Total memory usage: 3043.16 MB (Parameters: 1492.45 MB. KVCache: 640.00 MB. Temporary buffer: 910.71 MB)
[2024-01-08 21:48:56] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
Traceback (most recent call last):
  File "/home/taeyeonlee/mlc-llm/test.py", line 20, in <module>
    main()
  File "/home/taeyeonlee/mlc-llm/test.py", line 9, in main
    cm = ChatModule(
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 774, in __init__
    self._reload(self.model_lib_path, self.model_path, user_chat_config_json_str)
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 988, in _reload
    self._reload_func(lib, model_path, app_config_json)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1541, in mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) const
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 575, in mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 205, in LoadParams
ValueError: Traceback (most recent call last):
  5: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) const
        at /workspace/mlc-llm/cpp/llm_chat.cc:1541
  4: mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
        at /workspace/mlc-llm/cpp/llm_chat.cc:575
  3: LoadParams
        at /workspace/mlc-llm/cpp/llm_chat.cc:205
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)>::AssignTypedLambda<void (*)(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)>(void ()(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int), std::__cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue)#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
  1: tvm::runtime::relax_vm::NDArrayCache::Load(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)
  0: _ZN3tvm7runtime6deta
  10: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) const
        at /workspace/mlc-llm/cpp/llm_chat.cc:1541
  9: mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
        at /workspace/mlc-llm/cpp/llm_chat.cc:575
  8: LoadParams
        at /workspace/mlc-llm/cpp/llm_chat.cc:205
  7: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)>::AssignTypedLambda<void ()(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)>(void ()(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int), std::__cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue)#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
  6: tvm::runtime::relax_vm::NDArrayCache::Load(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)
  5: tvm::runtime::relax_vm::NDArrayCacheMetadata::FileRecord::Load(DLDevice, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, tvm::runtime::Optional) const
  4: tvm::runtime::relax_vm::NDArrayCacheMetadata::FileRecord::ParamRecord::Load(DLDevice, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, tvm::runtime::Optional) const
  3: tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional)
  2: tvm::runtime::DeviceAPI::AllocDataSpace(DLDevice, int, long const, DLDataType, tvm::runtime::Optional)
  1: tvm::runtime::CUDADeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/relax_vm/ndarray_cache_support.cc", line 255
ValueError: Error when loading parameters from params_shard_46.bin: [21:48:57] /workspace/tvm/src/runtime/cuda/cuda_device_api.cc:129: InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory

junrushao commented 8 months ago

I think these two lines from the log could be helpful:

[2024-01-08 15:40:15] INFO model_metadata.py:55: Total memory usage: 3043.16 MB (Parameters: 1492.45 MB. KVCache: 640.00 MB. Temporary buffer: 910.71 MB)
[2024-01-08 21:48:56] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size

Can you lower prefill_chunk_size and context_window_size to something smaller (e.g. 512)?
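For example (a minimal sketch of the suggestion above, reusing the script from earlier in the thread; the exact values are illustrative, not a guarantee that the model fits in 2 GB):

from mlc_chat import ChatConfig, ChatModule

# Illustrative overrides: shrink the KV cache (context_window_size) and the
# prefill workspace (prefill_chunk_size) to lower GPU memory usage.
cm = ChatModule(
    "HF://junrushao/phi-2-q4f16_1-MLC",
    device="cuda:0",
    chat_config=ChatConfig(context_window_size=512, prefill_chunk_size=512),
)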

taeyeonlee commented 8 months ago

When using lower prefill_chunk_size and context_window_size values (chat_config=ChatConfig(prefill_chunk_size=128, context_window_size=128)), it still runs out of memory. I'll try Phi-2 on the other PC, which has more memory (32 GB).

taeyeonlee@taeyeonlee-15U50Q-SP7PL:~/mlc-llm$ python3 test.py
[2024-01-09 13:56:43] INFO auto_device.py:76: Found device: cuda:0
[2024-01-09 13:56:43] INFO auto_device.py:85: Not found device: rocm:0
[2024-01-09 13:56:43] INFO auto_device.py:85: Not found device: metal:0
[2024-01-09 13:56:44] INFO auto_device.py:76: Found device: vulkan:0
[2024-01-09 13:56:44] INFO auto_device.py:76: Found device: vulkan:1
[2024-01-09 13:56:44] INFO auto_device.py:76: Found device: vulkan:2
[2024-01-09 13:56:44] INFO auto_device.py:85: Not found device: opencl:0
[2024-01-09 13:56:44] INFO auto_device.py:33: Using device: cuda:0
[2024-01-09 13:56:44] INFO chat_module.py:366: Using model folder: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC
[2024-01-09 13:56:44] INFO chat_module.py:367: Using mlc chat config: /home/taeyeonlee/mlc-llm/dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
[2024-01-09 13:56:44] INFO chat_module.py:756: Model lib not found. Now compiling model lib on device...
[2024-01-09 13:56:44] INFO jit.py:83: Compiling using commands below:
[2024-01-09 13:56:44] INFO jit.py:84: /usr/bin/python3 -m mlc_chat compile dist/phi-2-q4f16_1-MLC --opt 'flashinfer=1;cublas_gemm=1;cudagraph=0' --overrides 'context_window_size=128;prefill_chunk_size=128;tensor_parallel_shards=1' --device cuda:0 --output /tmp/tmpwqfogw4y/lib.so
[2024-01-09 13:56:44] INFO auto_config.py:69: Found model configuration: dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
[2024-01-09 13:56:44] INFO auto_target.py:75: Detecting target device: cuda:0
[2024-01-09 13:56:44] INFO auto_target.py:77: Found target: {"thread_warp_size": 32, "arch": "sm_86", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
[2024-01-09 13:56:44] INFO auto_target.py:94: Found host LLVM triple: x86_64-redhat-linux-gnu
[2024-01-09 13:56:44] INFO auto_target.py:95: Found host LLVM CPU: alderlake
[2024-01-09 13:56:44] INFO auto_target.py:242: Generating code for CUDA architecture: sm_86
[2024-01-09 13:56:44] INFO auto_target.py:243: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90
[2024-01-09 13:56:44] INFO auto_config.py:151: Found model type: phi-msft. Use --model-type to override.
Compiling with arguments: --config PhiConfig(vocab_size=51200, n_positions=2048, n_embd=2560, n_layer=32, n_inner=10240, n_head=32, rotary_dim=32, position_embedding_base=10000, layer_norm_epsilon=1e-05, context_window_size=2048, prefill_chunk_size=2048, n_head_kv=32, head_dim=80, tensor_parallel_shards=1, kwargs={}) --quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7) --model-type phi-msft --target {"thread_warp_size": 32, "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "alderlake", "keys": ["cpu"]}, "arch": "sm_86", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]} --opt flashinfer=1;cublas_gemm=0;cudagraph=0 --system-lib-prefix "" --output /tmp/tmpwqfogw4y/lib.so --overrides context_window_size=128;sliding_window_size=None;prefill_chunk_size=128;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-01-09 13:56:44] INFO compiler_flags.py:118: Overriding context_window_size from 2048 to 128
[2024-01-09 13:56:44] INFO compiler_flags.py:118: Overriding prefill_chunk_size from 2048 to 128
[2024-01-09 13:56:44] INFO compiler_flags.py:118: Overriding tensor_parallel_shards from 1 to 1
[2024-01-09 13:56:44] INFO compile.py:131: Creating model from: PhiConfig(vocab_size=51200, n_positions=2048, n_embd=2560, n_layer=32, n_inner=10240, n_head=32, rotary_dim=32, position_embedding_base=10000, layer_norm_epsilon=1e-05, context_window_size=2048, prefill_chunk_size=2048, n_head_kv=32, head_dim=80, tensor_parallel_shards=1, kwargs={})
[2024-01-09 13:56:45] INFO compile.py:141: Exporting the model to TVM Unity compiler
[2024-01-09 13:56:45] WARNING attention.py:108: FlashInfer only head_dim in [128], but got 80
[2024-01-09 13:56:45] INFO compile.py:147: Running optimizations using TVM Unity
[2024-01-09 13:56:45] INFO compile.py:160: Registering metadata: {'model_type': 'phi-msft', 'quantization': 'q4f16_1', 'context_window_size': 128, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 128, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 41943040}
[2024-01-09 13:56:45] INFO pipeline.py:35: Running TVM Relax graph-level optimizations
[2024-01-09 13:56:46] INFO pipeline.py:35: Lowering to TVM TIR kernels
[2024-01-09 13:56:47] INFO pipeline.py:35: Running TVM TIR-level optimizations
[2024-01-09 13:56:52] INFO pipeline.py:35: Running TVM Dlight low-level optimizations
[2024-01-09 13:56:56] INFO pipeline.py:35: Lowering to VM bytecode
[2024-01-09 13:56:57] INFO estimate_memory_usage.py:55: [Memory usage] Function _initialize_effect: 0.00 MB
[2024-01-09 13:56:57] INFO estimate_memory_usage.py:55: [Memory usage] Function decode: 13.39 MB
[2024-01-09 13:56:57] INFO estimate_memory_usage.py:55: [Memory usage] Function prefill: 21.61 MB
[2024-01-09 13:56:57] INFO estimate_memory_usage.py:55: [Memory usage] Function softmax_with_temperature: 0.00 MB
[2024-01-09 13:56:57] INFO pipeline.py:35: Compiling external modules
[2024-01-09 13:56:57] INFO pipeline.py:35: Compilation complete!
Exporting to disk
[2024-01-09 13:57:01] INFO compile.py:175: Generated: /tmp/tmpwqfogw4y/lib.so
[2024-01-09 13:57:01] INFO jit.py:87: Using compiled model lib: /home/taeyeonlee/.cache/mlc_chat/model_lib/50d8c79ac3552d1cd17020682c4c9164.so
[2024-01-09 13:57:02] INFO model_metadata.py:55: Total memory usage: 1554.06 MB (Parameters: 1492.45 MB. KVCache: 40.00 MB. Temporary buffer: 21.61 MB)
[2024-01-09 13:57:02] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
Traceback (most recent call last):
  File "/home/taeyeonlee/mlc-llm/test.py", line 20, in <module>
    main()
  File "/home/taeyeonlee/mlc-llm/test.py", line 9, in main
    cm = ChatModule(
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 774, in __init__
    self._reload(self.model_lib_path, self.model_path, user_chat_config_json_str)
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/mlc_chat/chat_module.py", line 988, in _reload
    self._reload_func(lib, model_path, app_config_json)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/taeyeonlee/.local/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1541, in mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) const
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 575, in mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
  File "/workspace/mlc-llm/cpp/llm_chat.cc", line 205, in LoadParams
ValueError: Traceback (most recent call last):
  5: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) const
        at /workspace/mlc-llm/cpp/llm_chat.cc:1541
  4: mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
        at /workspace/mlc-llm/cpp/llm_chat.cc:575
  3: LoadParams
        at /workspace/mlc-llm/cpp/llm_chat.cc:205
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)>::AssignTypedLambda<void (*)(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)>(void ()(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int), std::__cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue)#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
  1: tvm::runtime::relax_vm::NDArrayCache::Load(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)
  0: _ZN3tvm7runtime6deta
  10: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) const
        at /workspace/mlc-llm/cpp/llm_chat.cc:1541
  9: mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
        at /workspace/mlc-llm/cpp/llm_chat.cc:575
  8: LoadParams
        at /workspace/mlc-llm/cpp/llm_chat.cc:205
  7: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)>::AssignTypedLambda<void ()(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)>(void ()(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int), std::__cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue)#1}> >::Call(tvm::runtime::PackedFuncObj const, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)
  6: tvm::runtime::relax_vm::NDArrayCache::Load(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, int, int)
  5: tvm::runtime::relax_vm::NDArrayCacheMetadata::FileRecord::Load(DLDevice, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, tvm::runtime::Optional) const
  4: tvm::runtime::relax_vm::NDArrayCacheMetadata::FileRecord::ParamRecord::Load(DLDevice, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, tvm::runtime::Optional) const
  3: tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional)
  2: tvm::runtime::DeviceAPI::AllocDataSpace(DLDevice, int, long const, DLDataType, tvm::runtime::Optional)
  1: tvm::runtime::CUDADeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/relax_vm/ndarray_cache_support.cc", line 255
ValueError: Error when loading parameters from params_shard_46.bin: [13:57:02] /workspace/tvm/src/runtime/cuda/cuda_device_api.cc:129: InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory


When using Phi-1.5 on this laptop (16 GB RAM) with the lower prefill_chunk_size and context_window_size (chat_config=ChatConfig(prefill_chunk_size=128, context_window_size=128)), it works, even though the decoded answer is not satisfying.
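For reference, a minimal sketch of the Phi-1.5 configuration described above (the model folder name is taken from the log below; this is not the exact test.py):

from mlc_chat import ChatConfig, ChatModule, callback

# Phi-1.5 with the reduced context window and prefill chunk described above.
cm = ChatModule(
    "dist/phi-1_5-q4f16_1-MLC",
    device="cuda:0",
    chat_config=ChatConfig(prefill_chunk_size=128, context_window_size=128),
)
cm.generate(
    "What is the meaning of life?",
    progress_callback=callback.StreamToStdout(callback_interval=2),
)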

taeyeonlee@taeyeonlee-15U50Q-SP7PL:~/mlc-llm$ python3 test.py
[2024-01-10 11:33:48] INFO auto_device.py:76: Found device: cuda:0
[2024-01-10 11:33:48] INFO auto_device.py:85: Not found device: rocm:0
[2024-01-10 11:33:48] INFO auto_device.py:85: Not found device: metal:0
[2024-01-10 11:33:49] INFO auto_device.py:76: Found device: vulkan:0
[2024-01-10 11:33:49] INFO auto_device.py:76: Found device: vulkan:1
[2024-01-10 11:33:49] INFO auto_device.py:76: Found device: vulkan:2
[2024-01-10 11:33:49] INFO auto_device.py:85: Not found device: opencl:0
[2024-01-10 11:33:49] INFO auto_device.py:33: Using device: cuda:0
[2024-01-10 11:33:49] INFO chat_module.py:366: Using model folder: /home/taeyeonlee/mlc-llm/dist/phi-1_5-q4f16_1-MLC
[2024-01-10 11:33:49] INFO chat_module.py:367: Using mlc chat config: /home/taeyeonlee/mlc-llm/dist/phi-1_5-q4f16_1-MLC/mlc-chat-config.json
[2024-01-10 11:33:49] INFO chat_module.py:756: Model lib not found. Now compiling model lib on device...
[2024-01-10 11:33:49] INFO jit.py:83: Compiling using commands below:
[2024-01-10 11:33:49] INFO jit.py:84: /usr/bin/python3 -m mlc_chat compile dist/phi-1_5-q4f16_1-MLC --opt 'flashinfer=1;cublas_gemm=1;cudagraph=0' --overrides 'context_window_size=128;prefill_chunk_size=128;tensor_parallel_shards=1' --device cuda:0 --output /tmp/tmpu9mfl9ps/lib.so
[2024-01-10 11:33:49] INFO auto_config.py:69: Found model configuration: dist/phi-1_5-q4f16_1-MLC/mlc-chat-config.json
[2024-01-10 11:33:49] INFO auto_target.py:75: Detecting target device: cuda:0
[2024-01-10 11:33:49] INFO auto_target.py:77: Found target: {"thread_warp_size": 32, "arch": "sm_86", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
[2024-01-10 11:33:49] INFO auto_target.py:94: Found host LLVM triple: x86_64-redhat-linux-gnu
[2024-01-10 11:33:49] INFO auto_target.py:95: Found host LLVM CPU: alderlake
[2024-01-10 11:33:49] INFO auto_target.py:242: Generating code for CUDA architecture: sm_86
[2024-01-10 11:33:49] INFO auto_target.py:243: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90
[2024-01-10 11:33:49] INFO auto_config.py:151: Found model type: phi-msft. Use --model-type to override.
Compiling with arguments: --config PhiConfig(vocab_size=51200, n_positions=2048, n_embd=2048, n_layer=24, n_inner=8192, n_head=32, rotary_dim=32, position_embedding_base=10000, layer_norm_epsilon=1e-05, context_window_size=2048, prefill_chunk_size=2048, n_head_kv=32, head_dim=64, tensor_parallel_shards=1, kwargs={}) --quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7) --model-type phi-msft --target {"thread_warp_size": 32, "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "alderlake", "keys": ["cpu"]}, "arch": "sm_86", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]} --opt flashinfer=1;cublas_gemm=0;cudagraph=0 --system-lib-prefix "" --output /tmp/tmpu9mfl9ps/lib.so --overrides context_window_size=128;sliding_window_size=None;prefill_chunk_size=128;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-01-10 11:33:49] INFO compiler_flags.py:118: Overriding context_window_size from 2048 to 128
[2024-01-10 11:33:49] INFO compiler_flags.py:118: Overriding prefill_chunk_size from 2048 to 128
[2024-01-10 11:33:49] INFO compiler_flags.py:118: Overriding tensor_parallel_shards from 1 to 1
[2024-01-10 11:33:49] INFO compile.py:131: Creating model from: PhiConfig(vocab_size=51200, n_positions=2048, n_embd=2048, n_layer=24, n_inner=8192, n_head=32, rotary_dim=32, position_embedding_base=10000, layer_norm_epsilon=1e-05, context_window_size=2048, prefill_chunk_size=2048, n_head_kv=32, head_dim=64, tensor_parallel_shards=1, kwargs={})
[2024-01-10 11:33:49] INFO compile.py:141: Exporting the model to TVM Unity compiler
[2024-01-10 11:33:49] WARNING attention.py:108: FlashInfer only head_dim in [128], but got 64
[2024-01-10 11:33:50] INFO compile.py:147: Running optimizations using TVM Unity
[2024-01-10 11:33:50] INFO compile.py:160: Registering metadata: {'model_type': 'phi-msft', 'quantization': 'q4f16_1', 'context_window_size': 128, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 128, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 25165824}
[2024-01-10 11:33:50] INFO pipeline.py:35: Running TVM Relax graph-level optimizations
[2024-01-10 11:33:51] INFO pipeline.py:35: Lowering to TVM TIR kernels
[2024-01-10 11:33:51] INFO pipeline.py:35: Running TVM TIR-level optimizations
[2024-01-10 11:33:54] INFO pipeline.py:35: Running TVM Dlight low-level optimizations
[2024-01-10 11:33:59] INFO pipeline.py:35: Lowering to VM bytecode
[2024-01-10 11:33:59] INFO estimate_memory_usage.py:55: [Memory usage] Function _initialize_effect: 0.00 MB
[2024-01-10 11:33:59] INFO estimate_memory_usage.py:55: [Memory usage] Function decode: 8.77 MB
[2024-01-10 11:33:59] INFO estimate_memory_usage.py:55: [Memory usage] Function prefill: 15.73 MB
[2024-01-10 11:33:59] INFO estimate_memory_usage.py:55: [Memory usage] Function softmax_with_temperature: 0.00 MB
[2024-01-10 11:33:59] INFO pipeline.py:35: Compiling external modules
[2024-01-10 11:33:59] INFO pipeline.py:35: Compilation complete!
Exporting to disk
[2024-01-10 11:34:03] INFO compile.py:175: Generated: /tmp/tmpu9mfl9ps/lib.so
[2024-01-10 11:34:03] INFO jit.py:87: Using compiled model lib: /home/taeyeonlee/.cache/mlc_chat/model_lib/c7e0024fece289d69d5426677e94dfbb.so
[2024-01-10 11:34:04] INFO model_metadata.py:55: Total memory usage: 801.37 MB (Parameters: 761.64 MB. KVCache: 24.00 MB. Temporary buffer: 15.73 MB)
[2024-01-10 11:34:04] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
[11:34:04] /workspace/mlc-llm/cpp/llm_chat.cc:705: Warning: The prompt tokens are too long and the generated text may be incomplete, due to limited max_window_size.

A:

You can use cv2.inRange with an array of gray levels.

import cv2
import numpy as np

image = cv2.imread('my_image.png', cv2.IMREAD_GRAYSCALE)
gray_levels = [0, 10, 20]

mask = cv2.inRange(image, np.min(gray_levels), np.max(gray_levels))

This will give you a

taeyeonlee commented 8 months ago

Thanks for your support, @junrushao. When I try Phi-2 on the PC (Ubuntu + RTX 2060 12 GB + 32 GB RAM), it works well. The model needs 3043.16 MB of GPU memory (Parameters: 1492.45 MB + KVCache: 640.00 MB + temporary buffer: 910.71 MB), according to the log below.

taeyeon@taeyeon-ubuntu-pc:~/mlc-llm$ python3 test.py
[2024-01-16 23:00:01] INFO auto_device.py:76: Found device: vulkan:0
[2024-01-16 23:00:01] INFO auto_device.py:76: Found device: vulkan:1
[2024-01-16 23:00:01] INFO chat_module.py:370: Using model folder: /home/taeyeon/mlc-llm/dist/phi-2-MLC
[2024-01-16 23:00:01] INFO chat_module.py:371: Using mlc chat config: /home/taeyeon/mlc-llm/dist/phi-2-MLC/mlc-chat-config.json
[2024-01-16 23:00:01] INFO chat_module.py:760: Model lib not found. Now compiling model lib on device...
/home/taeyeon/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/taeyeon/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/taeyeon/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/taeyeon/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
[2024-01-16 23:00:02] INFO jit.py:106: Using cached model lib: /home/taeyeon/.cache/mlc_chat/model_lib/5f9614a7f67e3d57981ec0c3b3e17ce5.so
[2024-01-16 23:00:02] INFO model_metadata.py:55: Total memory usage: 3043.16 MB (Parameters: 1492.45 MB. KVCache: 640.00 MB. Temporary buffer: 910.71 MB)
[2024-01-16 23:00:02] INFO model_metadata.py:64: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
Tue Jan 16 23:00:03 2024
Phi-2 is a Transformer-based model that can be used to generate human-like text. It has been trained on a mixture of Synthetic and Web datasets for NLP and programming tasks.

Example 4: Language Modeling with GPT-2
Tue Jan 16 23:00:05 2024
taeyeon@taeyeon-ubuntu-pc:~/mlc-llm$