mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Apple Metal/MPS -- TVM/MLC-LLM won't compile from source #2540

Closed BuildBackBuehler closed 2 months ago

BuildBackBuehler commented 3 months ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

I've compiled each of them a few times. But since I updated and attempted to compile again, I've been unable to (except once; I'm not sure if that was pure luck or a matter of it being a stock, no-features build). For what that one success is worth, it was off a fresh download of the git repo, whereas when I drop the features back to stock after a failed build, it still fails.

Features that seem to exacerbate the issue: BLAS, MKL, CoreML, Arm Compute Lib., basically anything that'd go through MPS; any of them causes this foundational error. (There may also be an error, for the file src/runtime/contrib/ACL/allocator.cc, about an inability to find acl/runtime/IAllocator.h + Core/Types.h, which doesn't make sense, since I've gone out of my way to add the precise directories IAllocator.h lives in (ACL/arm_compute/core + ACL/arm_compute/runtime) to my include flags/CMake config.) The foundational error:

[ 53%] Building CXX object tvm/CMakeFiles/tvm_objs.dir/src/tir/analysis/verify_ssa.cc.o
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/conv.mm:36:25: error: 'CopyDataFromTo' is a protected member of 'tvm::runtime::metal::MetalWorkspace'
   36 |   entry_ptr->metal_api->CopyDataFromTo((__bridge void*)mtlbuf, 0, (__bridge void*)temp, 0,
      |                         ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/../../metal/metal_common.h:187:8: note: declared protected here
  187 |   void CopyDataFromTo(const void* from, size_t from_size, void* to, size_t to_size, size_t size,
      |        ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/conv.mm:72:25: error: 'CopyDataFromTo' is a protected member of 'tvm::runtime::metal::MetalWorkspace'
   72 |   entry_ptr->metal_api->CopyDataFromTo((__bridge void*)temp, 0, (__bridge void*)mtlbuf, 0,
      |                         ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/../../metal/metal_common.h:187:8: note: declared protected here
  187 |   void CopyDataFromTo(const void* from, size_t from_size, void* to, size_t to_size, size_t size,
      |        ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/conv.mm:106:53: error: no member named 'GetCommandQueue' in 'tvm::runtime::metal::MetalWorkspace'
  106 |   id<MTLCommandQueue> queue = entry_ptr->metal_api->GetCommandQueue(data->device);
      |                               ~~~~~~~~~~~~~~~~~~~~  ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/conv.mm:115:25: error: 'CopyDataFromTo' is a protected member of 'tvm::runtime::metal::MetalWorkspace'
  115 |   entry_ptr->metal_api->CopyDataFromTo((__bridge void*)bufB, 0, (__bridge void*)tempB, 0,
      |                         ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/../../metal/metal_common.h:187:8: note: declared protected here
  187 |   void CopyDataFromTo(const void* from, size_t from_size, void* to, size_t to_size, size_t size,
      |        ^
4 errors generated.
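
For anyone hitting the same wall, the diagnostic above is the standard C++ access-control error, and it reproduces in isolation. A minimal sketch (the class and method names mirror the ones in the log; everything else is illustrative, not the real TVM runtime):

    // Minimal reproduction of the diagnostic: a protected member is callable
    // from inside the class (or a subclass), but not through an object from
    // unrelated code such as contrib/mps/conv.mm.
    class MetalWorkspace {
     protected:
      void CopyDataFromTo() {}  // stand-in for the real, longer signature
    };

    int main() {
      MetalWorkspace ws;
      (void)ws;  // silence unused-variable warnings
      // ws.CopyDataFromTo();  // uncommenting reproduces:
      //                       // error: 'CopyDataFromTo' is a protected member
      return 0;
    }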

I originally posted about this issue against TVM six days ago, but that thread is inactive: https://github.com/mlc-ai/relax/issues/321. It has more errors and general information for context if needed.

Environment

Additional context

Sadly, I couldn't find anything on the net about how to fix this error. I figured there'd be a lot of these MTLCommandQueue errors out there, but found nothing concrete.

I only tried compiling MLC-LLM once or twice, and that was with my own TVM (the one that compiled that one time; well, the .dylibs, not a pure 100% instance, and I had attempted more builds after). I suppose I'll give compiling MLC with the 3rd-party TVM a try, but I need to manipulate MLC's quantization file so I can import a custom-quantized model of mine.

tqchen commented 3 months ago

Likely you don't need to turn on ARM Compute and MPS, since we generate our own Metal code.

BuildBackBuehler commented 3 months ago

Hm, funny, I came back here to comment that I got MLC-LLM to compile, without MPS on. The problem is that now, when I went to compile a model, I kept getting

    raise ValueError(f"No target detected from device: {hint}. Please specify explicitly")

And when I included `--device metal`:

File "/Users/zack/.home/gitrepos/LLMLife/frontend/mlc-llm/python/mlc_llm/support/auto_device.py", line 42, in detect_device
    raise ValueError(f"Device is not found on your local environment: {device_hint}")

I also have device="mps" set as an environment variable, and MTLDevice=1.

tqchen commented 3 months ago

ah, you need to write device="metal"

BuildBackBuehler commented 3 months ago

> ah, you need to write device="metal"

🤦‍♂️ it's always something so stupid, agh! Thank you

Welp: ValueError: Cannot detect device `metal(0)`. Please make sure the device and its driver is installed properly, and TVM is compiled with the driver. I'm guessing that when I did the compilation that worked, I must've neglected to even include Metal. Will see what happens when I do another clean build + Metal.

And while I'm here: I neglected to mention the warnings that always serve as a precursor to the 4 "foundational" errors I mentioned:

[ 16%] Building CXX object CMakeFiles/tvm_runtime_objs.dir/src/runtime/workspace_pool.cc.o
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/threading_backend.cc:296:30: warning: zero as null pointer constant [-Wzero-as-null-pointer-constant]
  296 |     SetThreadFullCpuAffinity(CURRENT_THREAD_HANDLE, mode);
      |                              ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/threading_backend.cc:51:77: note: expanded from macro 'CURRENT_THREAD_HANDLE'
   51 | #define CURRENT_THREAD_HANDLE (static_cast<std::thread::native_handle_type>(0))
      |                                                                             ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/threading_backend.cc:441:25: warning: zero as null pointer constant [-Wzero-as-null-pointer-constant]
  441 |       SetThreadAffinity(CURRENT_THREAD_HANDLE,
      |                         ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/threading_backend.cc:51:77: note: expanded from macro 'CURRENT_THREAD_HANDLE'
   51 | #define CURRENT_THREAD_HANDLE (static_cast<std::thread::native_handle_type>(0))

Edit: TVM just compiled with MPS off! (!!! because I was able to keep ACL, MKL, and whatever else on.) I imagine that when I recompile MLC it'll work. Should I leave this open, since it is still... technically an issue? 😂 At least I had been compiling with MPS on for months.

BuildBackBuehler commented 3 months ago

I have a feeling this is the answer to my prayers.

I had swapped out DMLC_Core and the related files that were changed when it was updated, without luck. Then I sought to check when those threading/pool files were last modified.

I haven't tried those files yet, but I figure it must be Apple's policies with threading, and this seems to confirm it.

But I dunno; I only skimmed it and saw a couple of apples, though maybe compared to a couple of oranges.

These just came up in my search, so: https://github.com/JuvignyEnsta/IN203_SUPPORT_COURS/blob/master/Examples/thread_extension.cpp and https://github.com/rurban/smhasher/blob/master/Platform.h

And I noticed this change: https://github.com/mlc-ai/relax/commit/3a423615eed95b27e1e07b30b294999024d7e2e9

BuildBackBuehler commented 2 months ago

I was able to compile TVM/MLC, but it's producing segmentation faults on weight conversion of my Codestral model. I've also gotten errors compiling a 3-bit OmniQuant Llama model (gen_config worked fine) and when trying to chat with an AQLM 2-bit model I had managed to compile previously. However, I'm not sure the .dylib I'd compiled was legitimate (I used no_quant for gen_config/compilation), so I'd need to double back for another compile anyway.

Seems the root of it is that runtime/metal/metal_common.h declares only a

protected: CopyDataFromTo(vars, etc. etc.)

with no public CopyDataFromTo defined. There's also no GetCommandQueue (used in metal_api.mm and conv.mm, IIRC) defined in metal_common.h.

I remedied those issues. Then it was only a matter of introducing the public CopyDataFromTo wrapper function and the GetCommandQueue definition into conv.mm and gemm.mm (the contrib/MPS files) and metal_api.mm. Well, that's if the resolution is kosher; judging from the errors I'm experiencing, it breaks something important (I'm guessing the data should be transferred in a protected state). I'm thinking there must've just been a discrepancy (in conv.mm or metal_api.mm?) or a missing definition (GetCommandQueue) prohibiting the protected data transfer. A sketch of the wrapper pattern is below.
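
A minimal, self-contained sketch of that wrapper pattern, assuming nothing about the real TVM signatures beyond what the error log shows (names and parameters below are simplified stand-ins, not the upstream declarations):

    #include <cstddef>
    #include <cstring>
    #include <iostream>

    class Workspace {
     public:
      // Public forwarder: out-of-class callers (like the contrib/MPS files)
      // go through this instead of touching the protected member directly.
      void PublicCopyDataFromTo(const void* from, void* to, std::size_t size) {
        CopyDataFromTo(from, to, size);
      }

     protected:
      // Protected implementation, callable only from Workspace and subclasses.
      void CopyDataFromTo(const void* from, void* to, std::size_t size) {
        std::memcpy(to, from, size);
      }
    };

    int main() {
      Workspace ws;
      char src[] = "hello";
      char dst[sizeof(src)] = {};
      // ws.CopyDataFromTo(src, dst, sizeof(src));    // error: protected member
      ws.PublicCopyDataFromTo(src, dst, sizeof(src));  // OK: forwarder is public
      std::cout << dst << "\n";
    }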

Compiling with arguments:
  --config          LlamaConfig(hidden_size=8192, intermediate_size=28672, num_attention_heads=64, num_hidden_layers=80, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=2048, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
  --quantization    GroupQuantize(name='AQLM_2bit', kind='group-quant', group_size=16, quantize_dtype='int2', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=16, num_storage_per_group=1, max_int_value=1)
  --model-type      llama
  --target          {"max_num_threads": 256, "max_shared_memory_per_block": 32768, "max_function_args": 31, "max_threads_per_block": 1024, "thread_warp_size": 32, "keys": ["metal", "gpu"], "host": {"keys": ["cpu"], "mtriple": "5", "tag": "", "kind": "llvm"}, "tag": "", "kind": "metal"}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          /Users/zack/.home/local/models/2bitllama/aqlm.dylib
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None
[2024-06-12 20:10:05] INFO compile.py:127: Creating model from: LlamaConfig(hidden_size=8192, intermediate_size=28672, num_attention_heads=64, num_hidden_layers=80, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=2048, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-06-12 20:10:05] INFO compile.py:145: Exporting the model to TVM Unity compiler
[2024-06-12 20:10:13] INFO compile.py:151: Running optimizations using TVM Unity
[2024-06-12 20:10:13] INFO compile.py:171: Registering metadata: {'model_type': 'llama', 'quantization': 'AQLM_2bit', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 2048, 'tensor_parallel_shards': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 80}
[2024-06-12 20:10:15] INFO pipeline.py:52: Running TVM Relax graph-level optimizations
[2024-06-12 20:22:53] INFO pipeline.py:52: Lowering to TVM TIR kernels
[2024-06-12 20:23:09] INFO pipeline.py:52: Running TVM TIR-level optimizations
[2024-06-12 20:24:30] INFO pipeline.py:52: Running TVM Dlight low-level optimizations
[2024-06-12 20:24:31] INFO pipeline.py:52: Lowering to VM bytecode
[2024-06-12 20:24:44] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 32.00 MB
[2024-06-12 20:24:44] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 23.12 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode_to_last_hidden_states`: 24.38 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 593.25 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 624.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_select_last_hidden_states`: 1.25 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 592.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify_to_last_hidden_states`: 624.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.29 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode_to_last_hidden_states`: 0.30 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 32.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `gather_hidden_states`: 0.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `get_logits`: 0.00 MB
[2024-06-12 20:24:46] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 592.02 MB
[2024-06-12 20:24:46] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill_to_last_hidden_states`: 624.00 MB
[2024-06-12 20:24:46] INFO estimate_memory_usage.py:58: [Memory usage] Function `scatter_hidden_states`: 0.00 MB
[2024-06-12 20:24:46] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-06-12 20:24:51] INFO pipeline.py:52: Compiling external modules
[2024-06-12 20:24:51] INFO pipeline.py:52: Compilation complete! Exporting to disk
!!!!!!! TVM encountered a Segfault !!!!!!!
Stack trace:

[1]    30875 segmentation fault  mlc_llm compile /Users/zack/.home/local/models/2bitllama --host 5 --device

BuildBackBuehler commented 2 months ago

[1] 63036 segmentation fault  mlc_llm convert_weight /Users/zack/.home/local/models/Uncensored_Llama-70B
/Users/zack/.home/local/mise/installs/python/3.11.9/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I'm losing my sanity at this point. My Python/Poetry installs appear to be ARM64, so that rules that possibility out. I checked because all the multiprocessing errors I saw here have been related to that or to other user errors.

My last remaining guess, and I wish I had just turned it off when I turned off the ARM Compute Lib., is BLAS. I'm feeling a bit dumb now, really, because I believe that when these issues started and I posted this in TVM, I noted Apple BLAS was suspect. I think the code is out of date, because it shoots warnings (and, before I messed around with the Metal/MPS code, errors!) about relying on sgemm and dgemm or whatever functions that are deprecated. And I tried everything under the sun to force CMake to incorporate the new Apple BLAS, without change. So I'll be turning that off now, too.
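
For reference on those deprecation warnings: since macOS 13.3, Accelerate ships a newer BLAS/LAPACK interface gated behind the ACCELERATE_NEW_LAPACK define, and the legacy cblas_sgemm/cblas_dgemm entry points warn as deprecated without it. A minimal sketch of a call against the newer interface, assuming a recent SDK (compile with: clang++ gemm.cc -framework Accelerate):

    // Opt into Apple's non-deprecated BLAS/LAPACK headers (macOS >= 13.3).
    #define ACCELERATE_NEW_LAPACK
    #include <Accelerate/Accelerate.h>

    #include <cstdio>

    int main() {
      // C = A * B for 2x2 row-major single-precision matrices.
      float A[4] = {1, 2, 3, 4};
      float B[4] = {5, 6, 7, 8};
      float C[4] = {0, 0, 0, 0};
      cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  /*M=*/2, /*N=*/2, /*K=*/2,
                  /*alpha=*/1.0f, A, /*lda=*/2, B, /*ldb=*/2,
                  /*beta=*/0.0f, C, /*ldc=*/2);
      std::printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);  // 19 22 43 50
      return 0;
    }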

I normally do have tons of modules on: AOTExec, UMA, BNNS, Threads, RPC, CPP TVM, CPP RPC, Profiler, Graph Executor, CoreML, TCMalloc, MLIR, Pipeline. I think that might be it 😂😅

BuildBackBuehler commented 2 months ago

The CMake module Modules/OpenMP.cmake should be updated, because there's nothing Apple-friendly in it.

OpenMPcmake.txt

/Users/zack/.home/gitrepos/LLMLife/backend/tvm/src/relay/backend/contrib/bnns/codegen.cc:93:16: error: call to 'GetRootCall' is ambiguous
   93 |         call = GetRootCall(body, 1, {"nn.conv2d", add_op_type});
      |                ^~~~~~~~~~~

codegen copycc.txt: I also had to update this file to get rid of the error above.

Sadly, stock, with all options off (except Metal), still segfaults on convert_weight + compile 😭

Edit: 🤦 turns out the dang EXE binary wasn't updating; no wonder nothing was happening. I just got it working with a stock build. Time to try to piece it back up to the full shebang.

Edit 2: Also, `case kAvailableGlobalMemory: break;` should be added to metal_device_api.mm, under `ICHECK_LT(index, devices.size()) << "Invalid device id " << index;`, inside the `switch (kind)` that already handles `case kMaxThreadsPerBlock: { *rv = static_cast<int>([devices[dev.device_id] maxThreadsPerThreadgroup].width); break; }` (see the sketch below).
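
A self-contained sketch of that edit: give kAvailableGlobalMemory its own empty case so querying it is a no-op on Metal rather than hitting the unhandled path. The enum and function shape mirror TVM's DeviceAttrKind/GetAttr, but the body is illustrative:

    #include <cstdint>
    #include <iostream>

    // Illustrative stand-ins for TVM's DeviceAttrKind values.
    enum DeviceAttrKind { kMaxThreadsPerBlock, kAvailableGlobalMemory };

    void GetAttr(DeviceAttrKind kind, int64_t* rv) {
      switch (kind) {
        case kMaxThreadsPerBlock:
          *rv = 1024;  // stand-in for [devices[dev.device_id] maxThreadsPerThreadgroup].width
          break;
        case kAvailableGlobalMemory:
          break;  // the suggested addition: leave *rv untouched on Metal
      }
    }

    int main() {
      int64_t rv = -1;
      GetAttr(kAvailableGlobalMemory, &rv);
      std::cout << rv << "\n";  // prints -1: attribute left unreported
    }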

Edit 3: Seems I've gotten everything on besides MPS; hopefully that can be fixed sooner rather than later!