[Bug] InternalError with starcoder models

ziyu-guo commented 1 year ago

🐛 Bug

InternalError when converting starcoderbase from HuggingFace

Error msg:

zguo@HQ-N7QDT262TG:/Users/zguo/workplace/mlc-ai/mlc-llm$ python3 build.py --hf-path bigcode/starcoderbase --quantization q3f16_0 --max-seq-len 768
Weights exist at dist/models/starcoderbase, skipping download.
Using path "dist/models/starcoderbase" for model "starcoderbase"
Database paths: ['log_db/rwkv-raven-3b', 'log_db/redpajama-3b-q4f16', 'log_db/redpajama-3b-q4f32', 'log_db/rwkv-raven-1b5', 'log_db/dolly-v2-3b', 'log_db/rwkv-raven-7b', 'log_db/vicuna-v1-7b']
[11:24:10] /Users/zguo/workplace/mlc-ai/relax/src/runtime/metal/metal_device_api.mm:165: Intializing Metal device 0, name=Apple M1 Max
Host CPU dection:
  Target triple: arm64-apple-darwin21.6.0
  Process triple: arm64-apple-darwin21.6.0
  Host CPU: apple-m1
Target configured: metal -keys=metal,gpu -max_function_args=31 -max_num_threads=256 -max_shared_memory_per_block=32768 -max_threads_per_block=1024 -thread_warp_size=32
Load cached module from dist/starcoderbase-q3f16_0/mod_cache_before_build_metal.pkl and skip tracing. You can use --use-cache=0 to retrace
[11:24:12] /Users/zguo/workplace/mlc-ai/relax/src/relax/ir/block_builder.cc:64: Warning: BlockBuilder destroyed with remaining blocks!
Traceback (most recent call last):
  File "/Users/zguo/workplace/mlc-ai/mlc-llm/build.py", line 470, in <module>
    main()
  File "/Users/zguo/workplace/mlc-ai/mlc-llm/build.py", line 462, in main
    build(mod, ARGS)
  File "/Users/zguo/workplace/mlc-ai/mlc-llm/build.py", line 404, in build
    ex = relax.build(mod_deploy, args.target, system_lib=args.system_lib)
  File "/Users/zguo/workplace/mlc-ai/relax/python/tvm/relax/vm_build.py", line 321, in build
    new_mod = seq(mod)
  File "/Users/zguo/workplace/mlc-ai/relax/python/tvm/ir/transform.py", line 238, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "/Users/zguo/workplace/mlc-ai/relax/python/tvm/_ffi/_ctypes/packed_func.py", line 238, in __call__
    raise get_last_ffi_error()
tvm.error.InternalError: Traceback (most recent call last):
  File "/Users/zguo/workplace/mlc-ai/relax/src/ir/module.cc", line 288
InternalError: Check failed: (it != functions.end()) is false: There is no definition of I.GlobalVar("encode")

To Reproduce

Steps to reproduce the behavior:

Using mlc-llm commit id b71bd391239343d262c9931343e20a6aa8ce5bdc
Using TVM with associated submodule commit id
python3 build.py --hf-path bigcode/starcoderbase --quantization q3f16_0 --max-seq-len 768

Expected behavior

Generate tvm .so model files with params

Environment

Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): Metal on MBP
Operating system (e.g. Ubuntu/Windows/MacOS/...): MacOS 12.6.3
Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...) MacBookPro 2022 w/ M1Max
How you installed MLC-LLM (conda, source): source
How you installed TVM-Unity (pip, source): source from submodule. Also tried TOT mlc-ai/relax
Python version (e.g. 3.10): 3.9.16
GPU driver version (if applicable):
CUDA/cuDNN version (if applicable):
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):

python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"
USE_GTEST: AUTO
SUMMARIZE: OFF
USE_IOS_RPC: OFF
USE_ETHOSU:
CUDA_VERSION: NOT-FOUND
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_THRUST: OFF
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM:
USE_OPENCL_GTEST: /path/to/opencl/gtest
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_VITIS_AI: OFF
USE_LLVM: /opt/homebrew/opt/llvm/bin/llvm-config
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_ROCBLAS: OFF
GIT_COMMIT_HASH: f34df8fbec7e5bd5804891dad6224fb82e2ec9ec
USE_VULKAN: OFF
USE_RUST_EXT: OFF
USE_CUTLASS: OFF
USE_CPP_RPC: OFF
USE_HEXAGON: OFF
USE_CUSTOM_LOGGING: OFF
USE_UMA: OFF
USE_FALLBACK_STL_MAP: OFF
USE_SORT: ON
USE_RTTI: ON
GIT_COMMIT_TIME: 2023-07-09 10:56:28 -0400
USE_HEXAGON_SDK: /path/to/sdk
USE_BLAS: none
USE_ETHOSN: OFF
USE_LIBTORCH: OFF
USE_RANDOM: ON
USE_CUDA: OFF
USE_COREML: OFF
USE_AMX: OFF
BUILD_STATIC_RUNTIME: OFF
USE_CMSISNN: OFF
USE_KHRONOS_SPIRV: OFF
USE_CLML_GRAPH_EXECUTOR: OFF
USE_TFLITE: OFF
USE_HEXAGON_GTEST: /path/to/hexagon/gtest
PICOJSON_PATH: 3rdparty/picojson
USE_OPENCL_ENABLE_HOST_PTR: OFF
INSTALL_DEV: OFF
USE_PROFILER: ON
USE_NNPACK: OFF
LLVM_VERSION: 15.0.7
USE_OPENCL: OFF
COMPILER_RT_PATH: 3rdparty/compiler-rt
RANG_PATH: 3rdparty/rang/include
USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
USE_OPENMP: OFF
USE_BNNS: OFF
USE_CUBLAS: OFF
USE_METAL: ON
USE_MICRO_STANDALONE_RUNTIME: OFF
USE_HEXAGON_EXTERNAL_LIBS: OFF
USE_ALTERNATIVE_LINKER: AUTO
USE_BYODT_POSIT: OFF
USE_HEXAGON_RPC: OFF
USE_MICRO: OFF
DMLC_PATH: 3rdparty/dmlc-core/include
INDEX_DEFAULT_I64: ON
USE_RELAY_DEBUG: OFF
USE_RPC: ON
USE_TENSORFLOW_PATH: none
TVM_CLML_VERSION:
USE_MIOPEN: OFF
USE_ROCM: OFF
USE_PAPI: OFF
USE_CURAND: OFF
TVM_CXX_COMPILER_PATH: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
HIDE_PRIVATE_SYMBOLS: OFF
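
Most of that output is build flags. For this report the relevant fields are probably the TVM commit and the Metal/LLVM settings. A minimal sketch for pulling those out, assuming `tvm.support.libinfo()` returns a plain dict as the one-liner above implies; `summarize_libinfo` and its default key list are hypothetical names chosen for illustration, not part of TVM:

```python
# Hypothetical helper (not part of TVM): pick a few fields out of the
# dict returned by tvm.support.libinfo().
DEFAULT_KEYS = ("GIT_COMMIT_HASH", "GIT_COMMIT_TIME", "USE_METAL", "LLVM_VERSION")


def summarize_libinfo(info, keys=DEFAULT_KEYS):
    """Return only the requested keys; absent keys map to '<missing>'."""
    return {key: info.get(key, "<missing>") for key in keys}


# Values copied from the report above, standing in for tvm.support.libinfo().
sample = {
    "GIT_COMMIT_HASH": "f34df8fbec7e5bd5804891dad6224fb82e2ec9ec",
    "GIT_COMMIT_TIME": "2023-07-09 10:56:28 -0400",
    "USE_METAL": "ON",
    "LLVM_VERSION": "15.0.7",
    "USE_CUDA": "OFF",
}
for key, value in summarize_libinfo(sample).items():
    print(f"{key}: {value}")
```

With TVM installed, passing `tvm.support.libinfo()` instead of `sample` should give the same four-line summary for the local build.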

Any other relevant information:

Additional context

MasterJH5574 commented 1 year ago

Hi @ziyu-guo, thanks for reporting. Would you mind double-checking your mlc-llm git commit hash and making sure it is the latest? This issue should have been fixed by https://github.com/mlc-ai/mlc-llm/commit/0ad413d1e706846abdfa0b6262b5bc4cf7add060, which was merged this past weekend. I ran your command on my machine and it works on my end.

When you run build.py with your command, you should see it print the following log:

> python3 build.py --hf-path bigcode/starcoderbase --quantization q3f16_0 --max-seq-len 768

Weights exist at dist/models/starcoderbase, skipping download.
Using path "dist/models/starcoderbase" for model "starcoderbase"
Database paths: ['log_db/rwkv-raven-3b', 'log_db/redpajama-3b-q4f16', 'log_db/redpajama-3b-q4f32', 'log_db/rwkv-raven-1b5', 'log_db/dolly-v2-3b', 'log_db/rwkv-raven-7b', 'log_db/vicuna-v1-7b']
[22:45:58] /Users/ruihang-macstudio/Workspace/tvm/src/runtime/metal/metal_device_api.mm:165: Intializing Metal device 0, name=Apple M1 Max
Host CPU dection:
  Target triple: arm64-apple-darwin22.3.0
  Process triple: arm64-apple-darwin22.3.0
  Host CPU: apple-m1
Target configured: metal -keys=metal,gpu -max_function_args=31 -max_num_threads=256 -max_shared_memory_per_block=32768 -max_threads_per_block=1024 -thread_warp_size=32
[22:46:01] /Users/ruihang-macstudio/Workspace/tvm/include/tvm/topi/transform.h:1076: Warning: Fast mode segfaults when there are out-of-bounds indices. Make sure input indices are in bound
[22:46:01] /Users/ruihang-macstudio/Workspace/tvm/include/tvm/topi/transform.h:1076: Warning: Fast mode segfaults when there are out-of-bounds indices. Make sure input indices are in bound
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Host CPU dection:
  Target triple: arm64-apple-darwin22.3.0
  Process triple: arm64-apple-darwin22.3.0
  Host CPU: apple-m1
Automatically using target for weight quantization: metal -keys=metal,gpu -max_function_args=31 -max_num_threads=256 -max_shared_memory_per_block=32768 -max_threads_per_block=1024 -thread_warp_size=32
Start computing and quantizing weights... This may take a while.
Finish computing and quantizing weights.
Total param size: 6.719532012939453 GB
Start storing to cache dist/starcoderbase-q3f16_0/params
[0647/0647] saving param_646
All finished, 151 total shards committed, record saved to dist/starcoderbase-q3f16_0/params/ndarray-cache.json
Save a cached module to dist/starcoderbase-q3f16_0/mod_cache_before_build_metal.pkl.
Finish exporting to dist/starcoderbase-q3f16_0/starcoderbase-q3f16_0-metal.so
Finish exporting chat config to dist/starcoderbase-q3f16_0/params/mlc-chat-config.json

After the fix https://github.com/mlc-ai/mlc-llm/commit/0ad413d1e706846abdfa0b6262b5bc4cf7add060, the warnings highlighted above with ^^^ are printed. Since the log you provided does not contain them, I suspect the git commit hash is the issue.
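
For anyone checking a log the same way, here is a minimal Python sketch (not part of mlc-llm) that scans a captured build.py log for the two hints discussed in this thread. The marker strings are copied from the logs above; the function name `diagnose_build_log` and the returned keys are made up for illustration. A missing warning is only a hint: it may mean an older checkout, or simply that a cached module was loaded and tracing was skipped.

```python
# Marker strings copied verbatim from the logs in this thread.
CACHE_MARKER = "Load cached module from"
WARNING_MARKER = "Fast mode segfaults when there are out-of-bounds indices"


def diagnose_build_log(log_text):
    """Report whether a cached module was reused and whether the post-fix warning appears."""
    return {
        "used_cached_module": CACHE_MARKER in log_text,
        "has_post_fix_warning": WARNING_MARKER in log_text,
    }


# One line from the failing log in the original report.
failing_log = (
    "Load cached module from dist/starcoderbase-q3f16_0/"
    "mod_cache_before_build_metal.pkl and skip tracing. "
    "You can use --use-cache=0 to retrace"
)
result = diagnose_build_log(failing_log)
if result["used_cached_module"]:
    print("A cached module was loaded; rerun with --use-cache=0 to retrace.")
if not result["has_post_fix_warning"]:
    print("Expected warning not found; the checkout or the cache may predate the fix.")
```

If the cached-module line is present, rerunning with `--use-cache=0` (the flag named in the log itself) forces a retrace before drawing conclusions about the commit.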

ziyu-guo commented 1 year ago

Hi @MasterJH5574, thanks for your prompt response. I re-ran the command with --use-cache=0 and can confirm the error message is gone. Thank you for confirming the fix!

Follow-up question: performance seems to be lacking for now. starcoder.cpp using ggml reaches ~7 tokens/s, whereas mlc-chat-cli showed only ~1 token/s. How do I generate the logs needed to speed this model up on the M1 GPU? (I have some experience tuning with Ansor, but I'm not familiar with TVM Unity and MetaSchedule.) Any pointers to tooling scripts or code excerpts that might help are welcome!

yzh119 commented 1 year ago

Hi @ziyu-guo, yes, the tuning logs for StarCoder are not in the repo. While it's possible to improve performance by tuning models with MetaSchedule, our vision is that users should not need to tune the models on their own.

The dlight feature, which generates high-performance GPU kernels without tuning, is coming soon. You can currently try it on the dlight branch; it is still in beta, and so far the q4f16_1 quantization mode has been verified.

MasterJH5574 commented 1 year ago

Hi @ziyu-guo, as @yzh119 mentioned, we are now focusing our efforts on dlight, which is expected to bring much better out-of-the-box performance once released.

I tried dlight on StarCoder just now, but unfortunately dlight has a bug on StarCoder that makes the model output meaningless content. I will look into it shortly, and hopefully we can resolve it soon :-)

MasterJH5574 commented 1 year ago

After checking, I can confirm that the q4f16_1 quantization does not have the correctness issue.

So @ziyu-guo you can try

python3 build.py --model starcoderbase --quantization q4f16_1
mlc_chat_cli --local-id starcoderbase-q4f16_1

on branch dlight. It works smoothly on my side.

Meanwhile I will still take a look into the q4f16_0 correctness issue.
