mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Pascal cards lost cublas_gemm support in recent builds #1549

Closed alphaarea closed 5 months ago

alphaarea commented 9 months ago

πŸ› Bug

I run mlc-llm on a server with a Tesla P100. The last nightly build of mlc-llm I installed with cublas_gemm support was from 12/24/2023.

Every nightly I have installed since the start of 2024 has lost cublas_gemm support.

The TVM Unity hash tags of both builds are listed under Environment at the end.
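
For context, here is a minimal check (a sketch, assuming the TVM Unity package bundled with the mlc-ai nightly wheel) confirming that the card TVM sees is a Pascal, i.e. sm_60, device:

import tvm

dev = tvm.cuda(0)
print(dev.device_name)      # e.g. "Tesla P100-PCIE-16GB"
print(dev.compute_version)  # Pascal cards report "6.0", i.e. sm_60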

To Reproduce

conda create -n mlc-llm-test python=3.11
conda activate mlc-llm-test
conda install nvidia/label/cuda-12.1.1::cuda
python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu121 mlc-ai-nightly-cu121

mlc_chat compile --device cuda --opt="cublas_gemm=1;cudagraph=1" --output ./deepseek-llm-67b-chat-q4f16_1/deepseek-llm-67b-chat-cuda.so ./deepseek-llm-67b-chat-q4f16_1/
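
As a quick sanity check on the installed wheel itself, one can also test whether the cuBLAS packed function is registered (a sketch; the global function name is my assumption about how TVM Unity registers its cuBLAS contrib kernel):

import tvm

# "tvm.contrib.cublas.matmul" should only be registered when TVM was built with
# cuBLAS support; allow_missing=True returns None instead of raising when absent.
func = tvm.get_global_func("tvm.contrib.cublas.matmul", allow_missing=True)
print("cuBLAS kernel registered:", func is not None)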

Check the compilation parameters in the log.

Expected behavior

Compiling with arguments:
  --config          LlamaConfig(hidden_size=8192, intermediate_size=22016, num_attention_heads=64, num_hidden_layers=95, rms_norm_eps=1e-06, vocab_size=102400, position_embedding_base=10000.0, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=4, max_batch_size=1, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
  --model-type      llama
  --target          {"thread_warp_size": 32, "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "broadwell", "keys": ["cpu"]}, "arch": "sm_60", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
  --opt             flashinfer=0;cublas_gemm=0;cudagraph=1
  --system-lib-prefix ""
  --output          deepseek-llm-67b-chat-q4f16_1/deepseek-llm-67b-chat-cuda.so
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None

Notice: --opt was resolved to flashinfer=0;cublas_gemm=0;cudagraph=1, even though cublas_gemm=1 was requested.

Performance without cublas_gemm:

>>> from mlc_chat import ChatModule
>>> cm = ChatModule(model="./deepseek-llm-67b-chat-q4f16_1", model_lib_path="./deepseek-llm-67b-chat-q4f16_1/deepseek-llm-67b-chat-cuda.so")  # assumed instantiation; paths as in the compile step above
>>> output = cm.benchmark_generate("What's the meaning of life?", generate_length=256)
>>> cm.stats()
'prefill: 35.9 tok/s, decode: 19.2 tok/s'

For comparison, the following is the compile log and benchmark of the older build that still enables cublas_gemm:

Compiling with arguments:
  --config          LlamaConfig(hidden_size=8192, intermediate_size=22016, num_attention_heads=64, num_hidden_layers=95, rms_norm_eps=1e-06, vocab_size=102400, position_embedding_base=10000.0, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=4, kwargs={'model_type': 'llama', 'quantization': 'q4f16_1', 'model_config': {'hidden_size': 8192, 'intermediate_size': 22016, 'num_attention_heads': 64, 'num_hidden_layers': 95, 'rms_norm_eps': 1e-06, 'vocab_size': 102400, 'position_embedding_base': 10000.0, 'context_window_size': 4096, 'prefill_chunk_size': 4096, 'num_key_value_heads': 8, 'head_dim': 128, 'tensor_parallel_shards': 4}, 'sliding_window_size': -1, 'attention_sink_size': -1, 'mean_gen_len': 128, 'max_gen_len': 512, 'shift_fill_factor': 0.3, 'temperature': 0.7, 'repetition_penalty': 1.0, 'top_p': 0.95, 'conv_template': 'gpt2', 'pad_token_id': 0, 'bos_token_id': 100000, 'eos_token_id': 100001, 'tokenizer_files': ['tokenizer.json', 'tokenizer_config.json'], 'version': '0.1.0'})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
  --model-type      llama
  --target          {"thread_warp_size": 32, "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "broadwell", "keys": ["cpu"]}, "arch": "sm_60", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
  --opt             flashinfer=0;cublas_gemm=1;cudagraph=1
  --system-lib-prefix ""
  --output          deepseek-llm-67b-chat-q4f16_1/deepseek-llm-67b-chat-cuda.so
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None

Performance with cublas_gemm:

>>> from mlc_chat import ChatModule
>>> cm = ChatModule(model="./deepseek-llm-67b-chat-q4f16_1", model_lib_path="./deepseek-llm-67b-chat-q4f16_1/deepseek-llm-67b-chat-cuda.so")  # assumed instantiation; paths as in the compile step above
>>> output = cm.benchmark_generate("What's the meaning of life?", generate_length=256)
>>> cm.stats()
'prefill: 36.3 tok/s, decode: 19.5 tok/s'

Environment

Installed on 12/24/2023, with cublas_gemm support:

USE_NVTX: OFF
USE_GTEST: AUTO
SUMMARIZE: OFF
USE_IOS_RPC: OFF
USE_MSC: OFF
USE_ETHOSU:
CUDA_VERSION: 12.1
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_THRUST: OFF
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM:
USE_OPENCL_GTEST: /path/to/opencl/gtest
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
MLIR_VERSION: NOT-FOUND
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_VITIS_AI: OFF
USE_MLIR: OFF
USE_RCCL: OFF
USE_LLVM: llvm-config --ignore-libllvm --link-static
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_NCCL: ON
USE_ROCBLAS: OFF
GIT_COMMIT_HASH: 457f5bc4c94604bbb275465cb64f951f2ecdb3f4
USE_VULKAN: ON
USE_RUST_EXT: OFF
USE_CUTLASS: ON
USE_CPP_RPC: OFF
USE_HEXAGON: OFF
USE_CUSTOM_LOGGING: OFF
USE_UMA: OFF
USE_FALLBACK_STL_MAP: OFF
USE_SORT: ON
USE_RTTI: ON
GIT_COMMIT_TIME: 2023-12-21 10:31:30 -0500
USE_HEXAGON_SDK: /path/to/sdk
USE_BLAS: none
USE_ETHOSN: OFF
USE_LIBTORCH: OFF
USE_RANDOM: ON
USE_CUDA: ON
USE_COREML: OFF
USE_AMX: OFF
BUILD_STATIC_RUNTIME: OFF
USE_CMSISNN: OFF
USE_KHRONOS_SPIRV: OFF
USE_CLML_GRAPH_EXECUTOR: OFF
USE_TFLITE: OFF
USE_HEXAGON_GTEST: /path/to/hexagon/gtest
PICOJSON_PATH: 3rdparty/picojson
USE_OPENCL_ENABLE_HOST_PTR: OFF
INSTALL_DEV: OFF
USE_PROFILER: ON
USE_NNPACK: OFF
LLVM_VERSION: 15.0.7
USE_OPENCL: OFF
COMPILER_RT_PATH: 3rdparty/compiler-rt
RANG_PATH: 3rdparty/rang/include
USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
USE_OPENMP: OFF
USE_BNNS: OFF
USE_CUBLAS: OFF
USE_METAL: OFF
USE_MICRO_STANDALONE_RUNTIME: OFF
USE_HEXAGON_EXTERNAL_LIBS: OFF
USE_ALTERNATIVE_LINKER: AUTO
USE_BYODT_POSIT: OFF
USE_HEXAGON_RPC: OFF
USE_MICRO: OFF
DMLC_PATH: 3rdparty/dmlc-core/include
INDEX_DEFAULT_I64: ON
USE_RELAY_DEBUG: OFF
USE_RPC: ON
USE_TENSORFLOW_PATH: none
TVM_CLML_VERSION:
USE_MIOPEN: OFF
USE_ROCM: OFF
USE_PAPI: OFF
USE_CURAND: OFF
TVM_CXX_COMPILER_PATH: /opt/rh/gcc-toolset-11/root/usr/bin/c++
HIDE_PRIVATE_SYMBOLS: ON

Installed on 1/6/2024, without cublas_gemm support:

USE_NVTX: OFF
USE_GTEST: AUTO
SUMMARIZE: OFF
USE_IOS_RPC: OFF
USE_MSC: OFF
USE_ETHOSU:
CUDA_VERSION: 12.1
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_THRUST: OFF
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM:
USE_OPENCL_GTEST: /path/to/opencl/gtest
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
MLIR_VERSION: NOT-FOUND
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_VITIS_AI: OFF
USE_MLIR: OFF
USE_RCCL: OFF
USE_LLVM: llvm-config --ignore-libllvm --link-static
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_NCCL: ON
USE_ROCBLAS: OFF
GIT_COMMIT_HASH: 30c129ef4d43f40c02835a7fa5b485804dbae2c1
USE_VULKAN: ON
USE_RUST_EXT: OFF
USE_CUTLASS: ON
USE_CPP_RPC: OFF
USE_HEXAGON: OFF
USE_CUSTOM_LOGGING: OFF
USE_UMA: OFF
USE_FALLBACK_STL_MAP: OFF
USE_SORT: ON
USE_RTTI: ON
GIT_COMMIT_TIME: 2024-01-01 23:14:28 -0800
USE_HEXAGON_SDK: /path/to/sdk
USE_BLAS: none
USE_ETHOSN: OFF
USE_LIBTORCH: OFF
USE_RANDOM: ON
USE_CUDA: ON
USE_COREML: OFF
USE_AMX: OFF
BUILD_STATIC_RUNTIME: OFF
USE_CMSISNN: OFF
USE_KHRONOS_SPIRV: OFF
USE_CLML_GRAPH_EXECUTOR: OFF
USE_TFLITE: OFF
USE_HEXAGON_GTEST: /path/to/hexagon/gtest
PICOJSON_PATH: 3rdparty/picojson
USE_OPENCL_ENABLE_HOST_PTR: OFF
INSTALL_DEV: OFF
USE_PROFILER: ON
USE_NNPACK: OFF
LLVM_VERSION: 15.0.7
USE_OPENCL: OFF
COMPILER_RT_PATH: 3rdparty/compiler-rt
RANG_PATH: 3rdparty/rang/include
USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
USE_OPENMP: OFF
USE_BNNS: OFF
USE_CUBLAS: OFF
USE_METAL: OFF
USE_MICRO_STANDALONE_RUNTIME: OFF
USE_HEXAGON_EXTERNAL_LIBS: OFF
USE_ALTERNATIVE_LINKER: AUTO
USE_BYODT_POSIT: OFF
USE_HEXAGON_RPC: OFF
USE_MICRO: OFF
DMLC_PATH: 3rdparty/dmlc-core/include
INDEX_DEFAULT_I64: ON
USE_RELAY_DEBUG: OFF
USE_RPC: ON
USE_TENSORFLOW_PATH: none
TVM_CLML_VERSION:
USE_MIOPEN: OFF
USE_ROCM: OFF
USE_PAPI: OFF
USE_CURAND: OFF
TVM_CXX_COMPILER_PATH: /opt/rh/gcc-toolset-11/root/usr/bin/c++
HIDE_PRIVATE_SYMBOLS: ON
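
For reference, build-flag dumps like the two above can be regenerated from an installed wheel (a sketch, assuming the bundled TVM exposes tvm.support.libinfo()):

import tvm

# Print the same USE_* build options as the dumps above, one per line.
for key, value in tvm.support.libinfo().items():
    print(f"{key}: {value}")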

tqchen commented 5 months ago

Because of the nightly wheel size limit, we had to restrict the CUDA architectures built into the nightlies to align with the latest ones. Building from source might help in this situation. Closing this for now; feel free to open new issues.