vlbosch opened this issue 5 days ago
Thank you @vlbosch. We also ran into this and got it fixed in #2906. The nightly packages are being built and will be ready in a few hours. I'll report back when the nightly build is done.
Hi @vlbosch, the nightly wheel has been updated. Could you please try upgrading?
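For reference, on macOS the nightly upgrade typically looks like `python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly`; exact package names vary by platform, so treat this as a sketch rather than the canonical command.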
@MasterJH5574 Thanks for the quick response! I just updated to the latest nightly and retried. The small_draft mode does work now; however, running with the small draft is slower than running Mistral Large alone. I thought the baseline would be the regular speed of the large model? Or does that only hold for the other speculative modes like eagle and medusa?
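To make the comparison concrete, a minimal throughput check against the server's OpenAI-compatible endpoint could look like the sketch below. It assumes the server from the repro command is listening on port 9999 and returns OpenAI-style `usage` counters; the URL and model id are placeholders for this setup, not values confirmed in the thread.

```python
import time
import requests

# Placeholder values matching the repro setup; adjust to your server.
URL = "http://127.0.0.1:9999/v1/chat/completions"
MODEL = "Mistral-Large-Instruct-2407-MLC"  # hypothetical model id

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    "max_tokens": 256,
    "stream": False,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.time() - start
resp.raise_for_status()

body = resp.json()
# Assumes the server reports OpenAI-style token usage in the response.
completion_tokens = body["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"-> {completion_tokens / elapsed:.1f} tok/s")
```

Running it once with `--speculative-mode small_draft` and once with the large model alone should show whether drafting actually helps.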
🐛 Bug
I tried to use Mistral 7B Instruct v0.3 as a draft model for Mistral Large 2407. When not served with `--mode server`, the model(s) never respond; I suspect that is because only the CPU is used instead of the GPU. When serving with `--mode server`, I see the first token streamed in the frontend, but then I get the following error: `Check failed: (!mstates[i]->draft_output_tokens.empty()) is false`.
To Reproduce
Steps to reproduce the behavior:
```
USER@MBPM3MVLB ~ % python -m mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC --additional-models "HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC" --speculative-mode small_draft --port 9999 --device metal --mode server
[2024-09-16 08:50:13] INFO auto_device.py:79: Found device: metal:0
[2024-09-16 08:50:13] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO jit.py:158: Using cached model lib: /Users/USER/.cache/mlc_llm/model_lib/3826dfed383847636248c8e5e540102b.dylib
[2024-09-16 08:50:13] INFO download_cache.py:227: Downloading model from HuggingFace: HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC
[2024-09-16 08:50:13] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO download_cache.py:166: Weights already downloaded: /Users/USER/.cache/mlc_llm/model_weights/hf/mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC
[2024-09-16 08:50:13] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO jit.py:158: Using cached model lib: /Users/USER/.cache/mlc_llm/model_lib/7bbcaf068957bbf173dbd8ad644faea6.dylib
[2024-09-16 08:50:13] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-09-16 08:50:13] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-09-16 08:50:13] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "server". So max batch size is 80, max KV cache token capacity is 32768, prefill chunk size is 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 86697.674 MB (Parameters: 69664.656 MB. KVCache: 15602.123 MB. Temporary buffer: 1430.894 MB). The actual usage might be slightly larger than the estimated number.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/engine.cc:365: Warning: Hybrid prefill mode fallbacks to chunked prefill, due to speculative mode is enabled and not implemented with hybrid prefill yet.
INFO:     Started server process [69315]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:9999 (Press CTRL+C to quit)
INFO:     127.0.0.1:58406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
libc++abi: terminating due to uncaught exception of type tvm::runtime::InternalError: [08:50:41] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/engine_actions/batch_draft.cc:151: InternalError: Check failed: (!mstates[i]->draft_output_tokens.empty()) is false:
Stack trace:

zsh: abort      python -m mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC
```
Expected behavior
The model streams a response to the provided prompt.
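For completeness, the request that triggers the crash is an ordinary OpenAI-style streaming chat completion. A minimal client sketch, assuming the server from the repro command on port 9999 and a placeholder model id, is:

```python
import json
import requests

# Placeholder values matching the repro setup; adjust as needed.
URL = "http://127.0.0.1:9999/v1/chat/completions"
payload = {
    "model": "Mistral-Large-Instruct-2407-MLC",  # hypothetical model id
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}

with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    # Parse the OpenAI-style server-sent events, chunk by chunk.
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        delta = json.loads(chunk)["choices"][0]["delta"]
        print(delta.get("content") or "", end="", flush=True)
```

On the broken build, only the first chunk arrives before the server aborts with the InternalError shown above.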
Environment

- How you installed MLC-LLM (`conda`, source): conda with pip install
- How you installed TVM-Unity (`pip`, source): pip
- TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models):

```
USE_NVTX: OFF
USE_GTEST: AUTO
SUMMARIZE: OFF
TVM_DEBUG_WITH_ABI_CHANGE: OFF
USE_IOS_RPC: OFF
USE_MSC: OFF
USE_ETHOSU:
CUDA_VERSION: NOT-FOUND
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_THRUST: OFF
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM:
USE_OPENCL_GTEST: /path/to/opencl/gtest
TVM_LOG_BEFORE_THROW: OFF
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
MLIR_VERSION: NOT-FOUND
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_MSCCL: OFF
USE_VITIS_AI: OFF
USE_MLIR: OFF
USE_RCCL: OFF
USE_LLVM: llvm-config --link-static
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_NCCL: OFF
USE_ROCBLAS: OFF
GIT_COMMIT_HASH: 2685d6ace64c30a077c1b3f6893d2e38589be7bb
USE_VULKAN: OFF
USE_RUST_EXT: OFF
USE_CUTLASS: OFF
USE_CPP_RPC: OFF
USE_HEXAGON: OFF
USE_CUSTOM_LOGGING: OFF
USE_UMA: OFF
USE_FALLBACK_STL_MAP: OFF
USE_SORT: ON
USE_RTTI: ON
GIT_COMMIT_TIME: 2024-09-07 15:18:06 -0400
USE_HIPBLAS: OFF
USE_HEXAGON_SDK: /path/to/sdk
USE_BLAS: none
USE_ETHOSN: OFF
USE_LIBTORCH: OFF
USE_RANDOM: ON
USE_CUDA: OFF
USE_COREML: OFF
USE_AMX: OFF
BUILD_STATIC_RUNTIME: OFF
USE_CMSISNN: OFF
USE_KHRONOS_SPIRV: OFF
USE_CLML_GRAPH_EXECUTOR: OFF
USE_TFLITE: OFF
USE_HEXAGON_GTEST: /path/to/hexagon/gtest
PICOJSON_PATH: 3rdparty/picojson
USE_OPENCL_ENABLE_HOST_PTR: OFF
INSTALL_DEV: OFF
USE_PROFILER: ON
USE_NNPACK: OFF
LLVM_VERSION: 17.0.1
USE_MRVL: OFF
USE_OPENCL: OFF
COMPILER_RT_PATH: 3rdparty/compiler-rt
RANG_PATH: 3rdparty/rang/include
USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
USE_OPENMP: OFF
USE_BNNS: OFF
USE_FLASHINFER:
USE_CUBLAS: OFF
USE_METAL: ON
USE_MICRO_STANDALONE_RUNTIME: OFF
USE_HEXAGON_EXTERNAL_LIBS: OFF
USE_ALTERNATIVE_LINKER: AUTO
USE_BYODT_POSIT: OFF
USE_NVSHMEM: OFF
USE_HEXAGON_RPC: OFF
USE_MICRO: OFF
DMLC_PATH: 3rdparty/dmlc-core/include
INDEX_DEFAULT_I64: ON
USE_RELAY_DEBUG: OFF
USE_RPC: ON
USE_TENSORFLOW_PATH: none
TVM_CLML_VERSION:
USE_MIOPEN: OFF
USE_ROCM: OFF
USE_PAPI: OFF
USE_CURAND: OFF
TVM_CXX_COMPILER_PATH: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
HIDE_PRIVATE_SYMBOLS: ON
```

Additional context
Both models work fine separately.
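For reference, each model can be sanity-checked on its own with something like `python -m mlc_llm serve <model-path> --device metal --mode server` (no `--additional-models` flag); `<model-path>` here is a placeholder for either model directory.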