🐛 Bug
When serving Llama Instruct models with MLC LLM, the smaller 1B and 3B models do not show lower TTFT or higher tok/sec than the 8B model, whereas the same comparison with SGLang behaves as expected.
To Reproduce
Steps to reproduce the behavior:
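For context, the comparison amounts to timing streaming chat-completion requests against each server and summarizing the per-run numbers. The sketch below is only illustrative, not the harness that produced the numbers in this report: the endpoint path, ports, model identifiers, prompt, `max_tokens`, and the chunk-per-token approximation are all assumptions, and the MLC side is assumed to be served through its OpenAI-compatible REST server (e.g. `mlc_llm serve`) on a separate port, since that launch command is not shown here.

```python
import json
import time

import numpy as np
import requests

RUNS = 50                    # 50 runs per setting, as reported below
PROMPT = "Explain prefix caching in two sentences."   # illustrative prompt

def one_run(url: str, model: str) -> tuple[float, float]:
    """Return (TTFT in seconds, decode tok/s) for one streaming chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 128,   # illustrative generation length
        "stream": True,
    }
    start = time.perf_counter()
    ttft, chunks = None, 0
    with requests.post(url, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            choices = json.loads(data).get("choices") or []
            if choices and choices[0].get("delta", {}).get("content"):
                if ttft is None:
                    ttft = time.perf_counter() - start
                chunks += 1  # treat each streamed content chunk as ~1 token
    total = time.perf_counter() - start
    assert ttft is not None, "no content received"
    return ttft, chunks / max(total - ttft, 1e-9)

# Assumed ports / model IDs -- adjust to whatever each server was actually launched with.
targets = {
    "SGLang 1B": ("http://localhost:8000/v1/chat/completions",
                  "meta-llama/Llama-3.2-1B-Instruct"),
    # "MLC 1B":  ("http://localhost:8001/v1/chat/completions", "<MLC model ID>"),
}
for name, (url, model) in targets.items():
    ttfts, speeds = zip(*(one_run(url, model) for _ in range(RUNS)))
    print(f"{name}: p95 TTFT {np.percentile(ttfts, 95) * 1000:.0f} ms, "
          f"p95 decode {np.percentile(speeds, 95):.1f} tok/s")
```

TTFT here is the time to the first streamed content chunk and decode speed is measured from that point onward; counting chunks as tokens is approximate but adequate for a relative MLC vs SGLang comparison.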
Expected behavior
The 1B and 3B models should have lower TTFT and higher tok/sec than the 8B model, but that is not the case:
MLC 8B:
MLC 3B:
MLC 1B:
These are 95th-percentile numbers over 50 runs for each setting. When running the same benchmark against https://github.com/sgl-project/sglang, we get the expected results:
SGLang 8B:
SGLang 3B:
SGLang 1B:
SGLang server was run using:
python -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct/ --port 8000
so both are unquantized, fp16 with prefix-caching enabled.

Environment
Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
CUDA/cuDNN version (if applicable): 12.4
How you installed MLC-LLM (conda, source): pip
How you installed TVM-Unity (pip, source): pip
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): N/A

Additional context