mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Llama 3.2 3B and 1B on MLC are significantly slower than Llama 3.1 8B (L40s, fp16) #2997

Open chrisreese-if opened 3 weeks ago

chrisreese-if commented 3 weeks ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Download the weights for Llama 3.2 1B and 3B from Hugging Face: https://huggingface.co/mlc-ai/Llama-3.2-1B-Instruct-q0f16-MLC and https://huggingface.co/mlc-ai/Llama-3.2-3B-Instruct-q0f16-MLC
  2. Start the MLC server in server mode, call it with the OpenAI client, and measure TTFT and decode tok/s for the 8B, 3B, and 1B models.
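Step 2 above can be sketched as below. This is a hypothetical harness, not the reporter's actual script: the server launch command, base URL, prompt, and `max_tokens` are assumptions. It takes TTFT and decode rate from a streamed chat completion issued through the `openai` client, and aggregates with a nearest-rank 95th percentile.

```python
# Hypothetical benchmark sketch (not the reporter's script).
# Assumes an OpenAI-compatible server is already running locally, e.g.:
#   mlc_llm serve HF://mlc-ai/Llama-3.2-1B-Instruct-q0f16-MLC --mode server
import time


def measure_once(client, model: str, prompt: str) -> tuple[float, float]:
    """Return (TTFT in seconds, decode tok/s) for one streamed request.

    `client` is an openai.OpenAI instance pointed at the server, e.g.
    OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none").
    """
    start = time.perf_counter()
    first = None  # time of the first generated token
    n = 0         # number of streamed content chunks (~tokens)
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()
            n += 1
    end = time.perf_counter()
    ttft = first - start
    # Decode rate excludes the first token, which is dominated by prefill.
    decode_tps = (n - 1) / (end - first) if n > 1 else 0.0
    return ttft, decode_tps


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile, matching the aggregation over the runs."""
    ordered = sorted(samples)
    return ordered[max(0, round(0.95 * len(ordered)) - 1)]
```

Repeating `measure_once` for each model and feeding the TTFT and decode-rate samples through `p95` reproduces the kind of numbers shown below.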

Expected behavior

1B and 3B should have lower TTFT and higher tok/s than 8B, but that is not the case:

MLC 8B: [screenshot]

MLC 3B: [screenshot]

MLC 1B: [screenshot]

These are 95th-percentile numbers over 50 runs at each setting. Running the same benchmark against https://github.com/sgl-project/sglang gives the expected results:

SGLang 8B: [screenshot]

SGLang 3B: [screenshot]

SGLang 1B: [screenshot]

The SGLang server was run with `python -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct/ --port 8000`, so both servers serve unquantized fp16 weights with prefix caching enabled.

Environment

Additional context

MasterJH5574 commented 3 weeks ago

Thank you @chrisreese-if for bringing this up! We will look into this and try to understand the reasons behind it.