🐛 Bug
When serving Llama Instruct models with MLC LLM, the smaller 1B and 3B models do not show lower TTFT or higher tok/sec than the 8B model, whereas the same comparison with SGLang behaves as expected.
To Reproduce
Steps to reproduce the behavior:
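For context, the comparison amounts to timing streaming chat-completion requests against each server and summarizing the per-run numbers. The sketch below is only illustrative, not the harness that produced the numbers in this report: the endpoint path, ports, model identifiers, prompt, `max_tokens`, and the chunk-per-token approximation are all assumptions, and the MLC side is assumed to be served through its OpenAI-compatible REST server (e.g. `mlc_llm serve`) on a separate port, since that launch command is not shown here.

```python
import json
import time

import numpy as np
import requests

RUNS = 50                    # 50 runs per setting, as reported below
PROMPT = "Explain prefix caching in two sentences."   # illustrative prompt

def one_run(url: str, model: str) -> tuple[float, float]:
    """Return (TTFT in seconds, decode tok/s) for one streaming chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 128,   # illustrative generation length
        "stream": True,
    }
    start = time.perf_counter()
    ttft, chunks = None, 0
    with requests.post(url, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            choices = json.loads(data).get("choices") or []
            if choices and choices[0].get("delta", {}).get("content"):
                if ttft is None:
                    ttft = time.perf_counter() - start
                chunks += 1  # treat each streamed content chunk as ~1 token
    total = time.perf_counter() - start
    assert ttft is not None, "no content received"
    return ttft, chunks / max(total - ttft, 1e-9)

# Assumed ports / model IDs -- adjust to whatever each server was actually launched with.
targets = {
    "SGLang 1B": ("http://localhost:8000/v1/chat/completions",
                  "meta-llama/Llama-3.2-1B-Instruct"),
    # "MLC 1B":  ("http://localhost:8001/v1/chat/completions", "<MLC model ID>"),
}
for name, (url, model) in targets.items():
    ttfts, speeds = zip(*(one_run(url, model) for _ in range(RUNS)))
    print(f"{name}: p95 TTFT {np.percentile(ttfts, 95) * 1000:.0f} ms, "
          f"p95 decode {np.percentile(speeds, 95):.1f} tok/s")
```

TTFT here is the time to the first streamed content chunk and decode speed is measured from that point onward; counting chunks as tokens is approximate but adequate for a relative MLC vs SGLang comparison.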
Expected behavior
The 1B and 3B models should have lower TTFT and higher tok/sec than the 8B model, but that is not the case:
MLC 8B:
MLC 3B:
MLC 1B:
These are 95th-percentile numbers over 50 runs for each setting. When running the same benchmark against https://github.com/sgl-project/sglang, we get the expected results:
SGLang 8B:
SGLang 3B:
SGLang 1B:
SGLang server was run using:
python -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct/ --port 8000
so both are unquantized, fp16 with prefix-caching enabled.

Environment
Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
CUDA/cuDNN version (if applicable): 12.4
How you installed MLC-LLM (conda, source): pip
How you installed TVM-Unity (pip, source): pip
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): N/A

Additional context