mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

Prefill rate degrades between the old and new MLC LLM stacks on Windows Vulkan #2204

Open zhongzhenluo opened 3 months ago

zhongzhenluo commented 3 months ago

The old MLC LLM version is based on the deprecated stack: the command is still mlc_chat_cli, and the .dll files are the prebuilt vulkan.dll binaries. The new MLC LLM version is based on the latest stack: the command is mlc_llm, and the .dll files are generated with mlc_llm compile.

Tested on a Windows 11 laptop (Vulkan), prompt size ~1k tokens:

- llama2-7b (old version): prefill ~160 tok/s, decode ~7.5 tok/s
- llama2-7b (latest version): prefill ~80 tok/s, decode ~13.3 tok/s
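[Editor's note: prefill throughput sets the time-to-first-token (TTFT) for a long prompt, which is why this regression is noticeable. A minimal sketch of the arithmetic using the numbers reported above; the helper function is purely illustrative:]

```cpp
#include <cstdio>

// Time-to-first-token: the full prompt must be prefilled before the
// first output token can be decoded.
double TimeToFirstTokenSec(double prompt_tokens, double prefill_tok_per_s) {
  return prompt_tokens / prefill_tok_per_s;
}

int main() {
  const double prompt = 1000.0;  // ~1k-token test prompt from this report
  std::printf("old stack: %.1f s to first token\n",
              TimeToFirstTokenSec(prompt, 160.0));  // ~6.3 s
  std::printf("new stack: %.1f s to first token\n",
              TimeToFirstTokenSec(prompt, 80.0));   // ~12.5 s
  return 0;
}
```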

So decode speed has improved, but why has prefill speed decreased? Has MLC LLM made any major changes to the tokenizer module? Or is the compilation process different?

Thanks

zhongzhenluo commented 2 months ago

This issue is observed on the Windows build, and the lagging function is TVMSynchronize(). When TVMSynchronize() is called in the prefill step, it takes much longer to return than when it is called in the decode phase.
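[Editor's note: TVMSynchronize() blocks until all work previously queued on the device has completed, and Vulkan kernel launches are asynchronous, so the time measured "inside" the sync call is typically the queued prefill kernels draining rather than the cost of the call itself; a slow sync after prefill usually points at the prefill kernels, not the tokenizer or the sync path. A minimal timing sketch against the TVM C runtime API, assuming the TVM headers are available; the harness itself is hypothetical:]

```cpp
#include <chrono>
#include <cstdio>
#include <tvm/runtime/c_runtime_api.h>  // TVMSynchronize; pulls in dlpack's kDLVulkan

// Measure how long TVMSynchronize blocks. Since kernel launches are
// asynchronous, this duration is dominated by whatever GPU work was
// queued before the call, not by the sync call itself.
static double TimedSyncMs(int device_type, int device_id) {
  auto t0 = std::chrono::steady_clock::now();
  TVMSynchronize(device_type, device_id, /*stream=*/nullptr);
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical usage inside the engine, right after launching prefill or decode:
//   std::printf("sync after prefill: %.2f ms\n", TimedSyncMs(kDLVulkan, 0));
```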

zhongzhenluo commented 2 months ago

Actually, performance with the previous old prebuilt Vulkan DLL differs from that of a DLL we build from source ourselves. Do we know which date of the repo was used to create the previous prebuilt Vulkan DLL, around July/August 2023? I tested with the 07-29-2023 repo, and the prefill rate of the generated DLL is still worse than with the old prebuilt libs.