mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] How to optimize the scheduling of multimodal LLM model convolution in mlc. #2646

Open shifeiwen opened 1 month ago

shifeiwen commented 1 month ago

❓ General Questions

Hi all, I have recently been trying to port Microsoft's Florence-2-large model to MLC. It seems to run initially, but I have a problem. Multimodal LLM models usually contain convolution layers. When I compiled a CUDA .so and ran it on a 3090, the vision model took only about 300 ms, but it took around 20 s on my 8 Gen 3 phone. I understand part of this may be due to the difference in computing power, but when I captured the OpenCL kernels I found that the first convolution alone took about 200 ms, which is clearly not normal. MLC depends on TVM, and it seems TVM is not producing an efficient OpenCL kernel for the conv operations. Can anyone give me some advice? I suspect the LLaVA model currently supported by MLC has the same problem: the image processing is too slow.

Kernel timing message: fused_conv2d_add8 175222 us
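For anyone trying to reproduce this kind of measurement, below is a minimal sketch of timing a single compiled kernel on the phone through TVM's RPC infrastructure. The tracker address, device key, library file name, and tensor shapes/dtypes are placeholders and assumptions, not values taken from this issue; it also assumes the model was compiled to a loadable shared library rather than a system-lib bundle.

```python
import numpy as np
import tvm
from tvm import rpc

# Connect to an RPC tracker the Android device is registered with
# (host/port/key are assumptions; adjust to your own setup).
tracker = rpc.connect_tracker("127.0.0.1", 9190)
remote = tracker.request("android", priority=1, session_timeout=600)
dev = remote.cl(0)  # OpenCL device on the phone

# Upload and load the compiled library (file name is a placeholder).
remote.upload("vision_model_opencl.so")
lib = remote.load_module("vision_model_opencl.so")

# Time one kernel in isolation. The argument shapes/dtypes below are
# hypothetical; they must match the actual signature of the compiled PrimFunc.
data = tvm.nd.array(np.random.rand(1, 3, 224, 224).astype("float16"), dev)
weight = tvm.nd.array(np.random.rand(1024, 3, 16, 16).astype("float16"), dev)
bias = tvm.nd.array(np.random.rand(1, 1024, 1, 1).astype("float16"), dev)
out = tvm.nd.empty((1, 1024, 14, 14), "float16", dev)

timer = lib.time_evaluator("fused_conv2d_add8", dev, number=10, repeat=3)
print(timer(data, weight, bias, out))
```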

shifeiwen commented 1 month ago

(screenshot of the kernel profile) This conv2d kernel on the 8 Gen 3 OpenCL backend takes about 2 seconds to run, which is extremely time-consuming.

Hzfengsy commented 1 month ago

It would be great if you could provide a copyable kernel script :)
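For reference, one way to get a copy-pastable version of a single kernel out of the compilation pipeline is to print its TVMScript from the IRModule before it is lowered to OpenCL. The module variable and function name below are assumptions (e.g. taken from a debug dump produced during mlc_llm compilation).

```python
# `mod` is the IRModule at the stage you want to inspect.
# List the global vars first if you are unsure of the kernel's name.
print(mod.get_global_vars())

# Print the whole module, or a single PrimFunc, as copyable TVMScript.
mod.show()
print(mod["fused_conv2d_add8"].script())
```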

shifeiwen commented 1 month ago

conv_kernel.txt
Hi @Hzfengsy, this is the kernel script from the compilation process. It includes debug step3.py and debug final.py, which show the optimization of the same operator at two different stages. Thank you.
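As a starting point for experimenting with the schedule, one option is to re-apply TVM's dlight GPU scheduling rules to the conv2d PrimFunc under an Adreno OpenCL target and compare against the kernel mlc_llm currently emits. This is only a sketch: the target string, host triple, and module variable are assumptions, and the dlight fallback rule may or may not beat the existing schedule.

```python
import tvm
from tvm import dlight as dl

# Adreno OpenCL target for an 8 Gen 3 class GPU (host triple is an assumption).
target = tvm.target.Target(
    "opencl -device=adreno",
    host="llvm -mtriple=aarch64-linux-android",
)

# `mod` is an IRModule containing the conv2d PrimFunc (e.g. from a debug dump).
with target:
    scheduled = dl.ApplyDefaultSchedule(
        dl.gpu.Fallback(),
    )(mod)

scheduled.show()  # inspect the rescheduled kernel before building

# Build works when the module contains only TIR PrimFuncs.
rt_mod = tvm.build(scheduled, target=target)
```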

callmehanyu commented 1 month ago

@shifeiwen Could you write a tutorial on how to port Microsoft's Florence-2-base model to MLC? I'm a beginner; thanks a lot!

shifeiwen commented 1 month ago

@callmehanyu You can check out this notebook and follow the steps above; it is a good tutorial example: https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_add_new_model_architecture_in_tvm_nn_module.ipynb
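To give a flavor of what the notebook walks through, here is a rough sketch of how a vision patch-embedding convolution might be expressed with the relax nn.Module API when porting a vision encoder. The class, parameter names, and shapes are illustrative assumptions rather than the actual Florence-2 definition, and it assumes `nn.Conv2D` is available in your TVM build, as it is for MLC's existing vision models.

```python
from tvm.relax.frontend import nn

class VisionPatchEmbed(nn.Module):
    """Hypothetical patch embedding: a strided conv2d followed by flattening."""

    def __init__(self, in_channels=3, hidden_size=1024, patch_size=16):
        self.proj = nn.Conv2D(
            in_channels=in_channels,
            out_channels=hidden_size,
            kernel_size=patch_size,
            stride=patch_size,
            bias=True,
        )

    def forward(self, pixel_values: nn.Tensor) -> nn.Tensor:
        # (batch, channels, H, W) -> (batch, hidden, H/patch, W/patch)
        x = self.proj(pixel_values)
        # Flatten the spatial grid into a token sequence:
        # (batch, hidden, h, w) -> (batch, h*w, hidden)
        b, c, h, w = x.shape
        x = nn.op.reshape(x, (b, c, h * w))
        x = nn.op.permute_dims(x, (0, 2, 1))
        return x
```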