First, I used the tuning log database shipped in the repo, which gave me a result of 3.7 s. Then I tuned the model myself with meta-schedule (trial count set to 50,000), which brought it down to 2.5 s.
However, on TensorRT v8.6, one iteration of the UNet takes only 25 ms, versus 96 ms with TVM (USE_CUBLAS=ON, USE_CUDNN=ON, CUDA 12.1).
I wonder why the latency gap for the Stable Diffusion model is so huge between TVM and TensorRT. By the way, a few weeks ago I got a different result when comparing TVM and TRT: an in-house model auto-tuned by TVM achieved excellent inference latency, almost on par with TensorRT 8.5.
GPU: NVIDIA RTX 3090 Ti.
Do you have any ideas about this? Thanks in advance.