mit-han-lab / distrifuser

[CVPR 2024 Highlight] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
https://hanlab.mit.edu/projects/distrifusion
MIT License

How does cuda_graph accelerate the denoising? #22

Open LHQUer opened 2 months ago

LHQUer commented 2 months ago

Setting use_cuda_graph=True or False results in different inference speeds. Why? And how do the three CUDA graphs map onto the 50 denoising steps?

lmxyy commented 2 months ago

Without CUDA Graph, the communication kernel-launch overhead dominates the latency. The first graph is the warm-up graph, the second is for the iteration right after the warm-up phase, and the third covers all the remaining steps.
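For intuition, here is a minimal sketch of how the three graphs could be dispatched over 50 denoising steps. This is not the actual DistriFusion code; the warm-up length and graph names are placeholders for illustration.

```python
# Hypothetical sketch, not the actual DistriFusion code: dispatching three
# captured CUDA graphs over 50 denoising steps. `num_warmup` and the graph
# names are assumed placeholders.
def pick_graph(step: int, num_warmup: int = 4) -> str:
    if step < num_warmup:
        return "warmup_graph"       # first graph: warm-up iterations
    if step == num_warmup:
        return "post_warmup_graph"  # second graph: iteration right after warm-up
    return "steady_state_graph"     # third graph: all remaining iterations

schedule = [pick_graph(s) for s in range(50)]
```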

LHQUer commented 2 months ago

> Without CUDA Graph, the communication kernel-launch overhead dominates the latency. The first graph is the warm-up graph, the second is for the iteration right after the warm-up phase, and the third covers all the remaining steps.

However, according to the official introduction and usage of CUDA Graphs, a graph is bound to its inputs at capture time, which means that to reuse a previously captured graph in later steps you cannot change its input tensors. Yet in the subsequent denoising steps (roughly steps 6-50), before calling the forward function for the current step we need values such as the `sample` from the previous timestep. Doesn't the change in the values of variables such as `sample` conflict with the input invariance of CUDA Graphs?

LHQUer commented 2 months ago

> Without CUDA Graph, the communication kernel-launch overhead dominates the latency. The first graph is the warm-up graph, the second is for the iteration right after the warm-up phase, and the third covers all the remaining steps.

And to quote an introduction found online: "CUDA Graph is suitable for repeatedly running identical computations on unchanged inputs, so as to accelerate the computing task."

lmxyy commented 1 month ago

> Without CUDA Graph, the communication kernel-launch overhead dominates the latency. The first graph is the warm-up graph, the second is for the iteration right after the warm-up phase, and the third covers all the remaining steps.

> However, according to the official introduction and usage of CUDA Graphs, a graph is bound to its inputs at capture time, which means that to reuse a previously captured graph in later steps you cannot change its input tensors. Yet in the subsequent denoising steps (roughly steps 6-50), before calling the forward function for the current step we need values such as the `sample` from the previous timestep. Doesn't the change in the values of variables such as `sample` conflict with the input invariance of CUDA Graphs?

Yes, the graph is bound to the input captured at creation time. However, during inference I copy each step's new input into that bound (static) input, as shown in distrifuser/models/distri_sdxl_unet_pp.py.
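For reference, here is a minimal sketch of that static-input pattern using the standard torch.cuda.CUDAGraph API. It is not the actual code in distrifuser/models/distri_sdxl_unet_pp.py, and the toy Linear model only stands in for the UNet forward pass.

```python
import torch

# Minimal sketch of the static-input CUDA Graph recipe (assumed, not the
# actual DistriFusion implementation). A Linear layer stands in for the UNet.
device = torch.device("cuda")
model = torch.nn.Linear(64, 64).to(device).eval()

# Static buffer the captured graph is bound to (fixed address, mutable values).
static_input = torch.zeros(1, 64, device=device)

# Warm up on a side stream before capture, as required for graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass. The graph records the addresses of
# static_input/static_output, not their current values.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay loop: copy each denoising step's new input into the bound buffer,
# then replay the graph. The values change; the addresses do not.
for step in range(50):
    new_sample = torch.randn(1, 64, device=device)
    static_input.copy_(new_sample)
    graph.replay()
    result = static_output.clone()  # read out this step's output
```

So the "input invariance" only applies to the tensor addresses captured in the graph; overwriting their values with copy_ before each replay is exactly how new samples are fed in at every denoising step.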