LHQUer opened this issue 2 months ago
Without CUDA Graph, the communication kernel launch overhead will dominate the latency. We use three graphs: the first is for the warm-up steps, the second is for the iteration right after the warm-up phase, and the third covers all the remaining steps.
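As a rough sketch of how three captured graphs could map onto the denoising schedule (the warm-up length and graph names below are illustrative assumptions, not DistriFusion's actual identifiers):

```python
def pick_graph(step: int, num_warmup_steps: int = 4) -> str:
    """Select which captured CUDA graph to replay for a denoising step.

    Illustrative only: the warm-up length and graph names are assumptions.
    """
    if step < num_warmup_steps:
        return "warmup_graph"        # graph 1: warm-up iterations
    elif step == num_warmup_steps:
        return "post_warmup_graph"   # graph 2: first iteration after warm-up
    else:
        return "steady_state_graph"  # graph 3: all remaining iterations

# With 50 denoising steps, graph 3 handles the bulk of the schedule.
print([pick_graph(s) for s in (0, 4, 5, 49)])
```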
However, according to the official introduction and usage of CUDA Graph, a graph is bound to the inputs present at capture time, which means that in subsequent steps you cannot change the inputs if you want to replay the previously created graph. Yet in the later denoising steps (roughly steps 6-50), before calling the forward function for the current step we need to use values such as `sample` from the previous time step. Doesn't the change in variables such as `sample` conflict with the input invariance of CUDA Graph?
To quote an introduction from the internet: "CUDA Graph is suited to repeatedly launching the same computation with unchanged inputs, so as to accelerate the computing task."
Yes, the graph is bound to the input it was captured with. However, during inference I copy each new input into that bound (static) input buffer before replaying, as shown in distrifuser/models/distri_sdxl_unet_pp.py.
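The copy-into-static-buffer pattern described here can be sketched with PyTorch's public `torch.cuda.CUDAGraph` API (a minimal illustration, not the actual distrifuser code; it falls back to eager mode when no GPU is present):

```python
import torch

def capture_and_replay(fn, example_input, inputs):
    # Minimal sketch: capture `fn` once on a static buffer, then feed new
    # data by copying it into that buffer and replaying the graph.
    if not torch.cuda.is_available():
        # Eager fallback so the sketch stays runnable without a GPU.
        return [fn(x) for x in inputs]

    static_in = example_input.clone()
    # Warm up on a side stream (required before capture).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            fn(static_in)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = fn(static_in)  # output buffer is also fixed

    outs = []
    for x in inputs:
        static_in.copy_(x)  # overwrite the bound input in place
        g.replay()          # relaunch the captured kernels
        outs.append(static_out.clone())
    return outs
```

The key point is that the graph's input address never changes; only the data stored at that address does, which is why replaying with new `sample` values does not violate CUDA Graph's input invariance.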
Setting use_cuda_graph=True or False results in different inference speeds. Why? And how do the three CUDA graphs map onto the 50 denoising steps?