pytorch / ao

PyTorch native quantization and sparsity for training and inference
BSD 3-Clause "New" or "Revised" License
1.28k stars · 121 forks

Any plans for supporting more conv kernels? #483

Open ZhangZelin-ustc opened 3 months ago

ZhangZelin-ustc commented 3 months ago

To speed up SD models further, will more conv kernels be supported?

msaroufim commented 3 months ago

Hi @ZhangZelin-ustc could you be more specific? Any specific convolution variants you're looking for us to accelerate?

ZhangZelin-ustc commented 3 months ago

@msaroufim Of course. I saw there is a swap_conv2d_1x1_to_linear that handles some of the conv layers. But to fully speed up the SD family, isn't quantization of regular conv1d and conv2d (and even conv3d for SVD) in the U-Net necessary?
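
For context on why that swap works, here is an illustrative NumPy sketch (not the torchao implementation): a 1x1 conv over an NCHW tensor is just a matmul over the channel dimension, which is why swap_conv2d_1x1_to_linear lets existing int8 linear kernels cover those layers.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C_in, C_out, H, W = 2, 8, 16, 4, 4
x = rng.standard_normal((N, C_in, H, W))
w = rng.standard_normal((C_out, C_in))  # 1x1 kernel, squeezed to (C_out, C_in)

# Direct 1x1 convolution: for each spatial position, y[n, :, h, w] = w @ x[n, :, h, w]
y_conv = np.einsum("oc,nchw->nohw", w, x)

# Same result via a linear layer applied to the flattened spatial positions
x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C_in)            # (N*H*W, C_in)
y_lin = (x_flat @ w.T).reshape(N, H, W, C_out).transpose(0, 3, 1, 2)

assert np.allclose(y_conv, y_lin)
```

Convs with kernel size > 1 have no such direct rewrite (short of im2col), which is why they need a dedicated quantized kernel.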

gau-nernst commented 2 months ago

From what I know, SD models are compute-bound, so to speed them up, we probably need to use INT8/FP8 conv and/or sparsity (similar optimizations for SAM). Not sure if PyTorch core supports INT8/FP8 conv. Cutlass probably has some e.g. https://github.com/NVIDIA/cutlass/issues/978
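
A back-of-envelope arithmetic-intensity check supports the compute-bound claim; the layer shape below is an assumption picked for illustration, not a measurement of any specific model.

```python
# Assumed SDXL-sized 3x3 conv layer (illustrative shapes, fp32 tensors)
N, C_in, C_out, H, W, K = 1, 640, 640, 64, 64, 3

macs = N * C_out * H * W * C_in * K * K          # multiply-accumulates
flops = 2 * macs
bytes_moved = 4 * (N * C_in * H * W              # input
                   + C_out * C_in * K * K        # weights
                   + N * C_out * H * W)          # output

intensity = flops / bytes_moved                  # FLOPs per byte
print(f"arithmetic intensity = {intensity:.0f} FLOPs/byte")
```

This lands far above the few-dozen-FLOPs/byte ridge point of modern GPUs, i.e. the layer is limited by math throughput, so lower-precision INT8/FP8 math (rather than just smaller weights) is what would speed it up.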

supriyar commented 2 months ago

PyTorch core does have an int8 conv operator implementation (via cuDNN backend) here https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cudnn/Conv.cpp.

If there is interest, we could potentially hook this up to our quantize_ API call in torchao to see how it performs. @ZhangZelin-ustc do you have a specific use case in mind that you want to speedup with conv?

cc @jerryzh168

ZhangZelin-ustc commented 2 months ago

@supriyar I would be very grateful for that. Because of their widespread use, I think some U-Nets in the 🤗 diffusers library are great cases. For SDXL, unet_2d_condition accounts for the vast majority of the computation, while in SVD-xt, UNetSpatioTemporalConditionModel plays this role. A significant proportion of the conv layers appears in the downsample and upsample blocks of both U-Nets.

jerryzh168 commented 2 months ago

> @supriyar I would be very grateful for that. Because of their widespread use, I think some U-Nets in the 🤗 diffusers library are great cases. For SDXL, unet_2d_condition accounts for the vast majority of computations, while in SVD-xt, UNetSpatioTemporalConditionModel plays this role. A significant proportion of conv layers appear in downsample and upsample in both Unets.

Implementing a quantized conv kernel is not easy, I think. Is there any flavor of quantization you are interested in that already has proven accuracy results for your use cases? I'm talking about static/weight_only/dynamic, asymmetric/symmetric, bit width, per_tensor/per_channel for activations/weights, etc.
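
To make those knobs concrete, here is an illustrative NumPy sketch (not torchao code) of symmetric vs. asymmetric int8 quantization at per-tensor vs. per-channel granularity, applied to a conv weight of shape (C_out, C_in, kH, kW):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8, 3, 3)).astype(np.float32)

def quant_symmetric(t, axis=None):
    # scale derived from amax; zero point fixed at 0
    amax = np.abs(t).max() if axis is None else np.abs(t).max(axis=axis, keepdims=True)
    scale = amax / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

def quant_asymmetric(t):
    # per-tensor: scale and zero point from min/max, mapping onto [-128, 127]
    lo, hi = t.min(), t.max()
    scale = (hi - lo) / 255.0
    zp = np.round(-128 - lo / scale)
    q = np.clip(np.round(t / scale) + zp, -128, 127).astype(np.int8)
    return q, scale, zp

# per-tensor symmetric: one scale for the whole tensor
q_pt, s_pt = quant_symmetric(w)
# per-channel symmetric: one scale per output channel (axis 0)
q_pc, s_pc = quant_symmetric(w, axis=(1, 2, 3))

# per-channel scales track each filter's own range, so dequantization error shrinks
err_pt = np.abs(w - q_pt * s_pt).mean()
err_pc = np.abs(w - q_pc * s_pc).mean()
assert err_pc <= err_pt

# asymmetric round trip: dequantize with (q - zero_point) * scale
q_a, s_a, zp_a = quant_asymmetric(w)
w_hat = (q_a.astype(np.float32) - zp_a) * s_a
assert np.abs(w - w_hat).mean() < s_a  # round-off bounded by the scale
```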

ZhangZelin-ustc commented 2 months ago

> @supriyar I would be very grateful for that. Because of their widespread use, I think some U-Nets in the 🤗 diffusers library are great cases. For SDXL, unet_2d_condition accounts for the vast majority of computations, while in SVD-xt, UNetSpatioTemporalConditionModel plays this role. A significant proportion of conv layers appear in downsample and upsample in both Unets.
>
> implementing quantized conv kernel is not easy I think, is there any flavor of quantization that you are interested in, that already have proven accuracy results for your use cases? this is talking about: static/weight_only/dynamic, asymmetric/symmetric, bitwidth, per_tensor/per_channel for act/weight etc.

So far, I have tried using TRT-modelopt to quantize SD and SVD. It preserves accuracy very well. But it is partially closed-source, and all I can do is basically configure parameters on top of the interfaces it provides.

I used post-training static quantization (W8A8) to quantize the aforementioned models. Both weights and activations are 8-bit. Per-channel quantization is applied to the weights, and per-tensor quantization is applied to the activations. It uses calibration to obtain the scale and amax, so I believe it is asymmetric quantization.
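
An end-to-end NumPy sketch of that W8A8 scheme (per-channel weight scales, a per-tensor activation scale from calibration, integer matmul with int32 accumulation, then dequantization); this is an illustration of the arithmetic, not what TRT-modelopt or torchao actually runs:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((64, 32)).astype(np.float32)    # activations (positions, C_in)
w = rng.standard_normal((48, 32)).astype(np.float32)    # weights (C_out, C_in)

s_x = np.abs(x).max() / 127.0                           # per-tensor activation scale (calibrated amax)
s_w = np.abs(w).max(axis=1) / 127.0                     # per-channel weight scales, shape (C_out,)

qx = np.clip(np.round(x / s_x), -127, 127).astype(np.int8)
qw = np.clip(np.round(w / s_w[:, None]), -127, 127).astype(np.int8)

# integer matmul with int32 accumulation, then dequantize with the combined scales
acc = qx.astype(np.int32) @ qw.astype(np.int32).T       # (64, 48), int32
y_q = acc.astype(np.float32) * (s_x * s_w)              # broadcast per-channel scale

y_fp = x @ w.T
rel_err = np.abs(y_q - y_fp).mean() / np.abs(y_fp).mean()
assert rel_err < 0.05
```

The same pattern extends to a conv kernel: the inner dimension becomes C_in * kH * kW, and the per-channel scales still attach to the output channels.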

In fact, I configured it almost entirely according to the get_int8_config in this file.

gau-nernst commented 2 months ago

@ZhangZelin-ustc Do you have some end2end benchmark results? I'm curious to see what the expected speedup would be if we had INT8 Conv2d in torchao. The links you provided don't seem to include benchmark results.

ZhangZelin-ustc commented 2 months ago

@gau-nernst Sure. On an NVIDIA L20, following its README, the UNet of SDXL achieves a speedup of around 1.5x, while the speedup of the UNet in SVD drops to around 1.1x, and on our larger-scale models there is almost no acceleration at all.

We suspect that some of TRT's strategies introduce overhead that offsets the acceleration gained from quantization: looking at the figure below, we can confirm that the kernel-level speedup does match the theoretical value. However, since TRT is completely closed-source, there is not much more we can do.
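
A rough Amdahl-style model shows how fixed overhead can erase an otherwise healthy kernel speedup; the fractions below are made-up numbers for illustration, not measurements of TRT.

```python
def end_to_end_speedup(quantizable_frac, kernel_speedup, overhead_frac):
    """Overall speedup when only part of the runtime is quantizable and
    quant/dequant or format-conversion overhead is added on top."""
    new_time = (quantizable_frac / kernel_speedup   # accelerated portion
                + (1 - quantizable_frac)            # untouched portion
                + overhead_frac)                    # added overhead
    return 1.0 / new_time

# 2x kernel speedup on 80% of the runtime, no overhead: healthy gain
print(end_to_end_speedup(0.8, 2.0, 0.0))    # ~1.67x
# same kernels, but 25% of runtime added as overhead: the gain mostly vanishes
print(end_to_end_speedup(0.8, 2.0, 0.25))   # ~1.18x
```

This is consistent with kernel-level profiles matching theory while the end-to-end number stays near 1x.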

Note that these speedups come from quantizing both the linear and the conv layers.

The figure below compares a conv layer kernel in a standard SDXL before and after int8 quantization. [figure: kernel-level timing comparison]

supriyar commented 2 months ago

This might be relevant to the discussion here: the HF team has more details on quantizing a diffusion model using int8 dynamic quantization with torchao (which involved replacing pointwise convs with linear layers). The speedup is not as high as what you report from quantizing conv layers, though. https://huggingface.co/docs/diffusers/en/tutorials/fast_diffusion#dynamic-quantization cc @HDCharles

ZhangZelin-ustc commented 2 months ago

> this might be relevant to the discussion here - HF team has more details on quantizing diffusion model using int8 dynamic quantization using torchao (which involved replacing pointwise convs with linear layers). The speedup is not as high as what you mention by quantizing conv layers though. https://huggingface.co/docs/diffusers/en/tutorials/fast_diffusion#dynamic-quantization

In fact, I have tried it before. It relies on an older version of torchao, but not many changes are needed.

An interesting phenomenon: contrary to TRT, whose speedup decreases as model size increases, this approach's speedup increases with model size. SD shows negative acceleration, SDXL shows none, and SVD shows some speedup (around 1.2x, if I remember correctly).