Concerning `elastic` and all the affine transform kernels (`affine`, `perspective`, `rotate`), there are some very limited opportunities for optimization. Perhaps a couple of in-place ops in `elastic_transform` & `_perspective_grid` and a few optimizations in `_apply_grid_transform` (split of mask and img, bilinear fill estimation etc.), as sketched below. Also some minor fixes related to the input assertions. @vfdev-5 would you be OK to assess on your side whether it makes sense to do these or to leave the methods on `_FT` to avoid copy-pasting? Perhaps you have in mind other optimizations that I can't see that could affect performance?
Concerning `crop`, `erase`, `pad`, `resize`, `horizontal_flip` and `vertical_flip`, I don't see any further improvements other than the input assertions (see the sketch below). It might be worth having a look on your side, @pmeier and @vfdev-5, in case you see something I don't.
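For illustration, this is the kind of cheap, up-front input assertion meant above; the helper and function names are hypothetical and not the torchvision API.

```python
import torch

def _assert_image_tensor(img: torch.Tensor) -> None:
    # Hypothetical helper: validate once, cheaply, before the hot path runs.
    if not isinstance(img, torch.Tensor):
        raise TypeError(f"img should be a Tensor, got {type(img)}")
    if img.ndim < 2:
        raise ValueError(f"img should have at least 2 dimensions, got {img.ndim}")

def vertical_flip_sketch(img: torch.Tensor) -> torch.Tensor:
    _assert_image_tensor(img)
    return img.flip(-2)  # flip along the height dimension; no further checks in the kernel
```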
I did another deep dive into the ops in the second paragraph of https://github.com/pytorch/vision/issues/6818#issuecomment-1289154456 and I'm fairly confident that there is little we can do to improve on our side. The only two things I found are:

1. For padding modes `"edge"` and `"reflect"` we cast to `float32` and back. There is a long-standing issue on PyTorch core, pytorch/pytorch#40763, that reports this and is assigned to @vfdev-5.
2. We support `"symmetric"` padding in `F.pad`, but `torch.nn.functional.pad` doesn't. Thus, we have a custom implementation for it: https://github.com/pytorch/vision/blob/c84dbfad97251271a789b252a2a1a52c73f623ff/torchvision/transforms/functional_tensor.py#L330. Since it is written in Python, a possible speed-up would be to implement this padding mode in C++ on the PyTorch core side.
Fixing these, we would get speed-ups for the padding modes `"edge"`, `"reflect"`, and `"symmetric"`, but not for the default and ubiquitous `"constant"` padding mode. Skimming the repository, it seems there is only one place where we use non-`"constant"` padding. In there the image is guaranteed to be float and thus would not get any performance boost.
While I think both things mentioned above would be good to have in general, I don't think we should prioritize them.
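As an aside, and purely for illustration (this is not the torchvision implementation linked above), `"symmetric"` padding of the last two dimensions can be emulated in Python by concatenating flipped edge slices; the function name is hypothetical and large paddings are not handled.

```python
import torch

def pad_symmetric_sketch(img: torch.Tensor, padding) -> torch.Tensor:
    # padding = [left, right, top, bottom]; mirrors including the edge pixel.
    # Assumes each padding is <= the corresponding image size.
    left, right, top, bottom = padding
    # Pad width: take the first `left` / last `right` columns and flip them.
    parts = [img[..., :left].flip(-1), img, img[..., img.shape[-1] - right:].flip(-1)]
    img = torch.cat([p for p in parts if p.shape[-1] > 0], dim=-1)
    # Pad height the same way on the rows.
    parts = [img[..., :top, :].flip(-2), img, img[..., img.shape[-2] - bottom:, :].flip(-2)]
    return torch.cat([p for p in parts if p.shape[-2] > 0], dim=-2)

# Example: pad a 3x4x4 image by 2 pixels on every side.
padded = pad_symmetric_sketch(torch.arange(48).reshape(3, 4, 4), [2, 2, 2, 2])
assert padded.shape == (3, 8, 8)
```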
> Concerning `elastic` and all the affine transform kernels (`affine`, `perspective`, `rotate`), there are some very limited opportunities for optimization. Perhaps a couple of in-place ops in `elastic_transform` & `_perspective_grid` and a few optimizations in `_apply_grid_transform` (split of mask and img, bilinear fill estimation etc.). Also some minor fixes related to the input assertions. @vfdev-5 would you be OK to assess on your side whether it makes sense to do these or to leave the methods on `_FT` to avoid copy-pasting? Perhaps you have in mind other optimizations that I can't see that could affect performance?
Checking various options with `affine`, there is no obvious way to improve runtime performance. Yes, we could add some in-place ops and do the "split of mask and img, bilinear fill estimation etc.". There is also an open issue about the incorrect behaviour of bilinear mode with a provided non-None fill (https://github.com/pytorch/vision/issues/6517). Given that, I think we can keep this implementation.
About the non-vectorized bitwise shifts, is there an issue in core?

> About the non-vectorized bitwise shifts, is there an issue in core?
I don't think so, but @alexsamardzic wanted to have a look at it.
Edit: pytorch/pytorch#88607
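For context, the conversion in question is of the kind sketched below: widening `uint8` values towards a larger integer range either by multiplication or by a left shift. This is illustrative only and differs in details from the actual `convert_dtype` kernel; the point is that the two are mathematically identical, but the shift's CPU kernel is currently not vectorized (pytorch/pytorch#88607).

```python
import torch

# A uint8 image widened to int16: scale values from [0, 255] towards the int16 range.
img = torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8).to(torch.int16)

via_mul = img * 128    # vectorized integer multiplication
via_shift = img << 7   # bitwise_left_shift; currently a scalar loop on CPU

assert torch.equal(via_mul, via_shift)
```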
@pmeier I'm keeping the list up-to-date with all linked PRs. I'm marking as [NEEDS RETEST]/[NEEDS TEST] any kernel that I touch, to run further benchmarks and update the numbers.
An interesting question is whether a sequence of these transformations can be fused with Inductor/Dynamo (or something else?) to produce a fused, low-memory-access CPU kernel (working with uint8 or fp32?), and how that interacts with the randomness of whether to apply a transform or not.
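As a rough illustration of the kind of experiment this suggests (assuming PyTorch 2.x with `torch.compile`; the pipeline below is a toy stand-in for real transforms, not the transforms v2 kernels themselves):

```python
import torch

def toy_pipeline(img: torch.Tensor) -> torch.Tensor:
    # A toy stand-in for a normalize-like sequence of pointwise ops.
    img = img.to(torch.float32) / 255.0
    img = (img - 0.5) / 0.5
    return img.clamp(-1.0, 1.0)

compiled = torch.compile(toy_pipeline)  # Inductor backend by default

img = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8)
out = compiled(img)

# Randomly applied transforms introduce data-dependent Python control flow,
# which typically causes graph breaks or recompilation when compiled:
def maybe_flip(img: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    if torch.rand(()) < p:  # Python-level branch on a tensor value
        img = img.flip(-1)
    return img
```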
The Transforms V2 API is faster than V1 (stable) because it introduces several optimizations on the Transform Classes and Functional kernels. Summarizing the performance gains on a single number should be taken with a grain of salt because:
With the above in mind, here are some statistics that summarize the performance of the new API:
- `float32` ops were improved on average by 9% and `uint8` by 12%. On the PIL backend the performance remains the same.
- `cpu` performance was improved by 23% and `cuda` by 29%. On the PIL backend the performance remains the same.

To estimate the above aggregate statistics we used this script on top of the detailed benchmarks.
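The actual aggregation script is not reproduced here. Purely as an illustration, per-kernel speed-ups are typically summarized by averaging the V1/V2 time ratios across configurations; the column and file names below are made up.

```python
import pandas as pd

# Hypothetical input: one row per (kernel, dtype, device, num_threads, batch_size)
# with median runtimes for V1 and V2. Column names are illustrative only.
df = pd.read_csv("benchmark_results.csv")

df["improvement_pct"] = (df["v1_time"] / df["v2_time"] - 1.0) * 100.0

# Average improvement per dtype and per device, in the style of the summaries above.
print(df.groupby("dtype")["improvement_pct"].mean())
print(df.groupby("device")["improvement_pct"].mean())
```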
For all benchmarks below we use PyTorch nightly `1.14.0.dev20221115`, CUDA 11.6 and TorchVision main from ad128b753c7e8cc0c600dfddac22ff48fc73c9d9. The statistics were estimated on a `p4d.24xlarge` AWS instance with A100 GPUs. Since both V1 and V2 use the same PyTorch version, the speed improvements below don't include performance optimizations performed on the C++ kernels of Core.
To assess the performance in real-world applications, we trained a ResNet50 using TorchVision's SoTA recipe for a reduced number of 10 epochs across different setups:

```
PYTHONPATH=$PYTHONPATH:`pwd` python -u run_with_submitit.py --ngpus 8 --nodes 1 --model resnet50 --batch-size 128 --lr 0.5 --lr-scheduler cosineannealinglr --lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 10 --random-erase 0.1 --label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --weight-decay 0.00002 --norm-weight-decay 0.0 --train-crop-size 176 --model-ema --val-resize-size 232 --ra-sampler --ra-reps 4 --data-path /datasets01/imagenet_full_size/061417/
```
Generated using the following script, inspired by earlier iterations from @vfdev-5 and amended by @pmeier. We compare V1 against V2 for all kernels across many configurations (batch size, dtype, device, number of threads etc.) and then estimate the average performance improvement across all configurations to summarize the end result.

Generated using @pmeier's script. We compare V1 against V2 for all kernels across many configurations (batch size, dtype, device, number of threads etc.) and then estimate the average performance improvement across all configurations to summarize the end result.
In addition to a lot of other goodies that transforms v2 will bring, we are also actively working on improving the performance. This is a tracker / overview issue of our progress.
Performance was measured with this benchmark script. Unless noted otherwise, the performance improvements reported above were computed on uint8, RGB images and videos while running single-threaded on CPU. You can find the full benchmark results alongside the benchmark script. The results will be constantly updated if new PRs are merged that have an effect on the kernels.
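As a minimal illustration of the measurement setup (not the full benchmark script linked above), a single kernel can be compared single-threaded on a uint8 RGB image with `torch.utils.benchmark`; this assumes a torchvision build that exposes the v2 functional namespace `torchvision.transforms.v2.functional`.

```python
import torch
from torch.utils import benchmark

# Single-threaded CPU comparison of one kernel on a uint8 RGB image.
img = torch.randint(0, 256, (3, 400, 400), dtype=torch.uint8)

results = []
for sub_label, stmt, setup in [
    ("v1", "F_v1.hflip(img)", "import torchvision.transforms.functional as F_v1"),
    ("v2", "F_v2.horizontal_flip(img)", "import torchvision.transforms.v2.functional as F_v2"),
]:
    timer = benchmark.Timer(
        stmt=stmt,
        setup=setup,
        globals={"img": img},
        num_threads=1,
        label="horizontal_flip",
        sub_label=sub_label,
    )
    results.append(timer.blocked_autorange(min_run_time=1))

benchmark.Compare(results).print()
```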
Kernels:

- `adjust_brightness` #6784
- `adjust_contrast` #6784 #6933
- `adjust_gamma` #6820 #6903
- `adjust_hue` #6805 #6903 #6938
- `adjust_saturation` #6784 #6940
- `adjust_sharpness` #6784 #6930
- `autocontrast` #6811 #6935 #6942
- `equalize` #6738, #6757, #6776
- `invert` #6819
- `posterize` #6823, #6847
- `solarize` #6819
- `affine` #6949
- `center_crop` #6880 #6949
- `crop` #6949
- `elastic` #6942
- `erase` #6983
- `five_crop`: Composite kernel #6949
- `pad` #6949
- `perspective` #6907 #6949
- `resize` #6892
- `resized_crop`: Composite kernel #6892 #6949
- `rotate` #6949
- `ten_crop`: Composite kernel #6949
- `convert_color_space` #6784 #6832
- `convert_dtype` #6795 #6903
  - `int` to `int` conversion. Currently, we are using a multiplication, but theoretically bit shifts are faster. However, on PyTorch core the CPU kernels for bit shifts are not vectorized, making them slower for regular-sized images than a multiplication. pytorch/pytorch#88607
- `gaussian_blur` #6762 #6888
- `normalize` #6821

Transform Classes:
C++ (PyTorch core):

- `vertical_flip` #6983 https://github.com/pytorch/pytorch/pull/89414
- `horizontal_flip` #6983 https://github.com/pytorch/pytorch/pull/88989 https://github.com/pytorch/pytorch/pull/89414

cc @vfdev-5 @datumbox @bjuncek