Concerning `elastic` and all the affine transform kernels (`affine`, `perspective`, `rotate`), there are some very limited opportunities for optimization. Perhaps a couple of in-place ops in `elastic_transform` & `_perspective_grid` and a few optimizations in `_apply_grid_transform` (split of mask and img, bilinear fill estimation etc.), as sketched below. Also some minor fixes related to the input assertions. @vfdev-5 would you be OK to assess on your side whether it makes sense to do these or to leave the methods on `_FT` to avoid copy-pasting? Perhaps you have in mind other optimizations that I can't see that could affect performance?
Concerning `crop`, `erase`, `pad`, `resize`, `horizontal_flip` and `vertical_flip`, I don't see any further improvements other than the input assertions (see the sketch below). It might be worth having a look on your side, @pmeier and @vfdev-5, in case you see something I don't.
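For illustration, this is the kind of cheap, up-front input assertion meant above; the helper and function names are hypothetical and not the torchvision API.

```python
import torch

def _assert_image_tensor(img: torch.Tensor) -> None:
    # Hypothetical helper: validate once, cheaply, before the hot path runs.
    if not isinstance(img, torch.Tensor):
        raise TypeError(f"img should be a Tensor, got {type(img)}")
    if img.ndim < 2:
        raise ValueError(f"img should have at least 2 dimensions, got {img.ndim}")

def vertical_flip_sketch(img: torch.Tensor) -> torch.Tensor:
    _assert_image_tensor(img)
    return img.flip(-2)  # flip along the height dimension; no further checks in the kernel
```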
I did another deep dive into the ops in the second paragraph of https://github.com/pytorch/vision/issues/6818#issuecomment-1289154456 and I'm fairly confident that there is little we can do to improve on our side. The only two things I found are:

1. For padding modes `"edge"` and `"reflect"` we cast to `float32` and back. There is a long-standing issue on PyTorch core, pytorch/pytorch#40763, that reports this and is assigned to @vfdev-5.
2. We support `"symmetric"` padding in `F.pad`, but `torch.nn.functional.pad` doesn't. Thus, we have a custom implementation for it: https://github.com/pytorch/vision/blob/c84dbfad97251271a789b252a2a1a52c73f623ff/torchvision/transforms/functional_tensor.py#L330. Since it is written in Python, a possible speed-up would be to implement this padding mode in C++ on the PyTorch core side.
Fixing these, we would get speed-ups for the padding modes `"edge"`, `"reflect"`, and `"symmetric"`, but not for the default and ubiquitous `"constant"` padding mode. Skimming the repository, it seems there is only one place where we use non-`"constant"` padding. In there the image is guaranteed to be float and thus would not get any performance boost.
While I think both things mentioned above would be good to have in general, I don't think we should prioritize them.
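As an aside, and purely for illustration (this is not the torchvision implementation linked above), `"symmetric"` padding of the last two dimensions can be emulated in Python by concatenating flipped edge slices; the function name is hypothetical and large paddings are not handled.

```python
import torch

def pad_symmetric_sketch(img: torch.Tensor, padding) -> torch.Tensor:
    # padding = [left, right, top, bottom]; mirrors including the edge pixel.
    # Assumes each padding is <= the corresponding image size.
    left, right, top, bottom = padding
    # Pad width: take the first `left` / last `right` columns and flip them.
    parts = [img[..., :left].flip(-1), img, img[..., img.shape[-1] - right:].flip(-1)]
    img = torch.cat([p for p in parts if p.shape[-1] > 0], dim=-1)
    # Pad height the same way on the rows.
    parts = [img[..., :top, :].flip(-2), img, img[..., img.shape[-2] - bottom:, :].flip(-2)]
    return torch.cat([p for p in parts if p.shape[-2] > 0], dim=-2)

# Example: pad a 3x4x4 image by 2 pixels on every side.
padded = pad_symmetric_sketch(torch.arange(48).reshape(3, 4, 4), [2, 2, 2, 2])
assert padded.shape == (3, 8, 8)
```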
> Concerning `elastic` and all the affine transform kernels (`affine`, `perspective`, `rotate`), there are some very limited opportunities for optimization. Perhaps a couple of in-place ops in `elastic_transform` & `_perspective_grid` and a few optimizations in `_apply_grid_transform` (split of mask and img, bilinear fill estimation etc.). Also some minor fixes related to the input assertions. @vfdev-5 would you be OK to assess on your side whether it makes sense to do these or to leave the methods on `_FT` to avoid copy-pasting? Perhaps you have in mind other optimizations that I can't see that could affect performance?
Checking various options with `affine`, there is no obvious way to improve runtime performance. Yes, we could add some in-place ops and do the "split of mask and img, bilinear fill estimation etc.". There is also an open issue about the incorrect behaviour of bilinear mode with a provided non-None fill (https://github.com/pytorch/vision/issues/6517). Given that, I think we can keep this implementation.
About the non-vectorized bitwise shifts, is there an issue in core?

> About the non-vectorized bitwise shifts, is there an issue in core?
I don't think so, but @alexsamardzic wanted to have a look at it.
Edit: pytorch/pytorch#88607
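For context, the conversion in question is of the kind sketched below: widening `uint8` values towards a larger integer range either by multiplication or by a left shift. This is illustrative only and differs in details from the actual `convert_dtype` kernel; the point is that the two are mathematically identical, but the shift's CPU kernel is currently not vectorized (pytorch/pytorch#88607).

```python
import torch

# A uint8 image widened to int16: scale values from [0, 255] towards the int16 range.
img = torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8).to(torch.int16)

via_mul = img * 128    # vectorized integer multiplication
via_shift = img << 7   # bitwise_left_shift; currently a scalar loop on CPU

assert torch.equal(via_mul, via_shift)
```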
@pmeier I'm keeping the list up-to-date with all linked PRs. I'm marking as [NEEDS RETEST]/[NEEDS TEST] any kernel that I touch, to run further benchmarks and update the numbers.
An interesting question is whether a sequence of these transformations can be fused with Inductor/Dynamo (or something else?) to produce a fused, low-memory-access CPU kernel (working with uint8 or fp32?), and how that interacts with the randomness of whether to apply a transform or not.
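As a rough illustration of the kind of experiment this suggests (assuming PyTorch 2.x with `torch.compile`; the pipeline below is a toy stand-in for real transforms, not the transforms v2 kernels themselves):

```python
import torch

def toy_pipeline(img: torch.Tensor) -> torch.Tensor:
    # A toy stand-in for a normalize-like sequence of pointwise ops.
    img = img.to(torch.float32) / 255.0
    img = (img - 0.5) / 0.5
    return img.clamp(-1.0, 1.0)

compiled = torch.compile(toy_pipeline)  # Inductor backend by default

img = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8)
out = compiled(img)

# Randomly applied transforms introduce data-dependent Python control flow,
# which typically causes graph breaks or recompilation when compiled:
def maybe_flip(img: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    if torch.rand(()) < p:  # Python-level branch on a tensor value
        img = img.flip(-1)
    return img
```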
The Transforms V2 API is faster than V1 (stable) because it introduces several optimizations on the Transform Classes and Functional kernels. Summarizing the performance gains on a single number should be taken with a grain of salt because:
With the above in mind, here are some statistics that summarize the performance of the new API:
- `float32` ops were improved on average by 9% and `uint8` by 12%. On the PIL backend the performance remains the same.
- `cpu` performance was improved by 23% and `cuda` by 29%. On the PIL backend the performance remains the same.

To estimate the above aggregate statistics we used this script on top of the detailed benchmarks.
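The actual aggregation script is not reproduced here. Purely as an illustration, per-kernel speed-ups are typically summarized by averaging the V1/V2 time ratios across configurations; the column and file names below are made up.

```python
import pandas as pd

# Hypothetical input: one row per (kernel, dtype, device, num_threads, batch_size)
# with median runtimes for V1 and V2. Column names are illustrative only.
df = pd.read_csv("benchmark_results.csv")

df["improvement_pct"] = (df["v1_time"] / df["v2_time"] - 1.0) * 100.0

# Average improvement per dtype and per device, in the style of the summaries above.
print(df.groupby("dtype")["improvement_pct"].mean())
print(df.groupby("device")["improvement_pct"].mean())
```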
For all benchmarks below we use PyTorch nightly `1.14.0.dev20221115`, CUDA 11.6 and TorchVision main from ad128b753c7e8cc0c600dfddac22ff48fc73c9d9. The statistics were estimated on a `p4d.24xlarge` AWS instance with A100 GPUs. Since both V1 and V2 use the same PyTorch version, the speed improvements below don't include performance optimizations performed on the C++ kernels of Core.
To assess the performance in real-world applications, we trained a ResNet50 using TorchVision's SoTA recipe for a reduced number of 10 epochs across different setups:

```
PYTHONPATH=$PYTHONPATH:`pwd` python -u run_with_submitit.py --ngpus 8 --nodes 1 --model resnet50 --batch-size 128 --lr 0.5 --lr-scheduler cosineannealinglr --lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 10 --random-erase 0.1 --label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --weight-decay 0.00002 --norm-weight-decay 0.0 --train-crop-size 176 --model-ema --val-resize-size 232 --ra-sampler --ra-reps 4 --data-path /datasets01/imagenet_full_size/061417/
```
Generated using the following script, inspired by earlier iterations from @vfdev-5 and amended by @pmeier. We compare V1 against V2 for all kernels across many configurations (batch size, dtype, device, number of threads etc.) and then estimate the average performance improvement across all configurations to summarize the end result.

Generated using @pmeier's script. We compare V1 against V2 for all kernels across many configurations (batch size, dtype, device, number of threads etc.) and then estimate the average performance improvement across all configurations to summarize the end result.
In addition to a lot of other goodies that transforms v2 will bring, we are also actively working on improving the performance. This is a tracker / overview issue of our progress.
Performance was measured with this benchmark script. Unless noted otherwise, the performance improvements reported above were computed on uint8, RGB images and videos while running single-threaded on CPU. You can find the full benchmark results alongside the benchmark script. The results will be constantly updated if new PRs are merged that have an effect on the kernels.
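As a minimal illustration of the measurement setup (not the full benchmark script linked above), a single kernel can be compared single-threaded on a uint8 RGB image with `torch.utils.benchmark`; this assumes a torchvision build that exposes the v2 functional namespace `torchvision.transforms.v2.functional`.

```python
import torch
from torch.utils import benchmark

# Single-threaded CPU comparison of one kernel on a uint8 RGB image.
img = torch.randint(0, 256, (3, 400, 400), dtype=torch.uint8)

results = []
for sub_label, stmt, setup in [
    ("v1", "F_v1.hflip(img)", "import torchvision.transforms.functional as F_v1"),
    ("v2", "F_v2.horizontal_flip(img)", "import torchvision.transforms.v2.functional as F_v2"),
]:
    timer = benchmark.Timer(
        stmt=stmt,
        setup=setup,
        globals={"img": img},
        num_threads=1,
        label="horizontal_flip",
        sub_label=sub_label,
    )
    results.append(timer.blocked_autorange(min_run_time=1))

benchmark.Compare(results).print()
```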
Kernels:

- `adjust_brightness` #6784
- `adjust_contrast` #6784 #6933
- `adjust_gamma` #6820 #6903
- `adjust_hue` #6805 #6903 #6938
- `adjust_saturation` #6784 #6940
- `adjust_sharpness` #6784 #6930
- `autocontrast` #6811 #6935 #6942
- `equalize` #6738, #6757, #6776
- `invert` #6819
- `posterize` #6823, #6847
- `solarize` #6819
- `affine` #6949
- `center_crop` #6880 #6949
- `crop` #6949
- `elastic` #6942
- `erase` #6983
- `five_crop`: Composite kernel #6949
- `pad` #6949
- `perspective` #6907 #6949
- `resize` #6892
- `resized_crop`: Composite kernel #6892 #6949
- `rotate` #6949
- `ten_crop`: Composite kernel #6949
- `convert_color_space` #6784 #6832
- `convert_dtype` #6795 #6903
  - `int` to `int` conversion. Currently, we are using a multiplication, but theoretically bit shifts are faster. However, on PyTorch core the CPU kernels for bit shifts are not vectorized, making them slower for regular-sized images than a multiplication. pytorch/pytorch#88607
- `gaussian_blur` #6762 #6888
- `normalize` #6821

Transform Classes:
C++ (PyTorch core):

- `vertical_flip` #6983 https://github.com/pytorch/pytorch/pull/89414
- `horizontal_flip` #6983 https://github.com/pytorch/pytorch/pull/88989 https://github.com/pytorch/pytorch/pull/89414

cc @vfdev-5 @datumbox @bjuncek