Closed by dcrc2 2 years ago
vrelu3 benchmarks:
benchmark 'torch.Size([1048576]) test_backwards': 6 tests

| Name (time in ms) | Median | IQR | Outliers | Mean | StdDev | Min | Max | Iterations | Rounds |
|---|---|---|---|---|---|---|---|---|---|
| `backwards[vrelu3_pytorch-Knossos CUDA-torch.Size([1048576])]` | 1.2806 (1.24) | 0.0128 (1.0) | 61;294 | 1.2992 (1.24) | 0.2028 (1.28) | 1.1950 (1.19) | 7.3775 (1.33) | 1 | 3540 |
| `backwards[vrelu3_pytorch-Knossos-torch.Size([1048576])]` | 26.2214 (25.46) | 0.6268 (48.97) | 17;17 | 26.5462 (25.37) | 1.1526 (7.26) | 25.6041 (25.41) | 32.7144 (5.91) | 1 | 191 |
| `backwards[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([1048576])]` | 1.8976 (1.84) | 0.0416 (3.25) | 67;217 | 1.9291 (1.84) | 0.2513 (1.58) | 1.8193 (1.81) | 7.5638 (1.37) | 1 | 2670 |
| `backwards[vrelu3_pytorch-Manual CUDA-torch.Size([1048576])]` | 1.0301 (1.0) | 0.0147 (1.15) | 104;322 | 1.0465 (1.0) | 0.1588 (1.0) | 1.0077 (1.0) | 5.5338 (1.0) | 1 | 4814 |
| `backwards[vrelu3_pytorch-PyTorch CUDA-torch.Size([1048576])]` | 1.4653 (1.42) | 0.0189 (1.48) | 62;417 | 1.5122 (1.45) | 0.2858 (1.80) | 1.4410 (1.43) | 8.1997 (1.48) | 1 | 2978 |
| `backwards[vrelu3_pytorch-PyTorch-torch.Size([1048576])]` | 1.6212 (1.57) | 0.4094 (31.99) | 107;123 | 1.7982 (1.72) | 0.7739 (4.87) | 1.1061 (1.10) | 13.2313 (2.39) | 1 | 1232 |

benchmark 'torch.Size([1048576]) test_forward': 6 tests

| Name (time in us) | Median | IQR | Outliers | Mean | StdDev | Min | Max | Iterations | Rounds |
|---|---|---|---|---|---|---|---|---|---|
| `forward[vrelu3_pytorch-Knossos CUDA-torch.Size([1048576])]` | 218.3970 (1.66) | 6.5000 (1.07) | 2099;6089 | 217.2339 (1.62) | 61.8629 (1.09) | 31.6000 (1.0) | 6,680.3870 (1.13) | 1 | 73532 |
| `forward[vrelu3_pytorch-Knossos-torch.Size([1048576])]` | 35,061.1035 (266.22) | 345.4935 (56.64) | 3;8 | 35,357.5663 (264.33) | 1,895.2845 (33.39) | 34,617.0120 (>1000.0) | 55,030.3650 (9.31) | 1 | 140 |
| `forward[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([1048576])]` | 1,684.2710 (12.79) | 13.0990 (2.15) | 93;229 | 1,697.1261 (12.69) | 119.4271 (2.10) | 1,667.6720 (52.77) | 6,737.2860 (1.14) | 1 | 2594 |
| `forward[vrelu3_pytorch-Manual CUDA-torch.Size([1048576])]` | 131.6980 (1.0) | 6.1000 (1.0) | 275;2254 | 133.7631 (1.0) | 56.7587 (1.0) | 123.6970 (3.91) | 5,912.0010 (1.0) | 1 | 33180 |
| `forward[vrelu3_pytorch-PyTorch CUDA-torch.Size([1048576])]` | 651.6890 (4.95) | 10.3000 (1.69) | 325;1029 | 673.9096 (5.04) | 136.5295 (2.41) | 636.7890 (20.15) | 6,440.1930 (1.09) | 1 | 5832 |
| `forward[vrelu3_pytorch-PyTorch-torch.Size([1048576])]` | 1,834.3195 (13.93) | 290.1950 (47.57) | 164;233 | 1,984.6801 (14.84) | 728.3780 (12.83) | 1,404.1770 (44.44) | 14,083.2640 (2.38) | 1 | 2730 |

benchmark 'torch.Size([1048576]) test_inference': 6 tests

| Name (time in us) | Median | IQR | Outliers | Mean | StdDev | Min | Max | Iterations | Rounds |
|---|---|---|---|---|---|---|---|---|---|
| `inference[vrelu3_pytorch-Knossos CUDA-torch.Size([1048576])]` | 220.2960 (1.78) | 5.8990 (2.56) | 2034;6762 | 218.8017 (1.73) | 69.6625 (3.65) | 28.3990 (1.0) | 8,913.4460 (3.87) | 1 | 79241 |
| `inference[vrelu3_pytorch-Knossos-torch.Size([1048576])]` | 35,132.0780 (283.10) | 512.0407 (222.63) | 13;11 | 35,313.2812 (279.67) | 711.3912 (37.31) | 34,643.0860 (>1000.0) | 39,056.4080 (16.94) | 1 | 135 |
| `inference[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([1048576])]` | 1,680.9710 (13.55) | 19.5998 (8.52) | 39;247 | 1,713.3007 (13.57) | 274.4847 (14.40) | 1,660.4710 (58.47) | 8,932.9460 (3.87) | 1 | 2835 |
| `inference[vrelu3_pytorch-Manual CUDA-torch.Size([1048576])]` | 124.0980 (1.0) | 2.3000 (1.0) | 418;3532 | 126.2678 (1.0) | 19.0669 (1.0) | 119.3980 (4.20) | 2,305.9610 (1.0) | 1 | 35015 |
| `inference[vrelu3_pytorch-PyTorch CUDA-torch.Size([1048576])]` | 648.6890 (5.23) | 5.6000 (2.43) | 61;469 | 654.8684 (5.19) | 130.5742 (6.85) | 636.5890 (22.42) | 8,288.8590 (3.59) | 1 | 7583 |
| `inference[vrelu3_pytorch-PyTorch-torch.Size([1048576])]` | 1,761.9690 (14.20) | 248.4210 (108.01) | 110;185 | 1,854.6426 (14.69) | 624.9107 (32.77) | 1,426.7760 (50.24) | 12,904.4790 (5.60) | 1 | 2865 |

benchmark 'torch.Size([65025]) test_backwards': 6 tests

| Name (time in us) | Median | IQR | Outliers | Mean | StdDev | Min | Max | Iterations | Rounds |
|---|---|---|---|---|---|---|---|---|---|
| `backwards[vrelu3_pytorch-Knossos CUDA-torch.Size([65025])]` | 251.7950 (1.0) | 13.4000 (1.15) | 161;356 | 262.6270 (1.0) | 125.7263 (3.65) | 235.7960 (1.0) | 6,530.8920 (5.87) | 1 | 10176 |
| `backwards[vrelu3_pytorch-Knossos-torch.Size([65025])]` | 1,728.7720 (6.87) | 17.7495 (1.52) | 124;354 | 1,761.6434 (6.71) | 179.0985 (5.21) | 1,674.7720 (7.10) | 6,822.3870 (6.14) | 1 | 2688 |
| `backwards[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([65025])]` | 372.4940 (1.48) | 13.9490 (1.19) | 54;111 | 379.8654 (1.45) | 89.3918 (2.60) | 348.5950 (1.48) | 2,817.3530 (2.53) | 1 | 2172 |
| `backwards[vrelu3_pytorch-Manual CUDA-torch.Size([65025])]` | 266.0950 (1.06) | 16.7998 (1.44) | 103;196 | 276.1523 (1.05) | 126.3476 (3.67) | 244.7960 (1.04) | 6,343.6950 (5.71) | 1 | 5387 |
| `backwards[vrelu3_pytorch-PyTorch CUDA-torch.Size([65025])]` | 297.4955 (1.18) | 11.7000 (1.0) | 67;105 | 302.7649 (1.15) | 34.4087 (1.0) | 277.7960 (1.18) | 1,111.7810 (1.0) | 1 | 2222 |
| `backwards[vrelu3_pytorch-PyTorch-torch.Size([65025])]` | 305.6950 (1.21) | 21.1733 (1.81) | 188;933 | 325.9707 (1.24) | 219.5728 (6.38) | 273.2960 (1.16) | 9,695.5410 (8.72) | 1 | 14707 |

benchmark 'torch.Size([65025]) test_forward': 6 tests

| Name (time in us) | Median | IQR | Outliers | Mean | StdDev | Min | Max | Iterations | Rounds |
|---|---|---|---|---|---|---|---|---|---|
| `forward[vrelu3_pytorch-Knossos CUDA-torch.Size([65025])]` | 32.5000 (1.0) | 0.9000 (1.0) | 1008;11621 | 34.4280 (1.0) | 16.0268 (1.0) | 30.5990 (1.0) | 2,453.6590 (1.92) | 1 | 76221 |
| `forward[vrelu3_pytorch-Knossos-torch.Size([65025])]` | 2,181.7130 (67.13) | 18.3990 (20.44) | 115;297 | 2,214.2056 (64.31) | 164.9877 (10.29) | 2,124.3640 (69.43) | 7,129.6790 (5.57) | 1 | 2186 |
| `forward[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([65025])]` | 280.9950 (8.65) | 25.2000 (28.00) | 214;306 | 289.1545 (8.40) | 119.6598 (7.47) | 258.0950 (8.43) | 6,460.6910 (5.05) | 1 | 15400 |
| `forward[vrelu3_pytorch-Manual CUDA-torch.Size([65025])]` | 78.8980 (2.43) | 12.5000 (13.89) | 345;862 | 77.1894 (2.24) | 39.8828 (2.49) | 63.5980 (2.08) | 5,866.9010 (4.58) | 1 | 49852 |
| `forward[vrelu3_pytorch-PyTorch CUDA-torch.Size([65025])]` | 190.6970 (5.87) | 15.9000 (17.67) | 94;103 | 197.7932 (5.75) | 35.6641 (2.23) | 181.2970 (5.92) | 1,279.8790 (1.0) | 1 | 4381 |
| `forward[vrelu3_pytorch-PyTorch-torch.Size([65025])]` | 323.8940 (9.97) | 24.9010 (27.67) | 111;1875 | 350.7723 (10.19) | 302.9661 (18.90) | 292.6950 (9.57) | 18,128.4960 (14.16) | 1 | 10851 |

benchmark 'torch.Size([65025]) test_inference': 6 tests

| Name (time in us) | Median | IQR | Outliers | Mean | StdDev | Min | Max | Iterations | Rounds |
|---|---|---|---|---|---|---|---|---|---|
| `inference[vrelu3_pytorch-Knossos CUDA-torch.Size([65025])]` | 29.2990 (1.0) | 0.8010 (1.0) | 826;13731 | 31.2246 (1.0) | 17.6233 (1.0) | 27.6990 (1.0) | 2,474.9570 (1.0) | 1 | 82510 |
| `inference[vrelu3_pytorch-Knossos-torch.Size([65025])]` | 2,177.3620 (74.32) | 18.6000 (23.22) | 75;277 | 2,220.7195 (71.12) | 271.8220 (15.42) | 2,131.0620 (76.94) | 9,033.4400 (3.65) | 1 | 2036 |
| `inference[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([65025])]` | 268.1950 (9.15) | 21.7000 (27.09) | 250;369 | 274.2801 (8.78) | 82.6097 (4.69) | 248.1960 (8.96) | 5,729.3010 (2.31) | 1 | 16224 |
| `inference[vrelu3_pytorch-Manual CUDA-torch.Size([65025])]` | 71.9980 (2.46) | 12.7990 (15.98) | 715;946 | 69.8567 (2.24) | 28.5744 (1.62) | 57.3990 (2.07) | 3,624.1380 (1.46) | 1 | 49654 |
| `inference[vrelu3_pytorch-PyTorch CUDA-torch.Size([65025])]` | 153.8970 (5.25) | 15.9995 (19.97) | 371;388 | 161.2729 (5.16) | 32.5686 (1.85) | 148.3970 (5.36) | 2,479.1570 (1.00) | 1 | 20484 |
| `inference[vrelu3_pytorch-PyTorch-torch.Size([65025])]` | 302.8950 (10.34) | 14.5990 (18.23) | 235;1454 | 320.2604 (10.26) | 167.6482 (9.51) | 275.7950 (9.96) | 6,516.7880 (2.63) | 1 | 11937 |

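
A note on reading the numbers above: pytest-benchmark prints, in parentheses, each value relative to the best (smallest) value in the same column. The sketch below reproduces those ratios for the Median column of the `torch.Size([1048576]) test_backwards` benchmark. The helper name `relative_to_best` is made up for illustration and is not part of pytest-benchmark or Knossos.

```python
# Median times in ms, copied from the
# 'torch.Size([1048576]) test_backwards' benchmark above.
medians_ms = {
    "Knossos CUDA": 1.2806,
    "Knossos": 26.2214,
    "Manual CUDA (with transfer)": 1.8976,
    "Manual CUDA": 1.0301,
    "PyTorch CUDA": 1.4653,
    "PyTorch": 1.6212,
}


def relative_to_best(values):
    """Divide every value by the smallest value in the mapping.

    Hypothetical helper: it mirrors how the parenthesised ratios in the
    benchmark output relate to the raw timings.
    """
    best = min(values.values())
    return {name: value / best for name, value in values.items()}


ratios = relative_to_best(medians_ms)

# Print the variants from fastest to slowest, with the ratio in parentheses.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:30s} {medians_ms[name]:8.4f} ms  ({ratio:.2f})")
```

Read this way, the Knossos CUDA backwards pass at this size has a median about 1.24 times that of the hand-written CUDA kernel, and is faster than the PyTorch CUDA variant (about 1.42 times).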
Completes end-to-end CUDA support for elementwise functions: `knossos.elementwise` can be compiled twice, to support both CPU and GPU.