microsoft / knossos-ksc

Compiler with automatic differentiation

Generate and compile CUDA code #976

Closed by dcrc2 2 years ago

dcrc2 commented 2 years ago

Completes end-to-end CUDA support for elementwise functions:

dcrc2 commented 2 years ago

vrelu3 benchmarks:

-------------------------------------------------------------------------------- benchmark 'torch.Size([1048576]) test_backwards': 6 tests ---------------------------------------------------------------------------------
Name (time in ms)                                                                Median               IQR            Outliers     Mean            StdDev                Min                Max            Iterations  Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
backwards[vrelu3_pytorch-Knossos CUDA-torch.Size([1048576])]                     1.2806 (1.24)     0.0128 (1.0)        61;294   1.2992 (1.24)     0.2028 (1.28)      1.1950 (1.19)      7.3775 (1.33)              1    3540
backwards[vrelu3_pytorch-Knossos-torch.Size([1048576])]                         26.2214 (25.46)    0.6268 (48.97)       17;17  26.5462 (25.37)    1.1526 (7.26)     25.6041 (25.41)    32.7144 (5.91)              1     191
backwards[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([1048576])]      1.8976 (1.84)     0.0416 (3.25)       67;217   1.9291 (1.84)     0.2513 (1.58)      1.8193 (1.81)      7.5638 (1.37)              1    2670
backwards[vrelu3_pytorch-Manual CUDA-torch.Size([1048576])]                      1.0301 (1.0)      0.0147 (1.15)      104;322   1.0465 (1.0)      0.1588 (1.0)       1.0077 (1.0)       5.5338 (1.0)               1    4814
backwards[vrelu3_pytorch-PyTorch CUDA-torch.Size([1048576])]                     1.4653 (1.42)     0.0189 (1.48)       62;417   1.5122 (1.45)     0.2858 (1.80)      1.4410 (1.43)      8.1997 (1.48)              1    2978
backwards[vrelu3_pytorch-PyTorch-torch.Size([1048576])]                          1.6212 (1.57)     0.4094 (31.99)     107;123   1.7982 (1.72)     0.7739 (4.87)      1.1061 (1.10)     13.2313 (2.39)              1    1232
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
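
A note on reading these tables: the layout looks like pytest-benchmark output, where the figure in parentheses is that row's value divided by the best (smallest) value in the same column. The short sketch below recomputes the Median column ratios for the table above from the printed numbers. The dictionary and function names are illustrative and do not come from the PR, and reading the plain "Knossos" row as the non-CUDA build is an assumption.

```python
# Median backward-pass times in ms for torch.Size([1048576]),
# copied from the Median column of the table above.
median_ms = {
    "Knossos CUDA": 1.2806,
    "Knossos": 26.2214,
    "Manual CUDA (with transfer)": 1.8976,
    "Manual CUDA": 1.0301,
    "PyTorch CUDA": 1.4653,
    "PyTorch": 1.6212,
}


def relative_to_best(times):
    """Divide every entry by the smallest one, so the fastest row reads 1.0."""
    best = min(times.values())
    return {name: t / best for name, t in times.items()}


ratios = relative_to_best(median_ms)

# Print rows from fastest to slowest with the ratio in parentheses.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:30s} {median_ms[name]:8.4f} ms  ({ratio:.2f})")

# Ratio between the plain "Knossos" row and the "Knossos CUDA" row.
# Assumption: the plain row is the existing non-CUDA build.
cuda_speedup = median_ms["Knossos"] / median_ms["Knossos CUDA"]
print(f"Knossos vs Knossos CUDA median: {cuda_speedup:.1f}x")
```

The same calculation applies to the other columns and to the remaining tables. Note that the time unit differs between tables (ms here, us in the others).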

------------------------------------------------------------------------------------------- benchmark 'torch.Size([1048576]) test_forward': 6 tests --------------------------------------------------------------------------------------------
Name (time in us)                                                                  Median                 IQR            Outliers         Mean                StdDev                    Min                    Max            Iterations  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
forward[vrelu3_pytorch-Knossos CUDA-torch.Size([1048576])]                       218.3970 (1.66)       6.5000 (1.07)    2099;6089     217.2339 (1.62)        61.8629 (1.09)         31.6000 (1.0)       6,680.3870 (1.13)              1   73532
forward[vrelu3_pytorch-Knossos-torch.Size([1048576])]                         35,061.1035 (266.22)   345.4935 (56.64)         3;8  35,357.5663 (264.33)   1,895.2845 (33.39)    34,617.0120 (>1000.0)  55,030.3650 (9.31)              1     140
forward[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([1048576])]      1,684.2710 (12.79)     13.0990 (2.15)       93;229   1,697.1261 (12.69)      119.4271 (2.10)      1,667.6720 (52.77)     6,737.2860 (1.14)              1    2594
forward[vrelu3_pytorch-Manual CUDA-torch.Size([1048576])]                        131.6980 (1.0)        6.1000 (1.0)      275;2254     133.7631 (1.0)         56.7587 (1.0)         123.6970 (3.91)      5,912.0010 (1.0)               1   33180
forward[vrelu3_pytorch-PyTorch CUDA-torch.Size([1048576])]                       651.6890 (4.95)      10.3000 (1.69)     325;1029     673.9096 (5.04)       136.5295 (2.41)        636.7890 (20.15)     6,440.1930 (1.09)              1    5832
forward[vrelu3_pytorch-PyTorch-torch.Size([1048576])]                          1,834.3195 (13.93)    290.1950 (47.57)     164;233   1,984.6801 (14.84)      728.3780 (12.83)     1,404.1770 (44.44)    14,083.2640 (2.38)              1    2730
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------ benchmark 'torch.Size([1048576]) test_inference': 6 tests -------------------------------------------------------------------------------------------
Name (time in us)                                                                    Median                 IQR            Outliers         Mean              StdDev                    Min                    Max            Iterations  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
inference[vrelu3_pytorch-Knossos CUDA-torch.Size([1048576])]                       220.2960 (1.78)       5.8990 (2.56)    2034;6762     218.8017 (1.73)      69.6625 (3.65)         28.3990 (1.0)       8,913.4460 (3.87)              1   79241
inference[vrelu3_pytorch-Knossos-torch.Size([1048576])]                         35,132.0780 (283.10)   512.0407 (222.63)      13;11  35,313.2812 (279.67)   711.3912 (37.31)    34,643.0860 (>1000.0)  39,056.4080 (16.94)             1     135
inference[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([1048576])]      1,680.9710 (13.55)     19.5998 (8.52)       39;247   1,713.3007 (13.57)    274.4847 (14.40)     1,660.4710 (58.47)     8,932.9460 (3.87)              1    2835
inference[vrelu3_pytorch-Manual CUDA-torch.Size([1048576])]                        124.0980 (1.0)        2.3000 (1.0)      418;3532     126.2678 (1.0)       19.0669 (1.0)         119.3980 (4.20)      2,305.9610 (1.0)               1   35015
inference[vrelu3_pytorch-PyTorch CUDA-torch.Size([1048576])]                       648.6890 (5.23)       5.6000 (2.43)       61;469     654.8684 (5.19)     130.5742 (6.85)        636.5890 (22.42)     8,288.8590 (3.59)              1    7583
inference[vrelu3_pytorch-PyTorch-torch.Size([1048576])]                          1,761.9690 (14.20)    248.4210 (108.01)    110;185   1,854.6426 (14.69)    624.9107 (32.77)     1,426.7760 (50.24)    12,904.4790 (5.60)              1    2865
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------- benchmark 'torch.Size([65025]) test_backwards': 6 tests ----------------------------------------------------------------------------------------
Name (time in us)                                                                 Median                IQR            Outliers        Mean              StdDev                   Min                   Max            Iterations  Rounds
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
backwards[vrelu3_pytorch-Knossos CUDA-torch.Size([65025])]                      251.7950 (1.0)      13.4000 (1.15)      161;356    262.6270 (1.0)      125.7263 (3.65)       235.7960 (1.0)      6,530.8920 (5.87)              1   10176
backwards[vrelu3_pytorch-Knossos-torch.Size([65025])]                         1,728.7720 (6.87)     17.7495 (1.52)      124;354  1,761.6434 (6.71)     179.0985 (5.21)     1,674.7720 (7.10)     6,822.3870 (6.14)              1    2688
backwards[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([65025])]       372.4940 (1.48)     13.9490 (1.19)       54;111    379.8654 (1.45)      89.3918 (2.60)       348.5950 (1.48)     2,817.3530 (2.53)              1    2172
backwards[vrelu3_pytorch-Manual CUDA-torch.Size([65025])]                       266.0950 (1.06)     16.7998 (1.44)      103;196    276.1523 (1.05)     126.3476 (3.67)       244.7960 (1.04)     6,343.6950 (5.71)              1    5387
backwards[vrelu3_pytorch-PyTorch CUDA-torch.Size([65025])]                      297.4955 (1.18)     11.7000 (1.0)        67;105    302.7649 (1.15)      34.4087 (1.0)        277.7960 (1.18)     1,111.7810 (1.0)               1    2222
backwards[vrelu3_pytorch-PyTorch-torch.Size([65025])]                           305.6950 (1.21)     21.1733 (1.81)      188;933    325.9707 (1.24)     219.5728 (6.38)       273.2960 (1.16)     9,695.5410 (8.72)              1   14707
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------- benchmark 'torch.Size([65025]) test_forward': 6 tests -----------------------------------------------------------------------------------------
Name (time in us)                                                               Median                IQR            Outliers        Mean              StdDev                   Min                    Max            Iterations  Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
forward[vrelu3_pytorch-Knossos CUDA-torch.Size([65025])]                       32.5000 (1.0)       0.9000 (1.0)    1008;11621     34.4280 (1.0)       16.0268 (1.0)         30.5990 (1.0)       2,453.6590 (1.92)              1   76221
forward[vrelu3_pytorch-Knossos-torch.Size([65025])]                         2,181.7130 (67.13)    18.3990 (20.44)     115;297  2,214.2056 (64.31)    164.9877 (10.29)    2,124.3640 (69.43)     7,129.6790 (5.57)              1    2186
forward[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([65025])]       280.9950 (8.65)     25.2000 (28.00)     214;306    289.1545 (8.40)     119.6598 (7.47)       258.0950 (8.43)      6,460.6910 (5.05)              1   15400
forward[vrelu3_pytorch-Manual CUDA-torch.Size([65025])]                        78.8980 (2.43)     12.5000 (13.89)     345;862     77.1894 (2.24)      39.8828 (2.49)        63.5980 (2.08)      5,866.9010 (4.58)              1   49852
forward[vrelu3_pytorch-PyTorch CUDA-torch.Size([65025])]                      190.6970 (5.87)     15.9000 (17.67)      94;103    197.7932 (5.75)      35.6641 (2.23)       181.2970 (5.92)      1,279.8790 (1.0)               1    4381
forward[vrelu3_pytorch-PyTorch-torch.Size([65025])]                           323.8940 (9.97)     24.9010 (27.67)    111;1875    350.7723 (10.19)    302.9661 (18.90)      292.6950 (9.57)     18,128.4960 (14.16)             1   10851
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------- benchmark 'torch.Size([65025]) test_inference': 6 tests -----------------------------------------------------------------------------------------
Name (time in us)                                                                 Median                IQR             Outliers        Mean              StdDev                   Min                   Max            Iterations  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
inference[vrelu3_pytorch-Knossos CUDA-torch.Size([65025])]                       29.2990 (1.0)       0.8010 (1.0)      826;13731     31.2246 (1.0)       17.6233 (1.0)         27.6990 (1.0)      2,474.9570 (1.0)               1   82510
inference[vrelu3_pytorch-Knossos-torch.Size([65025])]                         2,177.3620 (74.32)    18.6000 (23.22)       75;277  2,220.7195 (71.12)    271.8220 (15.42)    2,131.0620 (76.94)    9,033.4400 (3.65)              1    2036
inference[vrelu3_pytorch-Manual CUDA (with transfer)-torch.Size([65025])]       268.1950 (9.15)     21.7000 (27.09)      250;369    274.2801 (8.78)      82.6097 (4.69)       248.1960 (8.96)     5,729.3010 (2.31)              1   16224
inference[vrelu3_pytorch-Manual CUDA-torch.Size([65025])]                        71.9980 (2.46)     12.7990 (15.98)      715;946     69.8567 (2.24)      28.5744 (1.62)        57.3990 (2.07)     3,624.1380 (1.46)              1   49654
inference[vrelu3_pytorch-PyTorch CUDA-torch.Size([65025])]                      153.8970 (5.25)     15.9995 (19.97)      371;388    161.2729 (5.16)      32.5686 (1.85)       148.3970 (5.36)     2,479.1570 (1.00)              1   20484
inference[vrelu3_pytorch-PyTorch-torch.Size([65025])]                           302.8950 (10.34)    14.5990 (18.23)     235;1454    320.2604 (10.26)    167.6482 (9.51)       275.7950 (9.96)     6,516.7880 (2.63)              1   11937
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------