Add "ks_embedded" benchmarks

Completes ADO 19518. Supersedes #857 and https://github.com/microsoft/knossos-ksc/pull/859.
Problem addressed

There was no way to benchmark the performance of KS code run from Python (except KS code generated by ts2ks). Without that ability, it is hard to measure the impact of potential optimisations that could be added to ksc.
Solution implemented

ksc_string_to_autograd_function allows KS code to be embedded in a Python source file (in a fairly low-level way). conftest.py has been amended to support benchmarking such usages: it picks up functions whose names are prefixed with <benchmarkname>_ks_embedded_.
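
The prefix convention can be illustrated with a minimal sketch. This is not the code in conftest.py: the helper name find_embedded_benchmarks and the stand-in module are hypothetical, and the two variant names are inferred from the benchmark names in the results. Only the <benchmarkname>_ks_embedded_ prefix itself comes from this PR.

```python
import types


def find_embedded_benchmarks(module, benchmarkname):
    """Hypothetical helper: map variant suffix -> callable for every
    function in `module` named <benchmarkname>_ks_embedded_<suffix>."""
    prefix = f"{benchmarkname}_ks_embedded_"
    return {
        name[len(prefix):]: fn
        for name, fn in vars(module).items()
        if name.startswith(prefix) and callable(fn)
    }


# Stand-in module for illustration only; the real functions would live in
# the module passed via --modulepath and return embedded KS functions.
demo = types.ModuleType("demo")
demo.vrelu3_ks_embedded_checkpointed_map = lambda: "variant a"
demo.vrelu3_ks_embedded_checkpointed_map_handwritten_relu3 = lambda: "variant b"
demo.vrelu3_pytorch = lambda: "not an embedded variant"

found = find_embedded_benchmarks(demo, "vrelu3")
for suffix in sorted(found):
    print(suffix)
```

Stripping the prefix leaves a short variant name, which is one plausible way to label each embedded benchmark separately in the results.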
Why solution should work

It is now straightforward to embed KS code in Python and benchmark it.
Results

% python3 -m pytest src/bench/ --benchmark-sort=name --benchmark-group-by=group,func --modulepath=examples/dl-activations/relu3 --benchmarkname=vrelu3 --benchmark-autosave
...
------------------------------------------------------------------------------------------------------------- benchmark 'torch.Size([16]) test_backwards': 4 tests ------------------------------------------------------------------------------------------------------------
Name (time in us)                                                                                           Min                 Max               Mean             StdDev             Median                IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_backwards[vrelu3_pytorch-Knossos embedded checkpointed_map-torch.Size([16])]                       28.8000 (1.02)      96.5000 (1.31)     31.9838 (1.03)      5.7976 (1.24)     30.8000 (1.03)      2.2999 (1.44)        68;79       31.2658 (0.97)       1000           1
test_backwards[vrelu3_pytorch-Knossos embedded checkpointed_map_handwritten_relu3-torch.Size([16])]     28.3000 (1.0)       73.7000 (1.0)      30.9282 (1.0)       4.6729 (1.0)      30.0000 (1.0)       1.6000 (1.0)         53;74       32.3330 (1.0)        1000           1
test_backwards[vrelu3_pytorch-Knossos-torch.Size([16])]                                                 29.5000 (1.04)     259.1000 (3.52)     48.0474 (1.55)     31.7818 (6.80)     31.8000 (1.06)     18.7000 (11.69)     145;147       20.8128 (0.64)       1000           1
test_backwards[vrelu3_pytorch-PyTorch-torch.Size([16])]                                                 29.8000 (1.05)     108.5000 (1.47)     32.6943 (1.06)      5.7083 (1.22)     31.0000 (1.03)      2.0000 (1.25)        59;89       30.5864 (0.95)       1000           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------ benchmark 'torch.Size([16]) test_forward': 4 tests -----------------------------------------------------------------------------------------------------------
Name (time in us)                                                                                         Min                 Max               Mean            StdDev             Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_forward[vrelu3_pytorch-Knossos embedded checkpointed_map-torch.Size([16])]                        8.4000 (1.01)     299.5000 (4.63)     12.1827 (1.22)     8.3676 (3.41)      9.2000 (1.05)     1.2000 (1.50)    1837;3041       82.0835 (0.82)      17637           1
test_forward[vrelu3_pytorch-Knossos embedded checkpointed_map_handwritten_relu3-torch.Size([16])]      8.2999 (1.0)      218.0000 (3.37)     10.6237 (1.06)     6.3270 (2.58)      8.8000 (1.0)      0.7999 (1.0)     1195;2181       94.1288 (0.94)      17183           1
test_forward[vrelu3_pytorch-Knossos-torch.Size([16])]                                                  8.8000 (1.06)      64.7000 (1.0)       9.9979 (1.0)      2.4524 (1.0)       9.3000 (1.06)     0.9000 (1.13)     854;1128      100.0210 (1.0)       15409           1
test_forward[vrelu3_pytorch-PyTorch-torch.Size([16])]                                                 23.6000 (2.84)     110.8000 (1.71)     26.1701 (2.62)     5.2022 (2.12)     24.7000 (2.81)     2.0000 (2.50)      471;579       38.2116 (0.38)      10505           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------ benchmark 'torch.Size([16]) test_inference': 4 tests -----------------------------------------------------------------------------------------------------------
Name (time in us)                                                                                           Min                 Max               Mean            StdDev             Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_inference[vrelu3_pytorch-Knossos embedded checkpointed_map-torch.Size([16])]                        7.0000 (1.0)      250.6000 (4.44)      9.1777 (1.17)     5.6040 (2.67)      7.5000 (1.01)     0.7000 (1.40)    1462;2514      108.9593 (0.85)      19268           1
test_inference[vrelu3_pytorch-Knossos embedded checkpointed_map_handwritten_relu3-torch.Size([16])]      7.0000 (1.0)       56.6000 (1.00)      7.8128 (1.0)      2.0993 (1.0)       7.4000 (1.0)      0.5000 (1.0)       369;875      127.9944 (1.0)       17922           1
test_inference[vrelu3_pytorch-Knossos-torch.Size([16])]                                                  7.5000 (1.07)      56.4000 (1.0)       8.2815 (1.06)     2.1038 (1.00)      7.8000 (1.05)     0.6000 (1.20)      350;604      120.7514 (0.94)      14926           1
test_inference[vrelu3_pytorch-PyTorch-torch.Size([16])]                                                 17.5000 (2.50)     113.9000 (2.02)     19.4720 (2.49)     4.2386 (2.02)     18.3000 (2.47)     1.4000 (2.80)      623;727       51.3557 (0.40)      13405           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------ benchmark 'torch.Size([4]) test_backwards': 4 tests -------------------------------------------------------------------------------------------------------------
Name (time in us)                                                                                          Min                 Max               Mean             StdDev             Median                IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_backwards[vrelu3_pytorch-Knossos embedded checkpointed_map-torch.Size([4])]                       28.6000 (1.02)     146.2000 (1.21)     34.7883 (1.12)      7.9462 (1.20)     33.0000 (1.12)      6.4500 (3.15)       168;49       28.7453 (0.89)       1000           1
test_backwards[vrelu3_pytorch-Knossos embedded checkpointed_map_handwritten_relu3-torch.Size([4])]     28.0000 (1.0)      121.0000 (1.0)      31.0249 (1.0)       6.5959 (1.0)      29.4500 (1.0)       2.0500 (1.0)         60;75       32.2322 (1.0)        1000           1
test_backwards[vrelu3_pytorch-Knossos-torch.Size([4])]                                                 29.3000 (1.05)     680.6000 (5.62)     33.2222 (1.07)     21.4719 (3.26)     31.1000 (1.06)      2.2000 (1.07)        17;81       30.1004 (0.93)       1000           1
test_backwards[vrelu3_pytorch-PyTorch-torch.Size([4])]                                                 29.7000 (1.06)     288.3000 (2.38)     46.9940 (1.51)     27.1909 (4.12)     37.4000 (1.27)     15.4000 (7.51)      117;158       21.2793 (0.66)       1000           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------ benchmark 'torch.Size([4]) test_forward': 4 tests -------------------------------------------------------------------------------------------------------------
Name (time in us)                                                                                        Min                 Max               Mean             StdDev             Median                IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_forward[vrelu3_pytorch-Knossos embedded checkpointed_map-torch.Size([4])]                        8.3000 (1.01)     162.9000 (2.21)     10.8593 (1.17)      6.1737 (2.35)      8.9000 (1.02)      0.8000 (1.14)    1084;1824       92.0870 (0.86)      14772           1
test_forward[vrelu3_pytorch-Knossos embedded checkpointed_map_handwritten_relu3-torch.Size([4])]      8.2000 (1.0)       73.6000 (1.0)       9.3094 (1.0)       2.6318 (1.0)       8.7000 (1.0)       0.7000 (1.0)       588;827      107.4188 (1.0)       15221           1
test_forward[vrelu3_pytorch-Knossos-torch.Size([4])]                                                  8.7000 (1.06)     150.3000 (2.04)     19.3721 (2.08)     12.0126 (4.56)     12.5000 (1.44)     21.5000 (30.71)      954;29       51.6207 (0.48)       4483           1
test_forward[vrelu3_pytorch-PyTorch-torch.Size([4])]                                                 23.7000 (2.89)     266.1000 (3.62)     29.3479 (3.15)     14.0189 (5.33)     25.7000 (2.95)      2.5001 (3.57)     692;1019       34.0740 (0.32)       9785           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------ benchmark 'torch.Size([4]) test_inference': 4 tests ------------------------------------------------------------------------------------------------------------
Name (time in us)                                                                                          Min                 Max               Mean             StdDev             Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_inference[vrelu3_pytorch-Knossos embedded checkpointed_map-torch.Size([4])]                        7.1000 (1.01)     274.4000 (3.63)      9.9109 (1.22)      6.9465 (2.75)      7.4000 (1.0)      0.6000 (1.0)     1992;2326      100.8995 (0.82)      14493           1
test_inference[vrelu3_pytorch-Knossos embedded checkpointed_map_handwritten_relu3-torch.Size([4])]      7.0000 (1.0)      170.3000 (2.25)     10.2539 (1.26)      7.3082 (2.90)      7.4000 (1.0)      1.0000 (1.67)    2201;3118       97.5239 (0.80)      16104           1
test_inference[vrelu3_pytorch-Knossos-torch.Size([4])]                                                  7.3000 (1.04)      75.6001 (1.0)       8.1564 (1.0)       2.5223 (1.0)       7.7000 (1.04)     0.7000 (1.17)        50;82      122.6034 (1.0)        3196           1
test_inference[vrelu3_pytorch-PyTorch-torch.Size([4])]                                                 17.5000 (2.50)     222.4000 (2.94)     23.6523 (2.90)     14.4866 (5.74)     18.6000 (2.51)     2.1000 (3.50)    1288;2042       42.2792 (0.34)      11588           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
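
As a reading aid: the figure in parentheses after each value is that value divided by the best value in the same column of the same group. The short sketch below recomputes the ratios for the Mean column of the 'torch.Size([16]) test_forward' group, using the means copied from the table above; the variable names are illustrative.

```python
# Mean times in microseconds, copied from the
# 'torch.Size([16]) test_forward' group.
means_us = {
    "Knossos embedded checkpointed_map": 12.1827,
    "Knossos embedded checkpointed_map_handwritten_relu3": 10.6237,
    "Knossos": 9.9979,
    "PyTorch": 26.1701,
}

# The best (smallest) mean is the baseline with ratio 1.0.
best = min(means_us.values())
ratios = {name: round(mean / best, 2) for name, mean in means_us.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```
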
Discussion

The API is quite low-level. It may be possible to build a higher-level API that is easier to use, but the current one will do for now.
It's not clear what the name of the derivative of map f, map sufrev$f, should be (or even whether it really needs a name).