@dcrc2 and I managed to vectorise erf (we presume) by writing an embedded C++ function using aten. It is still about 50% slower than PyTorch on the largest problem size. We hypothesise that PyTorch is faster because it takes advantage of a BOG, whereas our embedded C++, embedded ks, and ts2ks functions cannot. In fact, forward_template stores only the arguments to the function (in other words, it "checkpoints" the function call). Contrary to my earlier expectation, it seems that we can (and must) benefit from a BOG in these kernels, so we will need https://github.com/microsoft/knossos-ksc/issues/818.
forward_template: https://github.com/microsoft/knossos-ksc/blob/e0fe83263828ad9080f78aec3fff78ee8cd87b46/src/python/ksc/torch_frontend.py#L448-L459
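To illustrate what "storing only the arguments" means, here is a minimal sketch of a checkpointing-style `torch.autograd.Function`, using erf as the example. This is an illustration of the general pattern, not the actual Knossos `forward_template`; the class name `CheckpointedErf` is hypothetical.

```python
import math

import torch


class CheckpointedErf(torch.autograd.Function):
    """Sketch of a "checkpointing" kernel: forward saves only the raw
    inputs (no intermediates), and backward recomputes what it needs
    from those saved inputs."""

    @staticmethod
    def forward(ctx, x):
        # Store only the argument to the function -- this is the
        # "checkpoint" of the call.
        ctx.save_for_backward(x)
        return torch.erf(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # d/dx erf(x) = (2 / sqrt(pi)) * exp(-x^2), recomputed here
        # from the saved input rather than read from a stored value.
        return grad_out * (2.0 / math.sqrt(math.pi)) * torch.exp(-x * x)
```

The trade-off this sketch makes visible: recomputation in backward keeps the forward pass memory-light, but it cannot reuse any intermediate values the way a kernel with access to a BOG could.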
How we discovered the BOG on PyTorch objects: