@dcrc2 and I managed to vectorise erf (we presume) by writing an embedded C++ function using aten. It is still about 50% slower than PyTorch on the largest problem size. We hypothesise that PyTorch is faster because it takes advantage of a BOG, whereas our embedded C++, embedded ks, and ts2ks functions cannot. In fact, forward_template stores only the arguments to the function (in other words, it "checkpoints" the function call). Contrary to my earlier expectation, it seems that we can (and must) benefit from a BOG in these kernels, so we will need https://github.com/microsoft/knossos-ksc/issues/818.
forward_template: https://github.com/microsoft/knossos-ksc/blob/e0fe83263828ad9080f78aec3fff78ee8cd87b46/src/python/ksc/torch_frontend.py#L448-L459
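To illustrate what "storing only the arguments" means, here is a minimal sketch of a checkpointing-style `torch.autograd.Function`, using erf as the example. This is an illustration of the general pattern, not the actual Knossos `forward_template`; the class name `CheckpointedErf` is hypothetical.

```python
import math

import torch


class CheckpointedErf(torch.autograd.Function):
    """Sketch of a "checkpointing" kernel: forward saves only the raw
    inputs (no intermediates), and backward recomputes what it needs
    from those saved inputs."""

    @staticmethod
    def forward(ctx, x):
        # Store only the argument to the function -- this is the
        # "checkpoint" of the call.
        ctx.save_for_backward(x)
        return torch.erf(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # d/dx erf(x) = (2 / sqrt(pi)) * exp(-x^2), recomputed here
        # from the saved input rather than read from a stored value.
        return grad_out * (2.0 / math.sqrt(math.pi)) * torch.exp(-x * x)
```

The trade-off this sketch makes visible: recomputation in backward keeps the forward pass memory-light, but it cannot reuse any intermediate values the way a kernel with access to a BOG could.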
How we discovered the BOG on PyTorch objects: