thomasbrandon / mish-cuda

Mish Activation Function for PyTorch
MIT License

Memory utilization during training. #20

Open dtmoodie opened 2 years ago

dtmoodie commented 2 years ago

As far as I can tell from the source code, this activation doesn't need to cache intermediate values to calculate gradients, since it recomputes the forward pass during the backward pass: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L26

Is this an accurate statement? Sorry if this is a dumb question; I haven't written any C++ PyTorch code, so I'm not sure how its API handles caching activations.
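For anyone else trying to picture the caching behavior being asked about: below is a minimal Python sketch (not the repo's actual C++/CUDA implementation, and `MishRecompute` is a hypothetical name) of a custom `torch.autograd.Function` that saves only the input tensor and recomputes the softplus/tanh terms in `backward`, which is the pattern the linked `mish.h` code appears to follow.

```python
import torch
import torch.nn.functional as F
from torch.autograd import Function


class MishRecompute(Function):
    """Sketch of Mish that avoids caching intermediate activations."""

    @staticmethod
    def forward(ctx, x):
        # Save only the input; softplus/tanh results are NOT stored.
        ctx.save_for_backward(x)
        return x * torch.tanh(F.softplus(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Recompute the forward quantities instead of reading them from a cache.
        sp = F.softplus(x)
        tsp = torch.tanh(sp)
        # d/dx [ x * tanh(softplus(x)) ]
        #   = tanh(sp) + x * (1 - tanh(sp)^2) * sigmoid(x)
        grad = tsp + x * (1 - tsp * tsp) * torch.sigmoid(x)
        return grad_output * grad
```

Under this scheme the only extra memory held between forward and backward is the input tensor itself, at the cost of redoing the softplus/tanh math during backward.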