thomasbrandon / mish-cuda

Mish Activation Function for PyTorch
MIT License

Memory utilization during training. #20

Open dtmoodie opened 2 years ago

dtmoodie commented 2 years ago

As far as I can tell from the source code, this activation doesn't need to cache intermediate values to calculate gradients, since it recomputes the forward pass during the backward pass: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L26

Is this an accurate statement? Sorry if this is a dumb question; I haven't written any C++ PyTorch code, so I'm not sure how its API handles caching activations.
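For anyone else trying to picture the caching behavior being asked about: below is a minimal Python sketch (not the repo's actual C++/CUDA implementation, and `MishRecompute` is a hypothetical name) of a custom `torch.autograd.Function` that saves only the input tensor and recomputes the softplus/tanh terms in `backward`, which is the pattern the linked `mish.h` code appears to follow.

```python
import torch
import torch.nn.functional as F
from torch.autograd import Function


class MishRecompute(Function):
    """Sketch of Mish that avoids caching intermediate activations."""

    @staticmethod
    def forward(ctx, x):
        # Save only the input; softplus/tanh results are NOT stored.
        ctx.save_for_backward(x)
        return x * torch.tanh(F.softplus(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Recompute the forward quantities instead of reading them from a cache.
        sp = F.softplus(x)
        tsp = torch.tanh(sp)
        # d/dx [ x * tanh(softplus(x)) ]
        #   = tanh(sp) + x * (1 - tanh(sp)^2) * sigmoid(x)
        grad = tsp + x * (1 - tsp * tsp) * torch.sigmoid(x)
        return grad_output * grad
```

Under this scheme the only extra memory held between forward and backward is the input tensor itself, at the cost of redoing the softplus/tanh math during backward.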