ml-research / rational_activations

Rational Activation Functions - Replacing Padé Activation Units

Comparison to Mish activation #9

Open kayuksel opened 3 years ago

kayuksel commented 3 years ago

Mish is currently the most popular activation function, so it would be good if you could also compare against it.

k4ntz commented 3 years ago

Yes, we want to compare against it too; I'll upload updated graphs soon.

kayuksel commented 3 years ago

@k4ntz It performed better than Mish in my case (also an RL-like setting).

kayuksel commented 3 years ago

Any tips on how to initialize? I used kaiming uniform for the Linear layers.

k4ntz commented 3 years ago

Hi @kayuksel, thanks for this info! Could you share some graphs or a link to the results (even a draft) showing that? We are also working on a comparison against GeLU in transformers. For the initialisation, you can use xavier as in our imagenet classification task (https://github.com/ml-research/rational_sl/blob/main/imagenet/train_imagenet.py) and cifar (https://github.com/ml-research/rational_sl/blob/main/cifar/train.py). If I remember correctly, it empirically works better than kaiming (for big nets). Please share any other results you might have!
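
For illustration, here is a minimal PyTorch sketch of the suggestion above: Linear layers initialised with Xavier uniform combined with the learnable Rational activation from this repository. The `from rational.torch import Rational` import follows the package README; the layer sizes, the `Rational()` constructor defaults, and the zero bias initialisation are illustrative assumptions, not the exact setup of the linked training scripts.

```python
import torch
import torch.nn as nn
from rational.torch import Rational  # pip install rational-activations


class SmallRationalNet(nn.Module):
    """Toy network: Xavier-initialised Linear layers + a learnable Rational activation."""

    def __init__(self, in_dim: int = 128, hidden: int = 256, out_dim: int = 10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.act = Rational()  # constructor defaults assumed; see the package docs for options
        self.fc2 = nn.Linear(hidden, out_dim)

        # Xavier initialisation instead of PyTorch's default kaiming uniform,
        # as suggested above for the larger classification networks.
        for layer in (self.fc1, self.fc2):
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))


# Quick shape check
model = SmallRationalNet()
print(model(torch.randn(4, 128)).shape)  # torch.Size([4, 10])
```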

kayuksel commented 3 years ago

@k4ntz I can provide the learning curves for both, but I am unable to share a draft publication yet as the work is patent-pending. Thanks for suggesting the initialization function; I will also try that.

kayuksel commented 3 years ago

@k4ntz FYI, my case is an adversarial setting, so it is important that the model re-adapts itself continuously. Having adaptable activation functions (in addition to the network weights) may therefore help in that sense.

Also, I prefer the network to overfit, so generalization is not a major concern (in case using an adaptable activation function increases the chance of overfitting on certain tasks).
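
As a rough sketch of that adaptivity point: because the rational activation's coefficients are ordinary trainable parameters, they show up in `model.parameters()` and keep adapting with every optimizer step, alongside the weights. The parameter names printed below depend on how the Rational module registers them, which is an assumption here; the rest is plain PyTorch.

```python
import torch
import torch.nn as nn
from rational.torch import Rational  # pip install rational-activations

model = nn.Sequential(nn.Linear(16, 16), Rational(), nn.Linear(16, 1))
# The optimizer sees the rational coefficients too, so the activation shape
# is updated together with the network weights on every step.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Show which parameters belong to the activation (index 1 in the Sequential).
for name, p in model.named_parameters():
    if name.startswith("1."):
        print(name, tuple(p.shape))

# One training step: both the weights and the activation adapt.
x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```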

k4ntz commented 3 years ago

Yes, using such activation functions gives the network more modelling capacity: the transformation of the manifold through the layers can be more accurate where needed. I don't think Rational AFs overfit per se, but if they are used in a network cherry-picked to perform well on one task, this additional modelling power can lead to overfitting. We are also working on pruning rational nets. Whenever results are available, they will be shared. :)