ml-research / pau

Padé Activation Units: End-to-end Learning of Activation Functions in Deep Neural Networks

Per-iteration speed of PAU compared with other activation functions #1

Closed AranKomat closed 5 years ago

AranKomat commented 5 years ago

First of all, thank you for your interesting work. I'm going to try PAU on the Transformer.

For both training and inference, I believe the per-iteration time with PAU is larger than with other activation functions such as ReLU. But the difference is pretty much negligible in practice, right?

You said that it takes too much time without weight sharing. Couldn't the non-weight-shared version be made efficient with some implementation trick? I don't know, but I'd be very interested in its performance.

You said rational function approximation is superior to polynomial approximation. Have you tried any other function family for approximation (e.g., Fourier series)?

PatrickSchrML commented 5 years ago

Hey, we are glad that you want to try it! Let us know if you need any help installing it and getting it to run. We are excited to see how it works for you!

To your questions: the per-iteration time of our CUDA implementation is not much larger than that of other activation functions; it depends on your hardware. In our experiments on a GTX 1080 Ti and a V100 we observed around 1–5% overhead, but your mileage may vary.
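For anyone who wants to check the overhead on their own hardware, here is a minimal timing sketch in PyTorch. It is not from this repo: `PAU` in the commented-out line is a placeholder for whatever activation module the package actually exports (the exact import path may differ), and the model, batch, and iteration counts are arbitrary.

```python
# Rough per-iteration timing sketch (PyTorch). `PAU` below is a placeholder
# for the activation module this repo provides; swap in the real import.
import time
import torch
import torch.nn as nn

def make_mlp(act_factory, width=1024, depth=4):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), act_factory()]
    return nn.Sequential(*layers)

def ms_per_iter(model, iters=100, batch=256, width=1024):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x = torch.randn(batch, width, device=device)
    for _ in range(10):  # warm-up so kernels/allocator are initialized before timing
        opt.zero_grad(); model(x).sum().backward(); opt.step()
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        opt.zero_grad(); model(x).sum().backward(); opt.step()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters * 1e3

print("ReLU:", ms_per_iter(make_mlp(nn.ReLU)), "ms/iter")
# print("PAU: ", ms_per_iter(make_mlp(PAU)), "ms/iter")  # plug in the PAU module here
```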

Currently we share weights (one PAU per layer) to keep the parameter space small, but we are also very interested in a non-weight-shared version. The forward pass time should not differ much, as we already evaluate the PAU per neuron; the backward pass will be slower because there are many more parameters.
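To make the weight-sharing distinction concrete, here is a small plain-PyTorch sketch of a rational activation with either one coefficient set per layer (shared, as described above) or one per neuron. This is a reference illustration, not the repo's CUDA kernel, and it assumes the "safe" denominator form 1 + |b_1 x + ... + b_n x^n| described in the paper.

```python
# Sketch only: a rational activation P(x)/Q(x) with numerator degree m and
# denominator degree n, with coefficients either shared per layer or per neuron.
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    def __init__(self, num_features=None, m=5, n=4):
        super().__init__()
        shape = (1,) if num_features is None else (num_features,)
        # coefficients of P(x) = a_0 + a_1 x + ... + a_m x^m
        self.a = nn.Parameter(0.1 * torch.randn(*shape, m + 1))
        # coefficients of the denominator terms b_1 x + ... + b_n x^n
        self.b = nn.Parameter(0.1 * torch.randn(*shape, n))

    def forward(self, x):
        # x has shape (batch, features); build powers along a trailing dim
        num_powers = torch.stack([x ** k for k in range(self.a.shape[-1])], dim=-1)
        num = (self.a * num_powers).sum(dim=-1)
        den_powers = torch.stack([x ** k for k in range(1, self.b.shape[-1] + 1)], dim=-1)
        # "safe" form: 1 + |b_1 x + ... + b_n x^n| cannot reach zero
        den = 1 + (self.b * den_powers).sum(dim=-1).abs()
        return num / den

shared = RationalActivation()         # one coefficient set for the whole layer
per_neuron = RationalActivation(512)  # one coefficient set per neuron (more parameters)
```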

Our claim regarding polynomial approximations is based on [1]: a feedforward network whose activation functions are polynomials is not a universal approximator, whereas a non-polynomial activation such as PAU keeps the network a universal approximator. However, we did not try other function families.

[1] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993. (http://www2.math.technion.ac.il/~pinkus/papers/neural.pdf)
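Purely as a toy illustration of the rational-vs-polynomial question (not an experiment from the paper or this repo): the snippet below fits a degree-(5,4) rational function and a degree-9 polynomial, both with 10 coefficients, to ReLU on a grid by least squares. The degrees, grid, and optimizer settings are arbitrary choices.

```python
# Toy least-squares fit of ReLU on [-3, 3]: rational vs. polynomial with the
# same number of coefficients. Arbitrary hyperparameters; not from the paper.
import torch

x = torch.linspace(-3, 3, 2000)
target = torch.relu(x)

def fit(params, predict, steps=3000, lr=1e-2):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((predict(x) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()

# rational P_5(x) / (1 + |Q_4(x)|): 6 + 4 = 10 coefficients
a = (0.1 * torch.randn(6)).requires_grad_()
b = (0.1 * torch.randn(4)).requires_grad_()
def rational(x):
    num = sum(a[k] * x ** k for k in range(6))
    den = 1 + abs(sum(b[k] * x ** (k + 1) for k in range(4)))
    return num / den

# plain degree-9 polynomial: 10 coefficients
c = (0.1 * torch.randn(10)).requires_grad_()
def poly(x):
    return sum(c[k] * x ** k for k in range(10))

print("rational   MSE:", fit([a, b], rational))
print("polynomial MSE:", fit([c], poly))
```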