meyer-lab / mechanismEncoder

Developing patient-specific phosphoproteomic models using mechanistic autoencoders

Parameter transformation from inflation to mechanistic model #15

Closed · FFroehlich closed this issue 3 years ago

FFroehlich commented 3 years ago

The encoder needs to expand the latent space to coefficients that modulate model parameters. This is currently implemented as

np.power(10, T.nnet.sigmoid(T.dot(embedded_data, W_p) + bias)*2*a - a)

This has the advantage that the output is well behaved insofar as the coefficients live in logarithmic space: the exponent is bounded by [-a, a] (a=1 as of now), so the coefficients themselves lie in [10^-a, 10^a]. It is straightforward to change those bounds by tuning a, but we will always end up with a bounded space, which may be problematic for training, since it can introduce local critical points and lead to small gradients in the encoder parts. One could of course drop the sigmoid transformation to remove the bounds, but that previously led to issues with numerical integration (which may be fixable by smarter initialization schemes, though). @aarmey are there any other commonly used activation functions beyond sigmoid and ReLU?
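For reference, here is a minimal self-contained sketch of the current transform (the shapes and the final check are assumptions for illustration, not the actual encoder code):

import numpy as np
import theano
import theano.tensor as T

a = 1.0                                    # bound on the log10-coefficient
embedded_data = T.matrix('embedded_data')  # (n_samples, n_latent); shape assumed
W_p = T.matrix('W_p')                      # (n_latent, n_model_pars); shape assumed
bias = T.vector('bias')

# sigmoid squashes to (0, 1); the affine map *2*a - a moves the exponent into (-a, a)
log10_coeff = T.nnet.sigmoid(T.dot(embedded_data, W_p) + bias) * 2 * a - a
coeff = 10 ** log10_coeff                  # same transform as the np.power(10, ...) line above

f = theano.function([embedded_data, W_p, bias], coeff)
out = f(np.random.randn(4, 3), np.random.randn(3, 5), np.zeros(5))
assert np.all((out >= 10 ** -a) & (out <= 10 ** a))  # coefficients stay in [10^-a, 10^a]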

aarmey commented 3 years ago

This should be fine. I'm not sure this is exactly the same as the vanishing-gradients problem, because there aren't multiple layers of these bounds. If parameters end up way off, where the gradients explode in the reverse pass, L2 regularization of the parameter values should help.
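A minimal sketch of what such a penalty could look like (hypothetical names and weight, not actual project code; here the penalty acts on the log10-coefficients so the parameters are pulled back towards their nominal values):

import theano.tensor as T

def penalised_loss(fit_loss, log10_coeff, lambda_l2=1e-3):
    # pull the log10-coefficients towards 0, i.e. the coefficients towards 1
    return fit_loss + lambda_l2 * T.sum(log10_coeff ** 2)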

FFroehlich commented 3 years ago

I don't think this is the classical vanishing-gradients problem that appears when stacking multiple layers. For |x| >> 0, the gradient of T.nnet.sigmoid(x) is simply very small, so convergence will be slow. L2 regularization of x will definitely help, but I am not sure whether it will address the problem of minima at the boundaries.
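To make the scale concrete, a quick plain-NumPy check (illustrative only) of how fast the sigmoid derivative collapses:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# sigmoid'(x) = s(x) * (1 - s(x)); for |x| >> 0 this is effectively zero
for x in (0.0, 2.0, 5.0, 10.0):
    s = sigmoid(x)
    print(f"x={x:5.1f}  sigmoid={s:.6f}  gradient={s * (1 - s):.2e}")
# x=10.0 already gives a gradient of roughly 4.5e-05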

aarmey commented 3 years ago

I agree it's a potential concern. The only two solutions I know of, though, are (1) making sure your boundaries are generous enough to avoid them, and (2) regularization to again avoid the boundaries. Luckily, you'll be able to see if any parameters are off in a region with shrunken gradients.

Flux (although in Julia) has a very nice list of commonly used activation functions. They will all have this issue, though.

https://fluxml.ai/Flux.jl/stable/models/nnlib/

Actually, maybe Swish would be one potentially interesting alternative? It's non-monotonic, so it sort of provides a local minimum at the boundary to avoid going off into shrunken gradient space.

https://arxiv.org/abs/1710.05941
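For reference, a small plain-NumPy sketch of Swish and its derivative (beta = 1; illustrative only, not taken from the paper):

import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

def swish_grad(x, beta=1.0):
    s = 1.0 / (1.0 + np.exp(-beta * x))
    return s + beta * x * s * (1.0 - s)

# non-monotonic: a shallow minimum near x = -1.28, then unbounded growth for x > 0
for x in (-5.0, -1.28, 0.0, 2.0, 10.0):
    print(f"x={x:6.2f}  swish={swish(x):8.4f}  grad={swish_grad(x):8.4f}")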

FFroehlich commented 3 years ago

I agree, Swish does indeed look like a good alternative. I will investigate the problem a bit more thoroughly and then check whether switching to Swish helps.