mala-project / mala

Materials Learning Algorithms. A framework for machine learning materials properties from first-principles data.
https://mala-project.github.io/mala/
BSD 3-Clause "New" or "Revised" License

Make activation applied to the output layer optional #567

Open · elcorto opened 2 months ago

elcorto commented 2 months ago

I saw @nerkulec's branch, which made me realize that by default, FeedForwardNet applies an activation to the output layer. This is somewhat unusual for regression (see e.g. here, eq. 5.1.3).

If this is done to enforce positive LDOS values, then the choice would probably need to be restricted to ReLU only.

However, having any activation here (including ReLU) prevents using a MALA model together with the fast last-layer code path in https://github.com/aleximmer/Laplace. There is an alternative way to achieve the same, but this has not been benchmarked extensively for the MALA case so far.
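For illustration, here is a minimal sketch of the constraint, assuming the laplace-torch API from the repository above; the layer sizes and data are placeholders, not MALA's actual configuration:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from laplace import Laplace  # https://github.com/aleximmer/Laplace

# Stand-in for a MALA feed-forward net (descriptors -> LDOS). For the fast
# last-layer path, the network has to end in a plain nn.Linear so that it can
# be split into "feature extractor + linear head".
model = nn.Sequential(
    nn.Linear(91, 400), nn.LeakyReLU(),
    nn.Linear(400, 400), nn.LeakyReLU(),
    nn.Linear(400, 250),  # no output activation
)

# Dummy data only to make the sketch runnable; shapes are illustrative.
x = torch.randn(64, 91)
y = torch.randn(64, 250)
loader = DataLoader(TensorDataset(x, y), batch_size=16)

la = Laplace(model, "regression",
             subset_of_weights="last_layer",
             hessian_structure="kron")
la.fit(loader)
# An activation after the final nn.Linear breaks the "features + linear head"
# decomposition that this code path relies on.
```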

In any case, I think such an activation should be opt-in (or opt-out). It should, however, not be removed, in order to stay backward compatible with results already produced with the current code version. One could, for instance, introduce a bool setting parameters.network.use_output_activation. Then the logic here

https://github.com/mala-project/mala/blob/0fed9cc707979d2cdc0227c7a3eed4fd1fbbd291/mala/network/network.py#L214-L224

needs to be adapted as well.
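As a rough, hypothetical sketch (not the actual MALA builder code) of how such a flag could gate the output activation:

```python
from torch import nn

def build_ffnet(layer_sizes, activation, use_output_activation=True):
    # Hypothetical builder, not MALA's actual code: appends the activation
    # after every hidden layer, and after the output layer only if the
    # proposed flag (e.g. parameters.network.use_output_activation) is set.
    layers = []
    n_linear = len(layer_sizes) - 1
    for i in range(n_linear):
        layers.append(nn.Linear(layer_sizes[i], layer_sizes[i + 1]))
        is_output_layer = (i == n_linear - 1)
        if not is_output_layer or use_output_activation:
            layers.append(activation())
    return nn.Sequential(*layers)

# Current default behaviour (backward compatible): activation on the output, too.
net_old = build_ffnet([91, 400, 400, 250], nn.Sigmoid, use_output_activation=True)
# Proposed opt-out: plain linear output layer for regression.
net_new = build_ffnet([91, 400, 400, 250], nn.Sigmoid, use_output_activation=False)
```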

nerkulec commented 2 months ago

I created the branch after seeing that by default we have "normal" scaling + an activation function after the last layer of the network. But then I realized that the defaults actually correspond to minmax scaling (as you noted in #482) and the sigmoid activation function, so this doesn't necessarily pose a huge problem, since they share the same value range [0, 1]. That said, I still believe that the last layer shouldn't use an activation function, but I agree it should be left as an option for the user.

elcorto commented 2 months ago

In regression we generally don't want to clamp the model outputs to a range, i.e. we want the identity as the output activation. The intuition is that when the model faces unseen out-of-distribution data, outputs should be allowed to be "as wrong as possible", for instance far too large. When clamping to a range, model outputs might still be in the right ballpark, but for the wrong reasons. But I haven't tested this specific case, I have to admit.
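A toy illustration of the clamping effect (not a MALA model; layer sizes and the "out-of-distribution" inputs are made up):

```python
import torch
from torch import nn

torch.manual_seed(0)
# Toy example: the same untrained trunk, read out once directly (identity)
# and once through a sigmoid output activation.
trunk = nn.Sequential(nn.Linear(4, 16), nn.LeakyReLU(), nn.Linear(16, 1))
x_ood = 100.0 * torch.randn(8, 4)  # pretend inputs far from the training data

raw = trunk(x_ood)            # identity output: free to be arbitrarily wrong
clamped = torch.sigmoid(raw)  # sigmoid output: always squeezed into (0, 1)
print(raw.min().item(), raw.max().item())
print(clamped.min().item(), clamped.max().item())
```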

Having ReLU as the last-layer activation may be justified to enforce positive LDOS values, since it acts as the identity for positive inputs. But this is still problematic for the reason explained above. Enforcing positive outputs could perhaps also be done with a penalty term in the loss.
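For instance, a hypothetical loss along those lines could look like this (the penalty weight and function name are made up):

```python
import torch
from torch import nn

def mse_with_positivity_penalty(pred, target, weight=1.0):
    # Hypothetical loss, not part of MALA: standard MSE plus a soft penalty
    # on negative predictions, as an alternative to a ReLU output activation.
    # The penalty weight would be a tunable hyperparameter.
    mse = nn.functional.mse_loss(pred, target)
    negativity = torch.relu(-pred)  # only negative outputs contribute
    return mse + weight * negativity.pow(2).mean()

# Usage in a training step: loss = mse_with_positivity_penalty(model(x), y, weight=10.0)
```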

The Sigmoid activation is especially problematic, I think, since (i) it clamps the outputs to a range and (ii) it adds a non-linearity at the end which doesn't make the model more expressive, since it merely transforms the final values without passing them on to further layers.

I have a hotfix in my dev branch which adds parameters.network.ff_use_output_activation (one cannot add a constructor keyword to FeedForwardNet due to the way __new__ works in Network), but I haven't adapted the code snippet above, which handles the case where people supply different activations for different layers. One might also think about simplifying the API by letting people pass a torch.nn.Sequential object that simply gets called in Network.forward(), but that's a bit off-topic here :)
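To make the torch.nn.Sequential idea concrete, a hypothetical sketch (this is not MALA's current Network class):

```python
import torch
from torch import nn

class SequentialNetwork(nn.Module):
    # Hypothetical sketch of the API idea above: the user hands in a
    # ready-made torch.nn.Sequential, and forward() simply calls it.
    # Whether the last layer carries an activation is then entirely up
    # to the user.
    def __init__(self, module: nn.Sequential):
        super().__init__()
        self.module = module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.module(x)

net = SequentialNetwork(nn.Sequential(
    nn.Linear(91, 400), nn.LeakyReLU(),
    nn.Linear(400, 250),  # plain linear output for regression
))
print(net(torch.randn(2, 91)).shape)
```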