maxjcohen / transformer

Implementation of Transformer model (originally from Attention is All You Need) applied to Time Series.
https://timeseriestransformer.readthedocs.io/en/latest/
GNU General Public License v3.0

Why the sigmoid in the transformer? #49

Closed diegoquintanav closed 2 years ago

diegoquintanav commented 3 years ago

I don't understand why there's a sigmoid in

https://github.com/maxjcohen/transformer/blob/2ebed9c4027199d491288f755a12adba6b42d727/tst/transformer.py#L152

As far as I understand, the sigmoid maps an output to the range [0, 1], which is useful in binary classification problems. I don't get how it works here, though.

maxjcohen commented 3 years ago

This activation function ensures outputs are constrained to [0, 1], and, as with most activation functions, it increases the expressiveness of the network. You should be able to remove it without any problem!
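
For reference, a minimal sketch of what that last step does (the layer and variable names here are illustrative, not the exact ones from `transformer.py`):

```python
import torch

# Illustrative final step of the forward pass: a linear projection
# followed by a sigmoid that squashes every value into [0, 1].
d_model, d_output = 64, 1
linear = torch.nn.Linear(d_model, d_output)

decoder_output = torch.randn(8, 100, d_model)  # (batch, time, d_model)
output = linear(decoder_output)                # unbounded real values
output = torch.sigmoid(output)                 # constrained to [0, 1]

# Removing the sigmoid simply means returning the raw linear output,
# whose range is unbounded.
```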

diegoquintanav commented 3 years ago

Right, but here it is not acting as an activation function; it is acting as the final layer that maps the outcome of the last decoder to meaningful values in the context of the problem. What I mean is that using a sigmoid only works if the target is scaled to [0, 1] (e.g. by min-max scaling). Moreover, without scaling, you would be propagating the gradient of a loss that does not make sense, like $L(y_{\text{unscaled}}, \hat{y}_{\in [0,1]})$.
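
To illustrate with a toy sketch (not code from this repo): if the targets are left unscaled, the MSE against a sigmoid output is dominated by the scale mismatch, because the prediction can never leave [0, 1]:

```python
import torch

# Toy example: unscaled targets vs. a sigmoid-bounded prediction.
target_unscaled = torch.tensor([250.0, 300.0, 275.0])  # raw, unscaled values
prediction = torch.sigmoid(torch.randn(3))             # always inside [0, 1]

# The loss is numerically valid, but the gradient can only push the sigmoid
# towards 1; the prediction can never reach the unscaled targets.
loss = torch.nn.functional.mse_loss(prediction, target_unscaled)
print(loss)  # huge, and irreducible below roughly (y - 1)^2
```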

I hope my understanding is correct :D thanks for replying!

maxjcohen commented 3 years ago

Although the sigmoid is added after the last layer, I believe it still acts as an activation function; I can't see why it couldn't. Here, the dataset is indeed scaled to [0, 1], see https://github.com/maxjcohen/transformer/blob/2ebed9c4027199d491288f755a12adba6b42d727/src/dataset.py#L73-L80
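
For context, the scaling is essentially min-max normalization; a simplified sketch (see the linked lines for the actual implementation):

```python
import numpy as np

def minmax_scale(x: np.ndarray) -> np.ndarray:
    """Map x linearly onto [0, 1]; simplified compared to dataset.py."""
    return (x - x.min()) / (x.max() - x.min())

y = np.array([250.0, 300.0, 275.0])
y_scaled = minmax_scale(y)  # now comparable with a sigmoid output
```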

diegoquintanav commented 3 years ago

What I meant is that, since this is not a binary classifier, using a sigmoid at the end is not meaningful in itself... and in this case, it happens to work -- the loss is numerically valid and you get real values in the [0, 1] range that can be compared to a scaled target that happens to be in the same range. I think this would be more evident if no scaling were applied.

But I may be wrong. That's what I'm trying to figure out :^) thanks again.

AtrCheema commented 3 years ago

@maxjcohen What if the activation on the last layer were made configurable, i.e. the user could choose which activation (if any) to use? This would make the Transformer class more flexible. Also, I have seen many cases where people recommend not using any activation on the output layer.

maxjcohen commented 3 years ago

Alright, so the reason I added this activation function is that, just like any other activation function, it increases the expressiveness of the network. As @diegoquintanav stated, we are not in the case of a binary classifier, but activation functions are not limited to that use case. As I previously mentioned, removing this layer shouldn't require too much work. If you wish to let the user choose the activation function (or the absence of one), you are welcome to send a PR; I'll review it right away.
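
If it helps anyone preparing such a PR, one possible shape for it (a sketch only; `output_activation` is a hypothetical parameter, not part of the current class signature) could be:

```python
import torch

class Transformer(torch.nn.Module):
    # Only the output stage is sketched here; `output_activation` is a
    # hypothetical constructor argument, not the current API.
    def __init__(self, d_model: int, d_output: int, output_activation=torch.sigmoid):
        super().__init__()
        self._linear = torch.nn.Linear(d_model, d_output)
        self._output_activation = output_activation

    def forward(self, x):
        output = self._linear(x)
        if self._output_activation is not None:
            output = self._output_activation(output)
        return output

# e.g. Transformer(64, 1, output_activation=None) returns the raw linear output.
```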