Closed: diegoquintanav closed this issue 2 years ago
This activation function ensures outputs are constrained to [0, 1], and, like most activation functions, it increases the expressiveness of the network. You should be able to remove it without any problem!
Right, but here it is not acting as an activation function; it is acting as the final layer that maps the output of the last decoder to meaningful values in the context of the problem. What I mean is that using a sigmoid only works if the target is scaled to [0, 1] (e.g. by min-max scaling). Moreover, without scaling, you would be propagating the gradient of a loss that does not make sense, such as $L(y_{\text{unscaled}}, \hat{y}_{[0, 1]})$.
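To illustrate the mismatch (a minimal sketch with made-up numbers, not the repo's data): a sigmoid-capped model can only predict values in (0, 1), so a loss against unscaled targets is dominated by the scale gap, while the same loss against min-max-scaled targets is meaningful.

```python
import numpy as np

# Hypothetical targets in their original units (not in [0, 1]).
y_raw = np.array([120.0, 340.0, 75.0, 510.0])

# Plausible sigmoid outputs: always in (0, 1).
y_pred = np.array([0.2, 0.7, 0.1, 0.9])

# MSE against the raw targets is huge regardless of prediction quality.
mse_unscaled = np.mean((y_raw - y_pred) ** 2)

# Min-max scaling brings the targets into [0, 1], where the comparison
# (and the gradient of the loss) actually makes sense.
y_scaled = (y_raw - y_raw.min()) / (y_raw.max() - y_raw.min())
mse_scaled = np.mean((y_scaled - y_pred) ** 2)

print(mse_unscaled, mse_scaled)
```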
I hope my understanding is correct :D thanks for replying!
Although the sigmoid is added after the last layer, I believe it still acts as an activation function; I can't see why it couldn't. Here, the dataset is indeed scaled to [0, 1]; see
https://github.com/maxjcohen/transformer/blob/2ebed9c4027199d491288f755a12adba6b42d727/src/dataset.py#L73-L80
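The linked dataset code isn't reproduced here, but min-max scaling of that kind typically looks like the following sketch (column-wise, assuming a 2-D array of samples × features):

```python
import numpy as np

def minmax_scale(x: np.ndarray) -> np.ndarray:
    """Scale each column of x into [0, 1] (a sketch of the kind of
    min-max scaling the linked dataset code performs)."""
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

data = np.array([[1.0, 10.0],
                 [3.0, 30.0],
                 [2.0, 50.0]])
scaled = minmax_scale(data)
print(scaled)
```

After scaling, every column spans exactly [0, 1], which is the range a sigmoid-capped output can match.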
What I meant is that since this is not a binary classifier, using a sigmoid at the end is not meaningful... and in this case it happens to work: the loss is numerically valid, and you get real values in the [0, 1] range that can be compared to a scaled target that happens to lie in the same range. I think this would be more evident if no scaling were applied.
But I may be wrong. That's what I'm trying to figure out :^) thanks again.
@maxjcohen What if the activation on the last layer were made configurable, i.e. the user could choose which activation to use? This would make the Transformer class more flexible. Also, in many cases I see people recommending not to use any activation on the output layer.
Alright, so the reason I added this activation function is that, just like any other activation function, it increases the expressiveness of the network. As @diegoquintanav stated, we are not in the case of a binary classifier, but activation functions are not limited to that use case. As I previously mentioned, removing this layer shouldn't require too much work. If you wish to let the user dynamically choose the activation function (or the absence of one), you are welcome to send a PR; I'll review it right away.
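A configurable output head could be sketched along these lines (this `OutputHead` class and its parameters are hypothetical, not the repo's actual code; it just shows the idea of passing the final activation, or `None`, as an argument):

```python
import torch
import torch.nn as nn

class OutputHead(nn.Module):
    """Hypothetical final layer: the user chooses the output activation
    (or none) instead of a hard-coded sigmoid."""

    def __init__(self, d_model: int, d_output: int, activation=None):
        super().__init__()
        self.linear = nn.Linear(d_model, d_output)
        # `activation` may be any callable, e.g. torch.sigmoid,
        # or None to return the raw linear output.
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.linear(x)
        return self.activation(y) if self.activation is not None else y

# Usage: a sigmoid-capped head vs an unconstrained one.
head_sig = OutputHead(8, 1, activation=torch.sigmoid)
head_raw = OutputHead(8, 1, activation=None)
x = torch.randn(4, 8)
out_sig = head_sig(x)
out_raw = head_raw(x)
print(out_sig.shape, out_raw.shape)
```

With `activation=None` the head returns the raw linear output, which matches the common recommendation of no activation on a regression output.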
I don't understand why there's a sigmoid in
https://github.com/maxjcohen/transformer/blob/2ebed9c4027199d491288f755a12adba6b42d727/tst/transformer.py#L152
As far as I understand, the sigmoid maps an output to the range [0, 1], which is useful in binary classification problems. I don't get how it works here, though.