Some more thoughts on 2: Softmax and CCE loss only make sense when used together, AFAIK. When calculating the gradients, we can branch into two separate options: softmax + CCE, or standard. In other words, classification vs. regression.
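To make the branch concrete, here is a minimal sketch of the classification side, assuming hypothetical names (`softmax`, `softmax_cce_gradient`, and "transfers" for the raw node outputs) rather than the actual codebase's API. The useful property is that the gradient of CCE applied after softmax collapses to `activation - expected` with respect to the transfers, which is why the two only make sense together:

```rust
/// Numerically stable softmax over the raw node outputs (transfers).
/// (All names here are illustrative, not from the actual codebase.)
fn softmax(transfers: &[f64]) -> Vec<f64> {
    let max = transfers.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = transfers.iter().map(|t| (t - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Classification branch: the gradient of CCE-after-softmax w.r.t.
/// the transfers is simply `activation - expected`, so no standalone
/// softmax derivative is ever needed.
fn softmax_cce_gradient(transfers: &[f64], expected: &[f64]) -> Vec<f64> {
    softmax(transfers)
        .iter()
        .zip(expected)
        .map(|(a, e)| a - e)
        .collect()
}
```

The standard/regression branch would keep the usual chain of loss derivative times activation derivative.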
Option 3: Implement classification/regression as a network trait.
This means that anyone can add their own 'mode', and optimizers can be implemented for any mode (see the sketch below).
Cons:
- The softmax derivative and activation implementations (+ CCE loss) would still have to be implemented together.
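A rough sketch of what the trait could look like (names such as `Mode` and `output_gradient` are hypothetical, not from the actual codebase):

```rust
/// The "mode" of a network: classification, regression, or any
/// user-defined variant. Optimizers can stay generic over `M: Mode`.
trait Mode {
    /// Final output-layer activation (softmax for classification,
    /// identity for regression).
    fn output_activation(&self, transfers: &[f64]) -> Vec<f64>;

    /// Gradient of the loss w.r.t. the output-layer transfers.
    fn output_gradient(&self, transfers: &[f64], expected: &[f64]) -> Vec<f64>;
}

struct Regression;

impl Mode for Regression {
    fn output_activation(&self, transfers: &[f64]) -> Vec<f64> {
        transfers.to_vec() // identity
    }

    fn output_gradient(&self, transfers: &[f64], expected: &[f64]) -> Vec<f64> {
        // MSE gradient w.r.t. the transfers (up to a constant factor).
        transfers.iter().zip(expected).map(|(a, e)| a - e).collect()
    }
}

struct Network<M: Mode> {
    mode: M,
    // layers, weights, optimizer state, ...
}
```

A `Classification` mode would pair softmax with the fused CCE gradient from the earlier sketch, which is exactly the coupling listed as a con.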
Option 4: Do not implement softmax as an activation; instead, bake it into the CCE loss.
Explanation:
As I understand it now, the softmax derivative needs the activation of the node (which can be calculated easily), but also the expected activation of that node. Currently, the derivative of an activation function is expected to require only the transfer, from which it can calculate its own activation if needed.
That is why I think we can safely apply softmax only when computing the loss of the network. It doesn't have to be exclusive to CCE, but it can stay there until some other loss function needs it. Worst case, it is a very simple implementation, so duplicating it shouldn't be that bad.
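A minimal sketch of Option 4, assuming a hypothetical `cce_loss` function rather than the actual loss API: because the loss already receives both the transfers and the expected values, softmax never has to fit the activation-derivative signature at all.

```rust
/// CCE with softmax baked in: takes the raw transfers and applies
/// softmax internally via the log-sum-exp trick for stability.
/// (Name and signature are illustrative, not the actual API.)
fn cce_loss(transfers: &[f64], expected: &[f64]) -> f64 {
    let max = transfers.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let log_sum_exp = max
        + transfers.iter().map(|t| (t - max).exp()).sum::<f64>().ln();
    // CCE after softmax: -sum_i y_i * ln(softmax(z)_i)
    //                  = -sum_i y_i * (z_i - log_sum_exp)
    transfers
        .iter()
        .zip(expected)
        .map(|(z, y)| -y * (z - log_sum_exp))
        .sum()
}
```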
To implement the softmax derivative we can either: