The loss function for multi-label learning

In your paper you define a particular loss function in equation (4). Where is this loss function derived from? Is there a source for it, or is it your own contribution?

Also, in your implementation: https://github.com/Microsoft/FERPlus/blob/master/src/train.py#L44

you take the logarithm after finding the maximum, rather than applying it to the predictions, which is what log(q) in equation (4) shows. Is there a particular reason for this? I would appreciate some more elaboration.
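
To make the question concrete, here is a minimal sketch of the orderings as I understand them. It is plain Python, not the CNTK code from train.py. The probabilities, the 0/1 mask, and the function names are all hypothetical and chosen only for illustration; I am also assuming the target acts as a 0/1 mask over labels.

```python
import math

# Hypothetical values, chosen only for illustration.
q = [0.1, 0.6, 0.3]   # predicted probabilities (e.g. a softmax output)
mask = [0, 1, 1]      # 1 where a label is accepted as a target, else 0


def loss_log_after_max(q, mask):
    """-log(max_k mask_k * q_k): the logarithm is taken after the maximum."""
    return -math.log(max(m * p for m, p in zip(mask, q)))


def loss_log_before_max(q, mask):
    """-max_k log(q_k), with the maximum restricted to labels where mask_k == 1."""
    return -max(math.log(p) for m, p in zip(mask, q) if m)


def loss_masked_log(q, mask):
    """-max_k mask_k * log(q_k): the mask multiplies the logarithm directly."""
    return -max(m * math.log(p) for m, p in zip(mask, q))


a = loss_log_after_max(q, mask)
b = loss_log_before_max(q, mask)
c = loss_masked_log(q, mask)
print(a, b, c)
```

Because the logarithm is monotonically increasing, the first two variants give the same value. The third differs: every masked-out label contributes 0, which is larger than any log-probability, so the loss collapses to zero whenever at least one label is masked out. If this is the reason for taking the logarithm after the maximum, a confirmation would be helpful.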

the loss function for multi label learning #8