I was in the 05/09/2018 class before the TrainAI conference, and a fellow student reported better accuracy when replacing categorical_crossentropy with binary_crossentropy; I saw the same improvement on two architectures (perceptron, MLP) and possibly more.
I'd like to ask/discuss here: what does the mathematics look like when the binary cross-entropy loss is applied to multi-class classification? I'm speculating that the way it's computed (even though binary cross-entropy is meant for 2 labels) happens to benefit accuracy on this problem.
Toy example and my guess:
label = [0, 0, 1, 0, 0], predict = [0.1, 0.1, 0.6, 0.1, 0.1]
categorical_crossentropy(label, predict) = -log(0.6)
binary_crossentropy(label, predict) = -log(0.6) - 4*log(0.9)
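To sanity-check my guess numerically, here is a minimal NumPy sketch of both losses exactly as I wrote them above (note this is my own sum-form check, not Keras' exact reduction; I believe Keras' binary_crossentropy additionally averages over the 5 positions, so it would report this sum divided by 5):

```python
import numpy as np

# Toy example from above
label   = np.array([0., 0., 1., 0., 0.])
predict = np.array([0.1, 0.1, 0.6, 0.1, 0.1])

# Categorical cross-entropy: only the true class contributes.
cce = -np.sum(label * np.log(predict))          # -log(0.6) ~= 0.511

# Binary cross-entropy: each position is treated as an independent
# two-class problem, so the 0-labels are also penalized via log(1 - p).
bce = -np.sum(label * np.log(predict)
              + (1. - label) * np.log(1. - predict))
                                                # -log(0.6) - 4*log(0.9) ~= 0.932

print(cce, bce)
```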
Interesting observation! If you're still interested in this question, I recommend you ask it on our Slack forum for ML engineers and enthusiasts: bit.ly/slack-forum.