I was in the 05/09/2018 class before the TrainAI conference, and a fellow student reported better accuracy when replacing categorical_crossentropy with binary_crossentropy; I saw the same improvement on two architectures (perceptron, MLP) and possibly more.
I'd like to ask/discuss here: what does the mathematics look like when the binary cross-entropy loss is applied to multi-class classification? I'm speculating that the way it's computed (even though binary cross-entropy is meant for 2 labels) happens to benefit accuracy on this problem.
Toy example and my guess:
label = [0, 0, 1, 0, 0], predict = [0.1, 0.1, 0.6, 0.1, 0.1]
categorical_crossentropy(label, predict) = -log(0.6)
binary_crossentropy(label, predict) = -log(0.6) - 4*log(0.9)
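To sanity-check my guess numerically, here is a minimal NumPy sketch of both losses exactly as I wrote them above (note this is my own sum-form check, not Keras' exact reduction; I believe Keras' binary_crossentropy additionally averages over the 5 positions, so it would report this sum divided by 5):

```python
import numpy as np

# Toy example from above
label   = np.array([0., 0., 1., 0., 0.])
predict = np.array([0.1, 0.1, 0.6, 0.1, 0.1])

# Categorical cross-entropy: only the true class contributes.
cce = -np.sum(label * np.log(predict))          # -log(0.6) ~= 0.511

# Binary cross-entropy: each position is treated as an independent
# two-class problem, so the 0-labels are also penalized via log(1 - p).
bce = -np.sum(label * np.log(predict)
              + (1. - label) * np.log(1. - predict))
                                                # -log(0.6) - 4*log(0.9) ~= 0.932

print(cce, bce)
```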
Interesting observation! If you're still interested in this question, I recommend you ask it on our Slack forum for ML engineers and enthusiasts: bit.ly/slack-forum.