lmartak / distill-nn-tree

Distillation of Neural Network Into a Soft Decision Tree
https://vgg.fiit.stuba.sk/people/martak/distill-nn-tree
MIT License
64 stars 29 forks

the loss proposed in the paper #4

Closed: Oussamab21 closed this issue 5 years ago

Oussamab21 commented 5 years ago

Hi, thank you for sharing this code, but I had a question: what do you think about the first log used inside the loss function on line 199? What is the aim of this log?

Also, why did you move the minus next to the second log and not the first log, as stated in the paper?

Thank you

lmartak commented 5 years ago

what do you think about the first log used inside the loss function on line 199? What is the aim of this log?

So as the path-probability-weighted sum of cross-entropies over the tree leaves goes to 0, the loss goes to -Inf, which imho helps to maintain higher sensitivity to small improvements during convergence. Without it, the error surface gives much weaker signals later in training about making small improvements from an already locally optimal solution.
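Here is a minimal NumPy sketch of that effect (toy numbers, not the repository's code): near convergence, the outer log turns a small absolute improvement of the weighted cross-entropy into a much larger drop in the loss.

```python
import numpy as np

# Hypothetical values of the path-probability-weighted sum of leaf
# cross-entropies at two consecutive training steps, close to convergence.
weighted_ce_before = 0.010
weighted_ce_after = 0.009

# Without the outer log, the loss improves only by the raw difference.
print(weighted_ce_before - weighted_ce_after)                  # 0.001

# With the outer log, the same small improvement produces a much larger
# drop, and the loss keeps decreasing towards -Inf as the weighted
# cross-entropy approaches 0.
print(np.log(weighted_ce_before) - np.log(weighted_ce_after))  # ~0.105
```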

why did you move the minus next to the second log and not the first log, as stated in the paper?

I am assuming this is just a typo in the paper. If you change the minus to correspond to formula (3) in the paper, you get a NaN loss, because the term inside the log() will always be negative.
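To see it on toy numbers (a sketch, not the actual code from line 199): with the minus on the outer log, as formula (3) literally reads, the argument of the outer log is negative and the result is NaN; moving the minus onto the inner log keeps the argument positive.

```python
import numpy as np

T = np.array([0.0, 1.0, 0.0])           # target distribution (one-hot here)
Q = np.array([[0.2, 0.7, 0.1],          # leaf output distributions Q^l
              [0.1, 0.2, 0.7]])
P = np.array([0.8, 0.2])                # path probabilities P^l(x)

# Formula (3) taken literally: -log(sum_l P^l * sum_k T_k log Q_k^l).
# The inner sums are negative, so the outer log gets a negative argument.
inner = (T * np.log(Q)).sum(axis=1)     # [-0.36, -1.61]
print(-np.log((P * inner).sum()))       # nan (log of a negative number)

# Minus moved onto the inner log: log(sum_l P^l * sum_k T_k * (-log Q_k^l)).
# The argument is now a positive weighted sum of cross-entropies.
ce = -(T * np.log(Q)).sum(axis=1)       # per-leaf cross-entropies, >= 0
print(np.log((P * ce).sum()))           # finite loss (~ -0.5 here)
```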

Oussamab21 commented 5 years ago

Hi, thank you for your reply. But what I still don't get is why they use a loss which is different from the classical cross-entropy loss? @lmartak

lmartak commented 5 years ago

they use a loss which is different from the classical cross-entropy loss

Naturally, but the difference only stems from the need to accommodate the difference between a single-output model architecture and this particular "hierarchical mixture of bigots" architecture.

In other words, a neural net with a single output layer only needs a single cross-entropy term for its loss. The Binary Soft Decision Tree as proposed in the paper (which in this particular instantiation implements a Hierarchical Mixture of Bigots), on the other hand, has multiple outputs (as many as there are leaves in the tree), and each one of them is used to calculate a different cross-entropy term which contributes to the total loss in proportion to the path probability of the given leaf. This mechanism is reused from the concept of a Hierarchical Mixture of Experts: the path probabilities distribute the "responsibility" for predictions among the leaves (experts).
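A small sketch of just that structural difference (illustrative numbers; the outer log discussed above is left out to keep the structure visible):

```python
import numpy as np

T = np.array([0.0, 1.0, 0.0])               # target distribution

# Single-output net: one predicted distribution, one cross-entropy term.
q_net = np.array([0.2, 0.7, 0.1])
loss_net = -(T * np.log(q_net)).sum()

# Soft decision tree with 4 leaves: one distribution per leaf plus a path
# probability per leaf; each leaf contributes its own cross-entropy term,
# weighted by how likely the input is to be routed to that leaf.
Q_leaves = np.array([[0.6, 0.3, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.3, 0.3, 0.4],
                     [0.2, 0.2, 0.6]])
P_leaves = np.array([0.1, 0.7, 0.1, 0.1])   # path probabilities, sum to 1

ce_per_leaf = -(T * np.log(Q_leaves)).sum(axis=1)
loss_tree = (P_leaves * ce_per_leaf).sum()  # mixture-of-experts style weighting
print(loss_net, loss_tree)
```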

The term Expert is replaced with the term Bigot in this instantiation because an Expert is usually a model which infers its prediction from the input, whereas in this paper each leaf is a learned set of parameters (determined during training) that constitutes a "class prediction" on its own, independent of the input. After training, during inference, the Bigots don't even look at the data when making predictions: each Bigot will always predict the same class, based on what it learned to predict during training. The only thing that determines the prediction of the whole tree is the path from the root to a specific Bigot, guided by the inner nodes, which choose the maximum-probability path inferred from the input.
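A toy inference sketch with hypothetical parameters (a depth-1 tree, not the repository's code): the inner node looks at the input only to pick a path, and the chosen Bigot then returns its fixed learned distribution regardless of the input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One inner node and two Bigot leaves (made-up learned parameters).
w, b = np.array([0.9, -1.2]), 0.1           # inner node filter, sees the input
leaf_logits = np.array([[2.0, -1.0, -1.0],  # left Bigot: always favours class 0
                        [-1.0, -1.0, 2.0]]) # right Bigot: always favours class 2

def predict(x):
    p_right = sigmoid(w @ x + b)            # routing probability from the input
    path_probs = np.array([1.0 - p_right, p_right])
    leaf = path_probs.argmax()              # follow the maximum-probability path
    return softmax(leaf_logits[leaf])       # the chosen Bigot's fixed distribution

print(predict(np.array([1.0, 0.5])))        # routed right -> class-2 distribution
print(predict(np.array([-2.0, 3.0])))       # routed left  -> class-0 distribution
```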

I guess this explains the bulk of the model proposed in the paper. Feel free to draw this for yourself to internalize the concept.

Good luck!