rikhuijzer / SIRUS.jl

Interpretable Machine Learning via Rule Extraction
https://sirus.jl.huijzer.xyz/
MIT License

Remarks on the binary classification tutorial #47

Closed · gdalle closed 1 year ago

gdalle commented 1 year ago

As part of my JOSS review, here is a first batch of comments on the tutorial:

auxillary nodes (Random forests & other sections)

axillary nodes, presumably

The model has learned three rules for this dataset. (Interpretation)

Isn't it 8?

This is done for all the rules and, finally, the rules are summed to obtain the final prediction. (Interpretation)

It took me a while to understand that they are weighted first. At the very least, saying "average" instead of "sum" would be clearer, but "weighted average" would be best.

By the way, do the weights sum to 1? And how are they obtained, since frequency ranking of the rules is already leveraged for pruning the ruleset?
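For the record, here is how I currently picture the combination step; a toy sketch to check my understanding, not the package's API (all names, scores, and weights below are invented):

```julia
# Toy sketch of the combination step; not the SIRUS.jl API.
struct Rule
    condition::Function   # row -> Bool
    if_score::Float64     # score when the condition holds
    else_score::Float64   # score when it does not
end

# Each rule votes with its if- or else-score; the votes are weighted and summed.
predict(rules, weights, row) = sum(
    w * (r.condition(row) ? r.if_score : r.else_score)
    for (r, w) in zip(rules, weights)
)

rules = [
    Rule(row -> row.nodes < 4.5, 0.1, 0.55),
    Rule(row -> row.age < 38.5, 0.3, 0.2),
]
weights = [0.6, 0.4]  # would (roughly?) sum to 1
predict(rules, weights, (nodes = 3, age = 41))  # 0.6 * 0.1 + 0.4 * 0.2 = 0.14
```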

The x-position on the left shows log(else-scores / if-scores) (Visualization)

Not sure what is meant here. I don't really understand the plots as a result.
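Taking the formula at face value, with invented scores, I would expect something like:

```julia
# Invented scores, just to test my reading of the formula.
if_score, else_score = 0.7, 0.3
log(else_score / if_score)  # ≈ -0.85, so this rule would sit left of zero
```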

all rules (fitted in the different cross-validation folds)

Shouldn't there be $8f$ rules, where $f$ is the number of folds? I see far fewer.

https://github.com/openjournals/joss-reviews/issues/5786

rikhuijzer commented 1 year ago

I've fixed most points in https://github.com/rikhuijzer/SIRUS.jl/commit/acee8b4b6d50bc4030aa17f3a872750736432972.

axillary nodes, presumably

Correct. Fixed.

Isn't it 8?

Correct. Fixed by linking the text to the actual Julia object.

It took me a while to understand that they are weighted first. At the very least, saying "average" instead of "sum" would be clearer, but "weighted average" would be best.

It should be fixed now.

By the way, do the weights sum to 1? And how are they obtained, since frequency ranking of the rules is already leveraged for pruning the ruleset?

Great question. I've tried to address this in the Implementation Overview in the docs. After reading the original papers and emailing the original author for clarification, this is what I wrote about it in the docs:

Finally, the weights are determined by converting the training data to a rule space. According to Clément Bénard, the best way is to fit a regression model on a dataset where each rule is a binary feature. The weights can be stabilized by using an L2-penalty (ridge); this helps because the rules are strongly correlated. He advised against an L1-penalty (lasso) because it would introduce additional sparsity and thus instability in the rule selection, since lasso is unstable with correlated features. Lastly, the weights are constrained to be positive, since this eases interpretability.

The code is at https://github.com/rikhuijzer/SIRUS.jl/blob/main/src/weights.jl, especially at

https://github.com/rikhuijzer/SIRUS.jl/blob/acee8b4b6d50bc4030aa17f3a872750736432972/src/weights.jl#L34-L77

Lines 71 to 76 show that the weights do indeed sum to roughly 1. Interestingly, the choice of weights has almost no effect in the classification case! For regression, though, it does matter, so I suspect that the weighting scheme is suboptimal and that this explains the low performance on regression tasks. I'll respond in more detail in https://github.com/rikhuijzer/SIRUS.jl/issues/51.
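To illustrate the idea, here is a rough, self-contained sketch of such a rule-space ridge fit; the function name, the toy data, and the crude clamping step are mine and differ from the actual implementation in src/weights.jl:

```julia
using LinearAlgebra

# Rough sketch only: the real logic is in src/weights.jl and handles the
# non-negativity constraint properly instead of clamping after the fact.
function rule_space_weights(R::Matrix{Float64}, y::Vector{Float64}; lambda=0.1)
    # Ridge (L2-penalized) least squares on the binary rule features.
    w = (R'R + lambda * I) \ (R'y)
    # Constrain the weights to be positive for interpretability.
    return max.(w, 0.0)
end

# Toy rule space: 4 samples, 2 rules (entry 1 means the rule's condition holds).
R = Float64[1 0; 1 1; 0 1; 0 0]
y = [1.0, 1.0, 0.0, 0.0]
rule_space_weights(R, y)  # ≈ [0.94, 0.03]; sums to roughly 1 on this toy data
```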

Shouldn't there be $8f$ rules, where $f$ is the number of folds? I see far fewer.

Very well spotted!!! There was a bug in the plotting function which is now fixed in https://github.com/rikhuijzer/SIRUS.jl/commit/acee8b4b6d50bc4030aa17f3a872750736432972.

gdalle commented 1 year ago

Okay by me!