rikhuijzer / SIRUS.jl

Interpretable Machine Learning via Rule Extraction
https://sirus.jl.huijzer.xyz/
MIT License

Code issues in the paper #54

Closed · sylvaticus closed this issue 1 year ago

sylvaticus commented 1 year ago

Hello, I am trying to replicate the example in the JOSS paper, but I have run into some issues.

  1. First, to reproduce it I need to use the whole register_haberman() and _haberman() functions. This data-processing step distracts from the objective of showing SIRUS, so perhaps you can find some other simple way to retrieve an already-nicely-formatted dataset to work on, or you may want to put X and y, already nicely formatted, somewhere that you (and the reader) can retrieve them with an HTTP.get or a Downloads.download...

  2. You are mixing the MLJ and non-MLJ APIs together, and I think this is distracting. For example, model = StableRulesClassifier(; max_depth=2, max_rules=8) is not the MLJ API. Instead it should be:

    modelType   = @load StableRulesClassifier pkg = "SIRUS"
    model       = modelType() 
  3. Related to the previous point, I can't find in the MLJ constructor the max_rules option that you describe in the paper.

  4. For me, the mach.fitresult call doesn't return the nice rule base that you present in the paper, but rather an ugly:

    julia> mach.fitresult
    SIRUS.StableForest{Float64}(Union{SIRUS.Leaf, SIRUS.Node}[SIRUS.ClassificationLeaf([0.5, 0.5, 0.0]), SIRUS.ClassificationLeaf([0.5, 0.0, 0.5]), SIRUS.ClassificationLeaf([0.0, 0.5, 0.5]), SIRUS.ClassificationLeaf([0.0, 0.0, 1.0]), SIRUS.ClassificationLeaf([0.0, 0.0, 1.0]), SIRUS.ClassificationLeaf([0.5, 0.0, 0.5]), SIRUS.ClassificationLeaf([0.5, 0.0, 0.5]), SIRUS.ClassificationLeaf([0.5, 0.5, 0.0]), SIRUS.ClassificationLeaf([1.0, 0.0, 0.0]), SIRUS.ClassificationLeaf([1.0, 0.0, 0.0])  …  SIRUS.ClassificationLeaf([0.5, 0.5, 0.0]), SIRUS.ClassificationLeaf([0.5, 0.5, 0.0]), SIRUS.ClassificationLeaf([0.0, 1.0, 0.0]), SIRUS.ClassificationLeaf([0.5, 0.5, 0.0]), SIRUS.ClassificationLeaf([0.5, 0.0, 0.5]), SIRUS.ClassificationLeaf([0.5, 0.0, 0.5]), SIRUS.ClassificationLeaf([0.0, 0.0, 1.0]), SIRUS.ClassificationLeaf([0.5, 0.0, 0.5]), SIRUS.ClassificationLeaf([0.5, 0.0, 0.5]), SIRUS.ClassificationLeaf([0.0, 1.0, 0.0])], SIRUS.Classification(), [0.0, 1.0, 2.0])
  5. I have tried to apply the model to the iris dataset, but I got predictions that don't even define a probability distribution (a bug for multiclass problems??):

julia> (X,y)       = @load_iris;
julia> modelType   = @load StableRulesClassifier pkg = "SIRUS";
[ Info: For silent loading, specify `verbosity=0`. 
import SIRUS ✔
julia> model       = modelType() ;
julia> mach        = machine(model, X, y);
julia> fit!(mach);
[ Info: Training machine(StableForestClassifier(rng = Random.TaskLocalRNG(), …), …).
[ Info: Converting outcome classes ["setosa", "versicolor", "virginica"] to [0, 1, 2].
julia> est_classes = predict(mach, X)
150-element CategoricalDistributions.UnivariateFiniteVector{Multiclass{3}, Float64, UInt8, Float64}:
 UnivariateFinite{Multiclass{3}}(0.0=>0.5, 1.0=>0.5, 2.0=>0.5)
 UnivariateFinite{Multiclass{3}}(0.0=>0.5, 1.0=>0.5, 2.0=>0.5)
 UnivariateFinite{Multiclass{3}}(0.0=>0.5, 1.0=>0.5, 2.0=>0.5)
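
To make the problem concrete: the class probabilities of each prediction should sum to 1.0, but here they sum to 1.5. A quick check (a sketch, assuming the CategoricalDistributions API that backs these predictions):

    using CategoricalDistributions: classes, pdf

    d = est_classes[1]
    # Sum the probability mass over all classes; this should be 1.0.
    sum(pdf(d, c) for c in classes(d))  # 0.5 + 0.5 + 0.5 = 1.5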

I believe it is useful to show an example that, while being as concise and simple as possible, is also reproducible.

https://github.com/openjournals/joss-reviews/issues/5786

rikhuijzer commented 1 year ago

This is a nice suggestion. Thank you for trying SIRUS out. Guillaume Dalle agrees on this issue in https://github.com/rikhuijzer/SIRUS.jl/issues/49. I've added a basic example to https://sirus.jl.huijzer.xyz/dev/basic-example/ and am adding it to the paper too. I'll request a PDF build in the review issue once I'm done updating the paper.

  1. First, to reproduce it I need to use the whole register_haberman() and _haberman() functions.

This should now be solved. It's still 9 lines of code, but I hope that is okay. I really like the DataDeps checksum, which ensures that the dataset is the right one.
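
For reference, the loader in the basic example follows this DataDeps pattern. Below is a minimal sketch; the URL and checksum are placeholders, not the real values from the tutorial:

    using CSV: CSV
    using DataDeps: DataDep, register, @datadep_str
    using DataFrames: DataFrame

    function load_haberman()::DataFrame
        # Placeholder URL and SHA256 checksum; DataDeps verifies the
        # downloaded file against the checksum before it is used.
        url = "https://example.com/haberman.csv"
        checksum = "insert-the-sha256-of-the-csv-here"
        register(DataDep("Haberman", "Haberman's Survival Data Set", url, checksum))
        path = joinpath(datadep"Haberman", "haberman.csv")
        return CSV.read(path, DataFrame)
    end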

  2. You are mixing the MLJ and non-MLJ APIs together, and I think this is distracting.

You're right about that. My reasoning was that the MLJ API is non-standard Julia, because @load loads symbols from packages in the background. I'm not a fan of such dynamic loading because it has bitten me a few times when trying to get R packages to work. Still, I have incorporated your suggestion in the tutorial and now show both ways. I hope that's okay.
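
Concretely, the tutorial now shows both styles (a sketch; verbosity=0 only silences the loading message, and see the next point about the registry bug in @load):

    import MLJ
    using SIRUS: StableRulesClassifier

    # MLJ registry way: @load imports the package and returns the model type.
    Model = MLJ.@load StableRulesClassifier pkg = "SIRUS" verbosity = 0
    model = Model(; max_depth=2, max_rules=8)

    # Plain-Julia way: construct the type imported directly from SIRUS.
    model = StableRulesClassifier(; max_depth=2, max_rules=8)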

  3. Related to the previous point, I can't find in the MLJ constructor the max_rules option that you describe in the paper.

You found a bug! Thank you. It turns out that @load StableRulesClassifier pkg="SIRUS" actually loaded the StableForestClassifier (which doesn't have the max_rules hyperparameter) instead of the StableRulesClassifier. I hope that https://github.com/JuliaAI/MLJModels.jl/pull/526 will be merged so that the issue gets resolved. Until then, using SIRUS: StableRulesClassifier works.
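
In other words, the following bypasses the registry entirely (a sketch, reusing the X and y from your iris example above):

    using MLJ: machine, fit!
    using SIRUS: StableRulesClassifier  # avoids the buggy @load registry entry

    model = StableRulesClassifier(; max_depth=2, max_rules=8)
    mach = machine(model, X, y)
    fit!(mach)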

  4. For me, the mach.fitresult call doesn't return the nice rule base that you present in the paper, but rather an ugly:

This is also caused by the StableForestClassifier versus StableRulesClassifier mix-up.

  5. I have tried to apply the model to the iris dataset, but I got predictions that don't even define a probability distribution (a bug for multiclass problems??):

Do you mean that the output should be a different type than UnivariateFinite? I chose this type because it's the MLJ.jl standard. Also, the random forest model with these predictions is tested via MLJ and achieves quite good performance scores; see for example the GitHub Actions Job Summary at https://github.com/rikhuijzer/SIRUS.jl/actions/runs/6171051743/attempts/1#summary-16748537050. The summary shows that the StableForestClassifier often performs similarly to the XGBoostClassifier.
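
For what it's worth, UnivariateFinite predictions are meant to be consumed with the standard MLJ tools; here is a sketch of the usual pattern (using the mach and X from your iris example, where the classes were converted to 0.0, 1.0, and 2.0):

    using MLJ: pdf, predict, predict_mode

    probs = predict(mach, X)      # vector of UnivariateFinite distributions
    pdf(probs[1], 0.0)            # probability mass assigned to one class
    yhat = predict_mode(mach, X)  # hard (point) class predictions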