rikhuijzer / SIRUS.jl

Interpretable Machine Learning via Rule Extraction
https://sirus.jl.huijzer.xyz/
MIT License
31 stars 2 forks source link

Comparison with original implementation #50

Closed gdalle closed 1 year ago

gdalle commented 1 year ago

I couldn't find the original implementation linked in any of the papers by the SIRUS authors, did I miss something?

In addition to a link, it would be nice to provide quantitative arguments for the claims found eg. in the docs:

this Julia implementation is more easy to inspect

this implementation is fast

https://github.com/openjournals/joss-reviews/issues/5786

rikhuijzer commented 1 year ago

Thanks for your suggestions and doing the JOSS review. I'll use these comments to document the responses to the issues that you raised.

In addition to a link, it would be nice to provide quantitative arguments for the claims found eg. in the docs:

Yes you're right about the use of "this implementation is fast" in the documentation. Apart from optimizing a bit (e.g., https://github.com/rikhuijzer/SIRUS.jl/commit/7ef3cb59ad9c7a14c97234f91933dbaf883b7fbb, https://github.com/rikhuijzer/SIRUS.jl/commit/60b56b2ce52a4816d8aeb209b886c822ade99cd7, https://github.com/rikhuijzer/SIRUS.jl/commit/5797c8aa6bacb0e8951b1d467a1c3b5ac8376ba2, https://github.com/rikhuijzer/SIRUS.jl/commit/82c90153cb484f4b45d19275b57f37a0555b8f8c, and https://github.com/rikhuijzer/SIRUS.jl/commit/141d61b4b4ef2e3528381a8d0381d32cc474712c), I haven't spent much time on it so it's best to not make any claims there. I've removed the sentence from the docs (https://github.com/rikhuijzer/SIRUS.jl/commit/d502a4e15a832920e1135d9dcf3c70c5139b1e6d).

I've added quantitative arguments in the form of lines of code (5f6bbd142fcd31c4aa9da871104a775e304ef0c5). It's a really nice idea to add those metrics since it clearly shows the difference between C++ and R (10k + 2k = 12k lines of code) versus Julia (2k lines of code). I've obtained these numbers via

 ~/git/sirus (master) > wc -l src/*
       6 src/AAA_check_cpp11.cpp
     275 src/Data.cpp
     235 src/Data.h
      48 src/DataChar.cpp
      71 src/DataChar.h
      69 src/DataDouble.h
      36 src/DataFloat.cpp
      65 src/DataFloat.h
      84 src/DataRcpp.h
      43 src/DataSparse.cpp
      75 src/DataSparse.h
    1035 src/Forest.cpp
     273 src/Forest.h
     334 src/ForestClassification.cpp
      77 src/ForestClassification.h
     342 src/ForestProbability.cpp
      77 src/ForestProbability.h
     255 src/ForestRegression.cpp
      56 src/ForestRegression.h
     362 src/ForestSurvival.cpp
      71 src/ForestSurvival.h
     106 src/globals.h
       4 src/Makevars
       4 src/Makevars.win
     399 src/rangerCpp.cpp
     118 src/RcppExports.cpp
     596 src/Tree.cpp
     191 src/Tree.h
     762 src/TreeClassification.cpp
     111 src/TreeClassification.h
     766 src/TreeProbability.cpp
     120 src/TreeProbability.h
     723 src/TreeRegression.cpp
      96 src/TreeRegression.h
     910 src/TreeSurvival.cpp
     117 src/TreeSurvival.h
     589 src/utility.cpp
     537 src/utility.h
      39 src/utilityRcpp.cpp
   10077 total

~/git/sirus (master)> wc -l R/*
      46 R/formula.R
     708 R/ranger.R
      19 R/RcppExports.R
     856 R/sirus.R
     550 R/sirus_utility.R
      74 R/utility.R
    2253 total

 ~/git/SIRUS.jl (main) > wc -l src/*
      90 src/classification.jl
     268 src/dependent.jl
      62 src/empiricalquantiles.jl
     385 src/forest.jl
      27 src/helpers.jl
     403 src/mlj.jl
      59 src/regression.jl
     408 src/rules.jl
      87 src/ruleshow.jl
      47 src/SIRUS.jl
      97 src/weights.jl
    1933 total

I couldn't find the original implementation linked in any of the papers by the SIRUS authors, did I miss something?

A link has been added to the docs (https://github.com/rikhuijzer/SIRUS.jl/commit/d502a4e15a832920e1135d9dcf3c70c5139b1e6d) and paper (https://github.com/rikhuijzer/SIRUS.jl/pull/36/commits/b60215d7c0fa65f464e8ccffecf014f9a8131064).

gdalle commented 1 year ago

very cool!