rikhuijzer / SIRUS.jl

Interpretable Machine Learning via Rule Extraction
https://sirus.jl.huijzer.xyz/
MIT License

Performance evaluation may hint at bug? #51

Closed · gdalle closed this 1 year ago

gdalle commented 1 year ago

In the paper, you state

> For the multiclass Iris classification and the Boston Housing regression datasets, the performance was worse than the other models. It could be that this is caused by a bug in the implementation or because this is a fundamental issue in the algorithm.

Isn't it possible to compare with the numerical results of the original paper to validate the implementation?

https://github.com/openjournals/joss-reviews/issues/5786

gdalle commented 1 year ago

See also #50

rikhuijzer commented 1 year ago

> Isn't it possible to compare with the numerical results of the original paper to validate the implementation?

Sorry for taking so long to respond. This was a very good question, and it took me a while to find time to dig into it. I hadn't considered that.

First, I'll summarize the results here from what we reported in our paper and what the original paper reported for the Diabetes, Haberman, and Titanic datasets:

| Dataset  | SIRUS           | SIRUS.jl    | Difference |
|----------|-----------------|-------------|------------|
| Diabetes | 1 - 0.19 = 0.81 | 0.75 ± 0.05 | -7%        |
| Haberman | 1 - 0.35 = 0.65 | 0.67 ± 0.06 | 5%         |
| Titanic  | 1 - 0.17 = 0.83 | 0.83 ± 0.02 | 0%         |

Here, the SIRUS scores come from the Project Euclid paper by Bénard et al. (2021; Table 4). I've converted the 1 - AUC scores from that paper back to AUC scores. The SIRUS.jl scores come, again, from the CI run for version 1.3.2 with `max_rules = 10`.
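To make the conversion explicit, here is a small Python sketch (not part of SIRUS.jl) that converts the reported 1 - AUC values back to AUC and recomputes the relative differences from the rounded means. Because the means are rounded, the recomputed percentages may differ slightly from the table:

```python
# Reported scores: 1 - AUC for SIRUS (Bénard et al., 2021, Table 4)
# and the mean AUC for SIRUS.jl from the CI run mentioned above.
scores = {
    # dataset: (1 - AUC for SIRUS, mean AUC for SIRUS.jl)
    "Diabetes": (0.19, 0.75),
    "Haberman": (0.35, 0.67),
    "Titanic": (0.17, 0.83),
}

for dataset, (one_minus_auc, jl_auc) in scores.items():
    sirus_auc = 1 - one_minus_auc  # convert back to AUC
    # Relative difference of SIRUS.jl versus SIRUS, in percent.
    diff = (jl_auc - sirus_auc) / sirus_auc * 100
    print(f"{dataset}: SIRUS {sirus_auc:.2f}, SIRUS.jl {jl_auc:.2f} ({diff:+.0f}%)")
```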

Given that the scores are reasonably similar even though the cross-validation splits differ, I see no reason to believe that SIRUS.jl performs worse or better than SIRUS on these three classification datasets.

For regression, I'll try to compare against the datasets from their PMLR paper at http://proceedings.mlr.press/v130/benard21a. There, they report the unexplained variance in Table 3.
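For reference, a minimal sketch of the unexplained-variance metric, assuming it means the mean squared error divided by the variance of the targets (i.e. 1 - R²); the function name is mine, not from SIRUS.jl or the paper:

```python
def unexplained_variance(y_true, y_pred):
    """Fraction of target variance not explained by the predictions,
    i.e. MSE / Var(y_true), which equals 1 - R^2 (assumption)."""
    n = len(y_true)
    mean = sum(y_true) / n
    var = sum((y - mean) ** 2 for y in y_true) / n   # variance of the targets
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    return mse / var

# A perfect predictor scores 0; always predicting the mean scores 1.
y = [1.0, 2.0, 3.0, 4.0]
print(unexplained_variance(y, y))          # → 0.0
print(unexplained_variance(y, [2.5] * 4))  # → 1.0
```

Lower is therefore better in Table 3, which is worth keeping in mind when comparing against the SIRUS.jl regression scores.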

I'll get to this at the end of this week or at the beginning of next week. Apologies for the delay; I keep having PhD-related obligations popping up.