Reduce package dependencies (e1071, and randomForest)?

pa-nathaniel commented 1 year ago

I am struggling to install FFTrees on a machine because randomForest fails to install (a dependency issue on an M2 Mac). It's really frustrating, and it feels like a shame to have all of the great FFTrees functionality gated on being able to use randomForest as a competitive algorithm.

This makes me wonder: what would the pros and cons of reducing dependencies be? Including non-essential dependencies is generally discouraged, and the more I think about it, randomForest and the other packages used as competitive algorithms are definitely not essential for seeing the benefits of FFTrees.

How about removing randomForest, and maybe e1071 (for svm()), as dependencies and just using rpart::rpart() (CART) and logistic regression (lr) as competitive algorithms?

I feel like 99.9% of users won't miss it and it could reduce the barrier to entry.
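As a rough illustration of what the reduced benchmark set could look like, here is a minimal sketch that uses only rpart and base R's glm() on simulated data. The data, the variable names, and the 0.5 classification threshold are assumptions made up for this example; it is not how FFTrees runs its comparisons internally.

```r
# Hypothetical sketch: benchmark with only CART (rpart) and logistic regression (glm).
library(rpart)  # recommended package, shipped with standard R installations

# Simulated binary classification data (assumption: purely illustrative).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- factor(ifelse(x1 + 0.5 * x2 + rnorm(n) > 0, "yes", "no"),
             levels = c("no", "yes"))
dat   <- data.frame(y = y, x1 = x1, x2 = x2)
train <- dat[1:150, ]
test  <- dat[151:200, ]

# CART: classification tree fit with rpart::rpart().
cart_fit  <- rpart(y ~ x1 + x2, data = train, method = "class")
cart_pred <- predict(cart_fit, newdata = test, type = "class")

# Logistic regression: glm() models the probability of the second level ("yes").
lr_fit  <- glm(y ~ x1 + x2, data = train, family = binomial)
lr_prob <- predict(lr_fit, newdata = test, type = "response")
lr_pred <- ifelse(lr_prob > 0.5, "yes", "no")

# Proportion of correct test-set predictions per algorithm.
accuracy <- c(
  cart = mean(as.character(cart_pred) == as.character(test$y)),
  lr   = mean(lr_pred == as.character(test$y))
)
print(accuracy)
```

Neither of these needs randomForest or e1071 to be installed.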

@hneth what do you think?

hneth commented 1 year ago

I haven't experienced this barrier, so far, but you're raising a valid and important point, of course.

So far, I've been viewing the competitive algorithms in the FFTrees package as a nice add-on with more benefits than costs. Given the lack of a generally accepted gold standard and the availability of a vast range of possible classification strategies, it's crucial to compare the performance of FFTs to some alternative models. The current range links and contrasts our seemingly naive trees with fancier methods typically associated with buzz-words like "statistical modeling" and "machine learning". And while I suspect that many users appreciate the automatic availability of such performance benchmarks, it's highly undesirable when enabling these benchmarks prevents them from installing and using FFTrees.

Hence, perhaps the key questions and trade-offs here are:

1. What proportion of users is lost due to such dependency issues?
2. Would winning them back outweigh the cost to existing users of having to create their own benchmarks?
3. What other costs do we incur by excluding or including the benchmarks?

With regard to 3: Beyond their technical demands, another critical issue with highly sophisticated alternative benchmarks is that our default usage often fails to exploit their full capacity. This is unavoidable and to be expected, as we're not even trying to optimize the performance of those algorithms. But when someone then finds a superior solution (e.g., by using RLR instead of LR), enthusiasts of those alternative algorithms (or skeptics of FFTs) may construe our omission as a general argument against simpler strategies. Hence, removing non-optimized alternatives could also preempt accusations that our competition is not "fair" or "objective" (which may often be justified, not because of bias or malice, but simply because we're devoting more attention and effort to our favored model than to its alternatives).

ndphillips commented 5 months ago

I take your points. I suspect most people who want to compare the effectiveness of FFTrees to other algorithms should be using packages built for that purpose (such as tidymodels and parsnip) rather than using the (somewhat hacky) solutions we built into this package.

I think it would be wise to:

1. Remove all competing algorithms from FFTrees
2. Update version number to a new minor (or major?) number to indicate this is a breaking change
3. Provide a link to other packages and recommended workflows if people want to do a comparison between FFTrees and other algorithms.
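To make the last point a bit more concrete, a recommended workflow could look roughly like the sketch below, where the benchmark models are fit entirely outside FFTrees through parsnip. The simulated data and all object names are assumptions for illustration only, and the block simply skips itself if parsnip or rpart is not installed.

```r
# Hypothetical workflow sketch: benchmark models are fit outside FFTrees via parsnip.
set.seed(1)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- factor(ifelse(dat$x1 + 0.5 * dat$x2 + rnorm(n) > 0, "yes", "no"),
                levels = c("no", "yes"))
train <- dat[1:150, ]
test  <- dat[151:200, ]

benchmark_acc <- NULL
if (requireNamespace("parsnip", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  # Model specifications: logistic regression (glm engine) and a CART tree (rpart engine).
  specs <- list(
    lr   = parsnip::set_engine(parsnip::logistic_reg(), "glm"),
    cart = parsnip::set_engine(parsnip::decision_tree(mode = "classification"), "rpart")
  )
  # Fit each specification on the training data and score it on the test data.
  benchmark_acc <- sapply(specs, function(spec) {
    fitted <- parsnip::fit(spec, y ~ x1 + x2, data = train)
    pred   <- predict(fitted, new_data = test)$.pred_class
    mean(as.character(pred) == as.character(test$y))
  })
  print(benchmark_acc)
} else {
  message("parsnip (or rpart) is not installed; skipping the benchmark sketch.")
}
```

Users who want random forests or SVMs could add further specifications with whichever engines they have installed, without FFTrees having to depend on them.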

I'll create a PR for this, but since it's a major change, I won't merge it until I get a review from @hneth.

ndphillips / FFTrees

Reduce package dependencies (e1071, and randomForest)? #180