A more comprehensive benchmark would study datasets of different structure (e.g. dense vs. sparse), how results change as a function of hyperparameter values, how the distributed systems scale out to multiple nodes, larger datasets, and the like.
One can see that the linear models’ accuracy increases only slightly from 100K to 1M observations and is virtually the same for 1M and 10M. This is because a simple linear structure can be extracted from a smaller dataset, and adding more data points will not shift the classification boundary significantly. More complex models such as random forests, on the other hand, can keep improving with increasing data size by further refining the classification boundary.
An interesting point to note is that the AUC of the random forest trained on 100K observations is higher than the AUC of a linear model trained on 10M observations. The answer to the frequently posed question “more data or better algorithms?” is therefore “it depends” (rather than the often-argued “more data usually beats better algorithms”).
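The effect is easy to reproduce in miniature. The sketch below is not the benchmark’s actual code: it uses a synthetic, mildly non-linear dataset from scikit-learn and much smaller training sizes as stand-ins for the 100K/1M/10M splits, purely to illustrate how a linear model’s AUC plateaus while a random forest keeps improving with more data.

```python
# Minimal sketch (illustrative assumptions only, not the benchmark code):
# compare AUC of a linear model and a random forest as the training set grows.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data as a stand-in for a real benchmark dataset.
X, y = make_classification(n_samples=120_000, n_features=20,
                           n_informative=10, random_state=42)
X_train, X_test = X[:100_000], X[100_000:]
y_train, y_test = y[:100_000], y[100_000:]

for n in (1_000, 10_000, 100_000):   # stand-ins for 100K / 1M / 10M
    lr = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1,
                                random_state=42).fit(X_train[:n], y_train[:n])
    auc_lr = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])
    auc_rf = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    print(f"n={n:>7}  linear AUC={auc_lr:.3f}  random forest AUC={auc_rf:.3f}")
```

On data with non-linear structure, a run like this typically shows the linear model’s AUC flattening out early while the random forest trained on the smallest subset already matches or beats the linear model trained on the full set, mirroring the “it depends” conclusion above.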
Without taking the time to understand the tradeoffs and make an informed decision, many will spend a large part of their time fighting the tools instead of gaining value and insights from their data.