interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License
6.04k stars 714 forks

How much data is required for EBM? #493

Open basnetpro3 opened 5 months ago

basnetpro3 commented 5 months ago

Is there a minimum dataset size requirement for EBM? I have read papers where other black-box models perform well even with dataset sizes around 50 or below 100. Can we use EBM if we have around 100 samples, or between 100 and 200?

richcaruana commented 5 months ago

Because EBMs are a restricted model class that remains intelligible, their simplicity means they do not need large amounts of data compared to some other model types such as neural nets or boosted decision trees. In practice their data complexity is more comparable to linear and logistic regression than to deep neural nets, but they do often need, or benefit from, more data than a linear model would.

The more features in the dataset, the more data you need to learn an accurate model, and the more complex the function needed for each feature, the more data is needed to shape those functions accurately, so it is difficult to give numbers without knowing more about the data and problem. Our experience is that useful models with a few dozen features can be trained on data with 500 or more cases if the data is not too imbalanced.

I like to look at the size of the smallest important class when I think about data size. If there are 10k training cases, but the data is only 1% positives, then there are only 100 positive cases and the data no longer behaves like a large 10k sample. There is also a difference between classification and regression: regression can often work with fewer samples because there is more information in the label of each sample compared to Boolean classification, where the label is only 0 or 1.
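The "smallest important class" heuristic above can be sketched in a few lines of plain Python (the function name here is illustrative, not part of any library):

```python
# Hypothetical helper: gauge effective sample size by the rarest class,
# per the heuristic that a 10k dataset with 1% positives behaves more
# like a 100-case sample than a 10k one.
from collections import Counter


def smallest_class_size(labels):
    """Return the count of the rarest class in `labels`."""
    counts = Counter(labels)
    return min(counts.values())


# 10,000 rows, but only 1% positives:
labels = [1] * 100 + [0] * 9900
print(smallest_class_size(labels))  # -> 100
```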

In summary, EBMs are reasonably sample-efficient: they need somewhat more data than linear methods, but usually not as much as more complex black-box methods such as neural nets and unrestricted boosted trees, and they often work well with sample sizes of about 1000 cases or more. If there are very few samples for training, it sometimes helps to adjust the EBM hyperparameters to do more outer bagging, use fewer bins, and even grow shorter trees.

basnetpro3 commented 5 months ago

Thank you very much, sir. I really like the EBM model. It means that even with 3 or 4 features we can still get insights from EBM using less data. Some research papers using EBM have used around 300 samples or fewer. From reading research papers alone we can't really say how much data is required for a particular model, because every paper's dataset size varies, from 50 to 100 and more.