interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License

Query: performance prospects on massive data sets (curse of dimensionality?) #513

Open tylerjereddy opened 2 months ago

tylerjereddy commented 2 months ago

There are some nice articles about how EBMs gain the best of both worlds (performance and explainability), and generally I've found that to be true. However, we've been working on an exceptionally high-dimensional data set in the bioinformatics domain (shape: ~900 records × ~860,000 float64 features/dimensions). Are there any published results describing acceptable/reasonable performance in this kind of scenario? Conversely, are there any descriptions of practical limits on the number of features (dimensions)?

What about prospects for improvement in the future? It would be really neat to be able to assess feature importance on enormous design matrices that are refractory to many feature importance techniques. For example, with almost a million features and ~1/3 of them fairly highly correlated with each other, approaches like random forest feature importance might seem appealing, and certainly for performance they can be, but the randomness in feature selection can also dilute the importance of correlated features.

As a more concrete description of what we're seeing: if I wait 4.5 hours on a compute node with 6 TB of memory, I see a gradual increase in memory footprint to 700 GiB of RAM (slow compared to, say, a serious memory leak), but no indication of progress (I don't think there's a verbose mode?), and that's with only a single tree for benchmarking purposes:

     from pathlib import Path
     from interpret.glassbox import ExplainableBoostingClassifier

     ebm_file = "explain_data.p"
     if not Path(ebm_file).exists():
-        ebm = ExplainableBoostingClassifier()
+        ebm = ExplainableBoostingClassifier(max_rounds=1)
         ebm.fit(X_train, y_train)  # about 860,000 features

Anyway, I'm not complaining, just wondering whether this is tractable or simply unreasonable even in the long term. For comparison, random forest (sklearn) takes about 6-7 minutes for 10,000 estimators, though closer to an hour when using concurrent OOB (out-of-bag) estimates for sanity checking. While I'd obviously expect a parallel ensemble technique to be faster than a sequential one, in this case I've reduced the sequence of trees to length 1 (modulo any internal aggregation at each level that I may not understand). Looks like we're using interpret 0.5.1.
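Roughly, the baseline I'm comparing against looks like the sketch below (the exact arguments are illustrative; only the estimator count and the OOB option correspond to the timings above):

    # Rough sketch of the random forest baseline (illustrative arguments).
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=10_000,  # ~6-7 minutes on this data
        oob_score=True,       # concurrent out-of-bag estimate; closer to an hour
        n_jobs=-1,            # trees built in parallel, unlike sequential boosting
    )
    rf.fit(X_train, y_train)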

paulbkoch commented 2 months ago

Hi @tylerjereddy -- Glad to hear you've been finding that EBMs perform well on many of your datasets, and hopefully we can make EBMs work on this one too. We can currently process about 1,000,000,000 sample-feature items per day (for classification). Your problem, at 900 × 860,000 = 774,000,000 sample-feature items, would be expected to take about 1 day under this formula; however, there are two caveats:

1) The 1,000,000,000 number assumes a higher ratio of samples to features than your dataset has. With only 900 samples the code will spend more time in Python, where it is slower, and things like tree building take comparatively more time. I think you're still looking at something in the low number of days with this workload, but it's hard to say for sure without trying it. If you run it with max_rounds=1 and then multiply that time by about 2000 (since EBMs typically run that many rounds), you should get a good estimate (see the sketch after this list).

2) By default, we examine all possible pairs to detect interactions. Normally this is fine, but with 860,000 features you have many billions of possible interactions, and that is going to make it impractical to detect them automatically. If you have specific features that you think are going to be useful, you can specify them manually with something like interactions=[(1, 2), (5, 100), ("feature1", "feature2")]; otherwise set interactions=0. With 900 samples the interactions are going to be pretty useless anyway.
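A minimal sketch of that timing estimate, with pairwise interaction detection disabled per caveat 2 (X_train/y_train are the arrays from the snippet above; the 2000 multiplier is just the approximate round count mentioned in caveat 1):

    # Time one boosting round with interaction detection off, then scale up.
    import time
    from interpret.glassbox import ExplainableBoostingClassifier

    start = time.perf_counter()
    ebm = ExplainableBoostingClassifier(max_rounds=1, interactions=0)
    ebm.fit(X_train, y_train)
    one_round = time.perf_counter() - start

    # EBMs typically run ~2000 rounds, so a rough estimate of the full fit is:
    print(f"estimated full fit: ~{one_round * 2000 / 3600:.1f} hours")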

Please let us know how this works out for you. It's useful for us to get this kind of feedback.

paulbkoch commented 2 months ago

On the question of memory leaks: I think what you're observing is due to the normal memory fragmentation you'd expect to find in this kind of process. We run valgrind on our nightly build, and we've only once (to my knowledge) had a memory leak that survived a few days in the code. Obviously, memory leaks are an area where there can be surprises, but you're running with mostly default parameters, which is a code path that should be fairly well tested.

paulbkoch commented 2 months ago

One more tip: We run the outer bags on separate cores, and by default we leave one core free on the machine. If you have 8 cores, then setting outer_bags to 7 will be ideal in terms of CPU utilization. The 0.5.1 release increases the default number of outer bags to 14, so reducing outer_bags would improve speed unless you have more than 14 cores. If you have a big machine with more cores, the model can benefit a little from the extra hardware if you set outer_bags to the number of cores minus one.
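For example, a sketch of that core-matching suggestion (querying the core count with os.cpu_count() is just one way to do it):

    # Sketch: set outer bags to the number of cores minus one.
    import os
    from interpret.glassbox import ExplainableBoostingClassifier

    n_cores = os.cpu_count() or 2
    ebm = ExplainableBoostingClassifier(outer_bags=max(1, n_cores - 1))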