interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License

Question: Parallel boosting? #519

Open DerWeh opened 2 months ago

DerWeh commented 2 months ago

While EBMs are an incredible model, a main pain point is that they are relatively slow to fit (compared to random forests). Bagging can easily be parallelized, while boosting is inherently sequential. However, in the special case of EBMs, I get the impression that it should be possible to parallelize the boosting algorithm over the features.

In the explanations of EBMs, it is mentioned that a tiny tree is fit to each feature in round-robin fashion with a very small learning rate, such that the order of the features doesn't matter. If the order doesn't matter, shouldn't it be possible to fit all features in parallel? Of course, the results will be slightly different (in the same way that the results would differ slightly under the round-robin algorithm if we shuffled the features).
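The round-robin vs. all-features-at-once distinction can be sketched with a toy cyclic-boosting loop. This is purely illustrative and not the libebm algorithm: per-bin mean residuals stand in for the tiny trees, and the features are pre-binned integers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_bins, lr = 2000, 8, 0.01
X = rng.integers(0, n_bins, size=(n, 3))        # 3 features, pre-binned
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, n)

def fit(X, y, rounds, parallel):
    shapes = np.zeros((X.shape[1], n_bins))     # one score per (feature, bin)
    pred = np.zeros(len(y))
    for _ in range(rounds):
        if parallel:
            resid = y - pred                    # one shared residual snapshot
        for j in range(X.shape[1]):
            if not parallel:
                resid = y - pred                # round-robin: refresh every step
            counts = np.bincount(X[:, j], minlength=n_bins)
            sums = np.bincount(X[:, j], weights=resid, minlength=n_bins)
            # tiny "tree": lr-scaled per-bin mean residual for feature j
            update = lr * np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
            shapes[j] += update
            pred += update[X[:, j]]
    return shapes, pred

for parallel in (False, True):
    _, pred = fit(X, y, rounds=300, parallel=parallel)
    print("parallel" if parallel else "cyclic  ",
          "MSE:", round(float(np.mean((y - pred) ** 2)), 4))
```

With a small learning rate and few features, both variants converge to nearly the same fit; in the parallel variant all features within a round share the same stale residual.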

For now, this is mostly a theoretical question, as I haven't investigated the C++ core yet. The main question is, of course, what kind of speed-up is possible if the boosting can indeed be parallelized. After all, communication is necessary after each boosting round of fitting a tree for each feature. This probably depends on how expensive fitting the trees is compared to the cost of calculating the residuals.

paulbkoch commented 2 months ago

I agree with your assessment @DerWeh that we should be able to parallelize boosting on multiple features at a time. I think there is a reasonable limit though where if there were for example 1000 features, we might be able to boost 20-ish in parallel. If all features were truly uncorrelated then we could probably boost more at a time, but in practice there's often quite a bit of correlation between features, so we probably want to be careful about how we handle this. I've heard from someone who tried doing all features simultaneously that it didn't work as well as expected, but I think doing a subset is still a viable possibility. This is something that we can prototype using the existing python interface (in the boost function) before trying to implement it for real with all the parallelism baggage that it would entail.

There's another avenue to parallelization though that comes with potentially fewer complications. We first bin the features into 1024 bins (by default), so if we subdivide the dataset into subsets of several tens of thousands, we can build the histograms for each subset independently, and then use a reduce operation to merge the histograms afterwards. In theory this should allow EBMs to be massively parallelized in clusters, etc.
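A minimal numpy sketch of that histogram-merge idea (illustrative only, with a single feature and plain gradient sums standing in for the full per-bin statistics): because the histograms are pure sums, building them per subset and reducing gives exactly the same result as one pass over the full dataset.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

rng = np.random.default_rng(1)
n_bins, subset_n = 1024, 50_000
x = rng.integers(0, n_bins, size=4 * subset_n)   # one pre-binned feature
grad = rng.normal(size=x.size)                   # per-sample gradients

def histogram(i):
    # per-subset sums: sample counts and gradient totals per bin
    sl = slice(i * subset_n, (i + 1) * subset_n)
    return (np.bincount(x[sl], minlength=n_bins),
            np.bincount(x[sl], weights=grad[sl], minlength=n_bins))

with ThreadPoolExecutor() as pool:
    parts = list(pool.map(histogram, range(4)))

# reduce step: merging histograms is just elementwise addition
counts, grads = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), parts)

# identical to building the histogram over the full dataset in one pass
assert np.array_equal(counts, np.bincount(x, minlength=n_bins))
assert np.allclose(grads, np.bincount(x, weights=grad, minlength=n_bins))
```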

I think ultimately someday when EBMs are built on clusters and in GPUs we'll use both approaches.

paulbkoch commented 2 months ago

And we already have the concept of data subsets in C++ to prepare for this eventuality:

https://github.com/interpretml/interpret/blob/9cbb353e8cec27368d9d9a091c16cf7be16a469e/shared/libebm/DataSetBoosting.hpp#L26

And we can then merge the histograms from the subsets after constructing them with this function:

https://github.com/interpretml/interpret/blob/develop/shared/libebm/ConvertAddBin.cpp

DerWeh commented 2 months ago

I think there is a reasonable limit though where if there were for example 1000 features, we might be able to boost 20-ish in parallel. If all features were truly uncorrelated then we could probably boost more at a time, but in practice there's often quite a bit of correlation between features, so we probably want to be careful about how we handle this.

Very interesting point. Of course, even if the increment (learning rate) is small, the contribution is significant if I take many steps (one per feature). I've never tried this regime, as with so many features most of the (global) interpretability is probably lost. Was it also reported that feature order matters in this case?

This is something that we can prototype using the existing python interface (in the boost function) before trying to implement it for real with all the parallelism baggage that it would entail.

Thanks for pointing that out; I haven't made my way to the code of the boost function yet. When I get there, I'll give it a try.

There's another avenue to parallelization though that comes with potentially fewer complications. We first bin the features into 1024 bins (by default), so if we subdivide the dataset into subsets of several tens of thousands, we can build the histograms for each subset independently, and then use a reduce operation to merge the histograms afterwards. In theory this should allow EBMs to be massively parallelized in clusters, etc.

This is indeed a very interesting avenue, though I am not quite sure yet how to go about it. Maybe I first need to take a deeper look at the boosting routine. Could you elaborate a bit? Do you plan to naively (randomly) split the dataset into subsets, use the same binning, and then merge the final EBM, weighting each bin by the number of samples in that subset? Or is there some sophisticated way to split the dataset that results in disjoint bins? That seems quite hard due to the high dimensionality. Or have I missed the point completely?


Parallelism would give a massive boost to EBMs, as we have many (cheap) CPUs nowadays. I have too little experience with GPU programming, but if the algorithm is really suitable, that would of course be a tremendous speedup.

paulbkoch commented 2 months ago

I think there is a reasonable limit though where if there were for example 1000 features, we might be able to boost 20-ish in parallel. If all features were truly uncorrelated then we could probably boost more at a time, but in practice there's often quite a bit of correlation between features, so we probably want to be careful about how we handle this.

Very interesting point. Of course, even if the increment (learning rate) is small, the contribution is significant if I take many steps (one per feature). I've never tried this regime, as with so many features most of the (global) interpretability is probably lost. Was it also reported that feature order matters in this case?

I believe the person who tried it didn't care about feature order since they boosted all features together independently at the same time. There were other differences between their implementation and EBMs, and I think they used a higher learning rate, so I don't know how much of this will be valid in the domain of EBMs. Since it's easy enough to check, I figured that should be the plan when the time came.

There are some other issues that I didn't mention previously for brevity. For simplicity we (currently) start from a default score of 0 when we boost the mains. The first few boost steps essentially move the intercept closer to its ultimate value. If we had 1000 features being boosted together, they would collectively overshoot this intercept by a large amount. This can be corrected by initializing to a reasonable intercept, but hopefully it highlights a separate point: boosting can become unstable if you attempt to boost too many features at once. Even with the intercept initialized, this instability could still happen with correlated features where all the features would be positive in some sections of their graphs, etc. Despite all this, I do think the idea is valuable up to some reasonable point that is yet to be determined.
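The overshoot can be seen with a few lines of arithmetic on just the mean component (a deliberately crude model, ignoring per-bin structure): if every feature's update carries lr times the current mean residual and all F features share one residual snapshot, the model mean moves by F * lr * mean_residual per round, which diverges into oscillation once F * lr exceeds 1.

```python
# Toy illustration of boosting many features off one residual snapshot,
# starting from an initial score of 0. Values are chosen exact in binary
# (lr = 1/16) so the arithmetic below is exact.
mean_y, lr = 10.0, 0.0625
traces = {}
for n_features in (4, 16, 32):
    model_mean, trace = 0.0, []
    for _ in range(5):
        resid_mean = mean_y - model_mean      # same snapshot for every feature
        model_mean += n_features * lr * resid_mean
        trace.append(model_mean)
    traces[n_features] = trace
    print(n_features, [round(v, 2) for v in trace])
# F * lr = 0.25 converges smoothly; = 1.0 lands on the target in one round;
# = 2.0 oscillates between overshoot and 0 forever.
```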

There's another avenue to parallelization though that comes with potentially fewer complications. We first bin the features into 1024 bins (by default), so if we subdivide the dataset into subsets of several tens of thousands, we can build the histograms for each subset independently, and then use a reduce operation to merge the histograms afterwards. In theory this should allow EBMs to be massively parallelized in clusters, etc.

This is indeed a very interesting avenue, though I am not quite sure yet how to go about it. Maybe I first need to take a deeper look at the boosting routine. Could you elaborate a bit? Do you plan to naively (randomly) split the dataset into subsets, use the same binning, and then merge the final EBM, weighting each bin by the number of samples in that subset? Or is there some sophisticated way to split the dataset that results in disjoint bins? That seems quite hard due to the high dimensionality. Or have I missed the point completely?

Parallelism would give a massive boost to EBMs, as we have many (cheap) CPUs nowadays. I have too little experience with GPU programming, but if the algorithm is really suitable, that would of course be a tremendous speedup.

We already split the data into subsets within the C++ in order to use SIMD effectively. We need the SIMDable data to be a multiple of the SIMD packing size. We also set a maximum number of samples that can go into each subset with this parameter:

https://github.com/interpretml/interpret/blob/0cf3f2f3c4ea7b4cf498e732295defdb712e4ad5/shared/libebm/bridge/bridge.hpp#L29

If we wanted to parallelize this more at the thread level (and we do!), we might want to drop this number to something like 50k samples per subset. The samples can be put into the subsets without any restrictions from the feature contents.
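As a hypothetical sketch of such a partitioning scheme (not how libebm actually does it; the real constant lives in bridge.hpp, and the pack size and 50k cap here are assumptions): full subsets are kept at a multiple of the SIMD pack size, with any ragged remainder split off separately.

```python
def subset_sizes(n_samples, pack=16, cap=50_000):
    """Hypothetical partition of n_samples into SIMD-friendly subsets."""
    cap -= cap % pack                # full subsets are multiples of the pack size
    sizes = []
    while n_samples >= cap:
        sizes.append(cap)
        n_samples -= cap
    tail = n_samples - n_samples % pack
    if tail:
        sizes.append(tail)
    if n_samples % pack:
        sizes.append(n_samples % pack)   # ragged remainder, handled without SIMD
    return sizes

print(subset_sizes(123_456))
```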

There are two phases during training that are bottlenecks. The first phase builds histograms and the second applies a model update. During the histogram construction phase, for each feature we have bins, and for each bin of a feature we need to know the following values: sum_samples, sum_weights, sum_gradients, sum_hessians. Since all of these are just sums, we can build them in parallel for all subsets. Once we have the final sums (which we'd combine with a reduce operation), we do tree building and select cuts, which is not a speed bottleneck. Then we need to apply the score updates that we determined, which is the second compute bottleneck. We have an array that we need to index into by the feature values (interactions are a tensor) and add the value in the array to the existing per-sample scores. Since the samples are subdivided into subsets, we can again do this update in parallel across the subsets.
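Both phases can be sketched schematically in numpy (illustrative only; libebm does this in C++ with SIMD, tracks weights and hessians as well, and the tree-building step here is reduced to a plain per-bin mean):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)
n_bins, subset_n = 256, 50_000
x = rng.integers(0, n_bins, size=4 * subset_n)   # one binned feature
grad = rng.normal(size=x.size)                   # per-sample gradients
scores = np.zeros(x.size)                        # per-sample model scores
chunks = [slice(i * subset_n, (i + 1) * subset_n) for i in range(4)]

# Phase 1 (bottleneck): per-subset histogram sums, combined with a reduce.
def hist(sl):
    return (np.bincount(x[sl], minlength=n_bins),
            np.bincount(x[sl], weights=grad[sl], minlength=n_bins))

with ThreadPoolExecutor() as pool:
    parts = list(pool.map(hist, chunks))
counts = sum(p[0] for p in parts)
sum_grads = sum(p[1] for p in parts)

# ... tree building / cut selection on the merged histogram (not a bottleneck) ...
update = 0.01 * np.where(counts > 0, sum_grads / np.maximum(counts, 1), 0.0)

# Phase 2 (bottleneck): index the update array by bin value and add it to the
# per-sample scores. Each subset touches disjoint rows, so this parallelizes too.
def apply_update(sl):
    scores[sl] += update[x[sl]]

with ThreadPoolExecutor() as pool:
    list(pool.map(apply_update, chunks))
```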

In summary, EBMs should parallelize fairly well for most datasets, and this is something we could do on a GPU. If we have a dataset with a limited number of samples but many features, this won't work as well though, so we're back to the multiple-features-at-a-time parallelism that started this discussion.