interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License

Some hyperparameter questions #529

Open alvanli opened 4 weeks ago

alvanli commented 4 weeks ago

Hi,

Greediness

I saw that in 0.6.0, greediness was split into greedy_ratio and cyclic_progress.

greedy_ratio (float, default=1.5) – The proportion of greedy boosting steps relative to cyclic boosting steps. A value of 0 disables greedy boosting entirely.

cyclic_progress (bool or float, default=True) – This parameter specifies the proportion of the boosting cycles that will actively contribute to improving the model's performance. It is expressed as a bool or float between 0 and 1, with the default set to True (1.0), meaning 100% of the cycles are expected to make forward progress. If forward progress is not achieved during a cycle, that cycle will not be wasted; instead, it will be used to update internal gain calculations related to how effective each feature is in predicting the target variable. Setting this parameter to a value less than 1.0 can be useful for preventing overfitting.

From my understanding, the original greediness meant the proportion of rounds that take the feature split that best minimizes the residual (greedy) versus going through each feature in order (cyclic).

Max leaves

paulbkoch commented 4 weeks ago

Hi @alvanli -- Thank you for this question because it's both interesting and something we haven't talked about or documented in detail yet. It could also be important in some circumstances.

EBMs are by default no longer a purely cyclic algorithm as described in the original GA2M papers. If you prefer to restore the cyclic nature of EBMs, you can do this easily by setting greediness or greedy_ratio to zero. The current default algorithm is a hybrid that sits somewhere in between fully greedy and fully cyclic. I've been calling it "semi-greedy" boosting.
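For example, something like this restores the fully cyclic algorithm (illustrative sketch, assuming the 0.6.0 parameter name greedy_ratio):

```python
from interpret.glassbox import ExplainableBoostingClassifier

# Purely cyclic boosting, as in the original GA2M papers:
# no greedy rounds are scheduled when greedy_ratio is 0.
ebm = ExplainableBoostingClassifier(greedy_ratio=0.0)
```

On older versions, setting greediness=0 has the same effect.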

The greediness parameter has been available for some time, but only recently became the default. I'll start chronologically by describing the original greediness parameter. That will be good background for what greedy_ratio is now.

In the original cyclic algorithm, the definition of a round was a complete sweep of all the features, so if you had 100 features, you'd make 100 boosting steps per round. During each boosting step we'd calculate the per-feature update, but we'd also get, essentially for free, the gain calculation prior to the update. The interesting thing is, since our learning rate is so low, we make small incremental steps and the gain calculation remains relatively valid after applying the boosting update. We can use this by putting the slightly stale gain calculations into a priority queue and then greedily picking which feature to boost next. Generally, assuming correlations aren't too bad, boosting one feature doesn't affect the gain of other features too much, but even so we'd prefer to periodically resweep all the features to refresh their gain calculations. We originally did this by intermixing cyclic rounds with greedy rounds. The cyclic rounds handle the issue of refreshing the gains, and the greedy rounds allow the algorithm to focus on the most important features.
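To make the control flow concrete, here's a toy sketch of that loop (illustrative Python only, not the actual library internals; boost_feature is a made-up stand-in for a real boosting step that returns its pre-update gain):

```python
import heapq

def boost_feature(f, state):
    """Stand-in for one boosting step: apply a tiny update to feature f
    and return the gain computed just before the update."""
    state[f] += 1                   # pretend to apply the update
    return 1.0 / state[f]           # pretend the gain shrinks as f is boosted

def semi_greedy(features, n_rounds, greedy_per_cyclic=1):
    state = {f: 0 for f in features}
    for _ in range(n_rounds):
        # Cyclic round: sweep every feature, refreshing its gain for free.
        heap = []
        for f in features:
            gain = boost_feature(f, state)
            heapq.heappush(heap, (-gain, f))    # max-heap via negated gain
        # Greedy round: repeatedly boost the feature whose (slightly stale)
        # gain is highest, then re-queue it with its freshly computed gain.
        for _ in range(greedy_per_cyclic * len(features)):
            _, f = heapq.heappop(heap)
            gain = boost_feature(f, state)
            heapq.heappush(heap, (-gain, f))
    return state

print(semi_greedy(["age", "income", "bmi"], n_rounds=3))
```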

Why was this algorithm change desirable? Well, the main issue with purely cyclic boosting was that it tended to overfit some features and underfit others. This new algorithm seems to be better at finding the right amount of boosting to do on each feature. It appears, although we need more time to digest this, that some of the jaggedness on EBM graphs was due to late-stage overfitting of some features which would otherwise be smoother.

The original semi-greedy algorithm kept the number of boosting steps per round fixed. The greediness parameter defined how often to execute a greedy round vs a cyclic round. A good number used to be 0.5, which would alternate between greedy and cyclic rounds. Sometimes you might want 0.66, which would do 1 cyclic round for every 2 greedy rounds.

In the 0.6.0 release we made 2 slight changes to the algorithm. It's clear that the cyclic rounds need to have as many boosting steps as the number of features, since every feature needs to be visited during the round. The greedy rounds however do not need to rigidly obey the same restriction. If there are 100 features, we could allow 90 greedy boosting steps for example. The greedy_ratio sets the ratio of the number of boosting steps in the greedy rounds vs the cyclic rounds. So, for example, the value 1.0 is exactly equivalent to the old greediness value of 0.5, since with a ratio of 1.0 you would get 100 cyclic steps followed by 100 greedy steps intermixed. The default greedy_ratio is now set to 1.5, which means that if there are 100 features there will be 150 greedy boosting steps between cyclic rounds.
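In other words, greedy_ratio just scales the greedy step budget by the feature count. A small illustration of the arithmetic (not library code):

```python
n_features = 100

# Old semantics: greediness = fraction of rounds that are greedy,
# so greediness=0.5 meant one 100-step greedy round per cyclic round.
# New semantics: greedy steps per cyclic round = greedy_ratio * n_features.
for greedy_ratio in (0.0, 1.0, 1.5):
    greedy_steps = int(greedy_ratio * n_features)
    print(f"greedy_ratio={greedy_ratio}: {greedy_steps} greedy steps "
          f"per {n_features}-step cyclic round")
```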

The algorithm described above tends to work well on many datasets, but let's say for whatever reason you don't want to have any cyclic rounds at all. You can set greedy_ratio to some huge number that basically makes all boosting greedy, but then your algorithm is no longer periodically refreshing the gain calculations through cyclic rounds, which means that the gain calculations for features that aren't being greedily chosen during the greedy rounds get more and more stale. For boosting, it takes almost the same amount of work to calculate the gain as it does to calculate the update. To refresh the gain calculations we need to pay this cost, but instead of applying the update, we could just do the calculations and then throw away the update. This is what the cyclic_progress parameter is for. The easiest way to think about it is as a boolean indicating whether the updates are applied during the cyclic rounds or not. If the updates are not applied and greedy_ratio is set to a low value, then the algorithm becomes something a lot closer to what XGBoost does, but still retains the additive nature of EBMs.
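So a configuration like the following keeps the gain refreshes but only applies greedy updates (illustrative; the particular greedy_ratio value is just an example):

```python
from interpret.glassbox import ExplainableBoostingClassifier

# Cyclic rounds still run so the per-feature gains stay fresh, but their
# updates are discarded (cyclic_progress=False), so every applied boosting
# step is chosen greedily -- closer in spirit to XGBoost, while keeping
# the additive structure of an EBM.
ebm = ExplainableBoostingClassifier(greedy_ratio=1.0, cyclic_progress=False)
```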

Generally, the current defaults with greedy_ratio set to 1.5 seem to do well on many datasets, and they avoid the enormous computational cost that updating the gains on each boosting step would otherwise require.

paulbkoch commented 4 weeks ago

Hi @alvanli -- On the other questions about max_leaves:

The max_leaves parameter is currently only honored when boosting mains. For pairs we make one cut in one dimension and then make cuts on each side of the primary cut in the other dimension. At some point we'll implement the obvious improvement that makes pair cuts more flexible and allows for multiple cuts in each dimension.

FAST is currently even more restricted in that it chooses cuts in both dimensions that intersect in a cross. At some point we'll add options to allow for more complex tree growth, but the current FAST algorithm has the benefit of being FAST and easier to implement.

For mains, it appears the best value for max_leaves on many/most datasets is 3. 2 is pretty bad actually. 4 tends to be a bit worse than 3, but it's generally pretty close, so I exposed it as a hyperparameter to get feedback, and I could see it perhaps being better on some datasets. Incidentally, I think this property is one of the main advantages that EBMs have over XGBoost and similar algorithms, since XGBoost tends to grow trees to depth 6, which can cause overfitting on the leaf splits. The other benefit that EBMs have over XGBoost is that EBMs consider pair splits jointly, which would be cost prohibitive for unrestricted trees.
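If you want to check this on your own data, a quick sweep is straightforward since EBMs follow the scikit-learn API (sketch; the dataset and CV settings here are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from interpret.glassbox import ExplainableBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
for max_leaves in (2, 3, 4):
    ebm = ExplainableBoostingClassifier(max_leaves=max_leaves)
    scores = cross_val_score(ebm, X, y, cv=3)
    print(f"max_leaves={max_leaves}: mean accuracy {scores.mean():.3f}")
```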

alvanli commented 4 weeks ago

Thank you for the always very detailed explanations! I learned a lot reading these tickets 🤗