microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[question] Incremental learning: drop old trees #4455

Closed. memeplex closed this issue 3 years ago.

memeplex commented 3 years ago

I'm migrating a predictive system from SGD to GB and I would like to keep the incremental approach. Each period a new tree will be added to the ensemble. So far so good, but:

  1. Is it possible to scale down the effective learning rate eta of the existing trees by a decay factor (1 - alpha), i.e. to update the ensemble as (1 - alpha) T + t, where T is the current ensemble, t is the tree for the new period, and (1 - alpha) is the rate of exponential decay? Note that this is not the same as an exponentially decaying learning rate schedule; on the contrary, I want the previous (old) learning rates to decrease (see the worked recursion after this list).
  2. At some threshold I would like to remove old trees from the ensemble. If 1 above is possible, at some point (1 - alpha)^k eta will be very small, and I would prefer to remove the tree altogether in order to reduce computational cost, in both time and space.
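To spell out the update in item 1: writing $E_k$ for the ensemble after period $k$ and $t_k$ for the tree added in period $k$,

$$E_k = (1 - \alpha)\,E_{k-1} + \eta\, t_k \;=\; \sum_{j=1}^{k} (1 - \alpha)^{k-j}\,\eta\, t_j \qquad (E_0 = 0),$$

so after period $k$ the tree from period $j$ carries an effective weight of $(1 - \alpha)^{k-j}\eta$, which is the quantity I would like to truncate once it becomes negligible (item 2).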
memeplex commented 3 years ago

Not that I like it, but I tried this:

import re
import lightgbm as lgb

# Rescale the recorded per-tree shrinkage by 0.9 and rebuild the booster
# from the modified model string.
lgb.Booster(
    model_str=re.sub(
        "shrinkage=(.*)",
        lambda m: f"shrinkage={float(m.group(1)) * 0.9}",
        booster.model_to_string(start_iteration=1),
    )
)

and it fails (in a bad way: it kills the Jupyter kernel), and I'm not sure why; the output looks pretty sane to me:

tree
version=v3
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=1
objective=regression
feature_names=Column_0 Column_1
feature_infos=[0.0013146511377757353:0.9847186243110676] [0.00083770559309448434:0.99691587076445098]
tree_sizes=465 465 465 466 387 465 466

Tree=0
num_leaves=4
num_cat=0
split_feature=1 1 0
split_gain=451.042 92.2698 52.993
threshold=0.48013677797803911 0.63652721351171337 0.54554768819441068
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=0.36693769931793213 0.71684418394452054 0.97475202560424801 0.56657524794340131
leaf_weight=24 21 31 24
leaf_count=24 21 31 24
internal_value=0 0.870597 0.466756
internal_weight=0 52 48
internal_count=100 52 48
is_linear=0
shrinkage=0.0855

...

If I just do lambda m: f"shrinkage={float(m.group(1)) * 1}", it works, so it's not the replacement per se.

Also, maybe the shrinkage is already applied to the leaf values, so even if the substitution worked, the above would amount to nothing.

shiyu1994 commented 3 years ago

@memeplex Thanks for using LightGBM. Unfortunately, the shrinkage in the model file is kept only for the record; it has no effect when the model is loaded back from the file. The leaf_value entries are exactly the values of the leaves in the tree, with no adjustment applied when loading from file. So if you do want to change the shrinkage rate, you have to modify the leaf_value items.
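For illustration, a minimal sketch of that kind of leaf_value edit (the helper name is made up here, and it assumes each tree stores its leaves on a single space-separated leaf_value= line, as in the dump above):

```python
import re
import lightgbm as lgb

def scale_leaf_values(model_str, factor):
    """Rescale every leaf_value line of a LightGBM text model by `factor`."""
    def _scale(match):
        scaled = (float(v) * factor for v in match.group(1).split())
        return "leaf_value=" + " ".join(repr(v) for v in scaled)
    return re.sub(r"^leaf_value=(.*)$", _scale, model_str, flags=re.MULTILINE)

# e.g. decayed = lgb.Booster(model_str=scale_leaf_values(booster.model_to_string(), 0.9))
```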

I think what you need is to dynamically adjust the learning_rate during boosting. In that case, writing a customized objective function may help. Since the leaf values are ratios of the sum of gradients to the sum of Hessians, you can scale the gradients and Hessians in the customized objective so that the leaf values change as you expect. (For example, if you want to shrink the learning rate by 0.9 in some iteration, you can scale the gradients of the original objective function by 0.9 in your customized version of the objective function.)
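A minimal sketch of that suggestion for an L2 regression objective (the decay schedule, the names, and the way the iteration counter is tracked are all illustrative, not part of the LightGBM API):

```python
import numpy as np
import lightgbm as lgb

DECAY = 0.9           # illustrative per-iteration decay factor
state = {"iter": 0}   # the objective callback gets no iteration index, so track it here

def decayed_l2(preds, train_data):
    """L2 objective whose gradients are scaled by DECAY ** iteration.

    Leaf values are roughly -sum(grad) / sum(hess), so scaling the gradients
    scales the leaf values (and hence the effective learning rate) of the tree
    built in that iteration. Assumes the callback runs once per boosting iteration.
    """
    y = train_data.get_label()
    scale = DECAY ** state["iter"]
    state["iter"] += 1
    grad = scale * (preds - y)
    hess = np.ones_like(preds)
    return grad, hess

# Depending on the LightGBM version, the callable is passed either as
# lgb.train(..., fobj=decayed_l2) or via params["objective"] = decayed_l2.
```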

As for dropping trees, does start_iteration help? Do you want to drop trees from the model for inference, or during training?

memeplex commented 3 years ago

Thank you for your answer.

> I think what you need is to dynamically adjust the learning_rate during boosting

The problem with this approach is that when you're learning online you need to use, say, the estimator E(s) at time s, and later the estimator E(t) at time t > s, which is built from a modified version of E(s), say E'(s) (for example, applying decay for the elapsed interval (s, t)). It's not possible to do this in advance AFAICS because I need both E(s) and E'(s), and the fact that the leaf values are final seems to preclude the possibility of producing E'(s) from E(s) afterwards.

> As for dropping trees, does start_iteration help?

Yes, this part is easy. But only rectangular windows can be achieved this way.
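
For reference, the rectangular window at prediction time looks something like this (booster, X, k, and w are placeholders):

```python
# Drop the first k trees at prediction time (a rectangular window).
preds = booster.predict(X, start_iteration=k)

# Or keep only the last w trees; with num_tree_per_iteration=1 (as in the dump
# above), one boosting iteration corresponds to one tree.
preds = booster.predict(X, start_iteration=booster.num_trees() - w)
```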

memeplex commented 3 years ago

And you probably want to set boost_from_average=False so that the first tree has the same shrinkage as the others.

shiyu1994 commented 3 years ago

@memeplex I think we can divide it into two cases:

  1. We want to decay the learning rate during boosting. In this case, a customized objective function can help.
  2. We want to adjust the learning rate of previous trees after those trees have been trained. In that case, we can only modify the model through the model file (changing leaf_value), and then load the model into memory again to continue training. In the second case, both the original ensemble and the modified (decayed) ensemble are available.
memeplex commented 3 years ago

Yes, my question is about case 2, and I was expecting something like that. Sadly, the JSON output doesn't work as an input, since it would be easier to modify. Anyway, thanks!

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.