less penalty if the tree splits samples more randomly from different groups.

chloe-wang commented 3 years ago

I grouped the samples into several groups manually. I would like to force the tree to split more randomly with respect to those groups.

For example, if I have 100 samples in 2 groups, I would prefer the tree to split them into two leaves that each hold a roughly equal number of samples from both groups: 25 (group 1) + 25 (group 2) in both the left leaf and the right leaf.

jameslamb commented 3 years ago

@chloe-wang , thanks for using LightGBM!

When LightGBM builds trees, it chooses splits based on the estimated gain (improvement in the training loss) for those splits. Typically that is not tightly related to "number of samples".

You can read more about this process at https://lightgbm.readthedocs.io/en/latest/Features.html#leaf-wise-best-first-tree-growth, or in XGBoost's excellent tutorial on gradient boosting with trees: https://xgboost.readthedocs.io/en/latest/tutorials/model.html.
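
To make the gain-based choice concrete, here is a minimal sketch of the second-order split gain described in the XGBoost tutorial linked above. It illustrates the idea under assumptions and is not LightGBM's actual implementation: the function name `split_gain`, the parameters `reg_lambda` and `gamma`, and the toy data (squared-error loss, a starting prediction of 0.5, and targets that differ by group) are all hypothetical.

```python
def split_gain(grad_left, hess_left, grad_right, hess_right,
               reg_lambda=0.0, gamma=0.0):
    """Second-order gain of splitting one node into a left and a right leaf.

    Follows the formula in the XGBoost tutorial:
    0.5 * [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - G^2/(H+lambda)] - gamma
    """
    def leaf_score(g, h):
        return g * g / (h + reg_lambda)

    g_l, h_l = sum(grad_left), sum(hess_left)
    g_r, h_r = sum(grad_right), sum(hess_right)
    children = leaf_score(g_l, h_l) + leaf_score(g_r, h_r)
    parent = leaf_score(g_l + g_r, h_l + h_r)
    return 0.5 * (children - parent) - gamma


# Hypothetical toy data: 50 samples per group, squared-error loss,
# current prediction 0.5 for every sample.
# Group 1 has target 0.0, so gradient = prediction - target = 0.5.
# Group 2 has target 1.0, so gradient = prediction - target = -0.5.
grad_group1 = [0.5] * 50
grad_group2 = [-0.5] * 50
hess_leaf = [1.0] * 50  # the Hessian of squared error is 1 per sample

# Split A: each leaf holds exactly one group.
separated_gain = split_gain(grad_group1, hess_leaf, grad_group2, hess_leaf)

# Split B: each leaf holds 25 samples from each group.
mixed_left = grad_group1[:25] + grad_group2[:25]
mixed_right = grad_group1[25:] + grad_group2[25:]
mixed_gain = split_gain(mixed_left, hess_leaf, mixed_right, hess_leaf)

print(separated_gain, mixed_gain)
```

On this toy data the group-separating split scores a gain of 12.5, while the evenly mixed 25 + 25 split scores 0. A gain-driven learner therefore has no reason to pick the mixed split whenever the groups differ in their targets.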

If you are doing binary classification (trying to train a model to predict whether a sample is in group 1 or group 2), then the exact opposite of what you're asking for is desirable: LightGBM will try to create splits where leaf nodes have samples that are mostly in group 1 or mostly in group 2.

If you are working on a regression task and you mean that you want a roughly equal distribution of some categorical feature on either side of a split, I suppose it's possible to achieve that by writing a custom objective function. See the note about custom objective functions at https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html#lightgbm-lgbmregressor and the example at https://github.com/microsoft/LightGBM/blob/926526c838196a7497a85b6b8cf07657a88b69e6/examples/python-guide/advanced_example.py#L138 for more details.
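
For reference, here is a minimal sketch of what a custom objective looks like for the scikit-learn interface, where the callable receives `(y_true, y_pred)` and returns the gradient and Hessian of the loss with respect to the predictions. It only reproduces plain squared error to show the mechanics. It does not implement any group balancing, which would need a loss term designed for that purpose. The function name `squared_error_objective` and the sample values are hypothetical.

```python
import numpy as np


def squared_error_objective(y_true, y_pred):
    """Gradient and Hessian of 0.5 * (y_pred - y_true)^2 per sample."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    grad = y_pred - y_true       # first derivative w.r.t. y_pred
    hess = np.ones_like(y_pred)  # second derivative is constant 1
    return grad, hess


# Quick check on hypothetical values, independent of LightGBM.
grad, hess = squared_error_objective([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(grad, hess)
```

With LightGBM installed, a function like this would be passed as `lightgbm.LGBMRegressor(objective=squared_error_objective)`. The linked `advanced_example.py` uses the `lgb.train()` interface instead, where the custom objective takes `(preds, train_data)`.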

If you have follow-up questions, please provide the following information so we can give you a more useful answer:

- what version of LightGBM are you working with?
- what programming language and/or framework are you using?
- what type of machine learning task are you working on? (e.g. regression, binary classification, multiclass classification)

no-response[bot] commented 3 years ago

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
