microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[R-package] [python-package] More convenient API for interaction_constraints #6376

Open mayer79 opened 3 months ago

mayer79 commented 3 months ago

Interaction constraints are extremely useful in practice. The API in LightGBM has two aspects that could be improved:

  1. Currently, the Python package allows only integer positions like [[0], [1, 2]]. We could also allow feature names, as in XGBoost or in LightGBM's own R-package.
  2. Features not appearing in any interaction group are silently dropped from training altogether. This differs from Scikit-Learn and XGBoost, where missing features can still interact (i.e., they form their own interaction group). This applies to both the Python and R APIs. The example above could then be entered as [[0]], or as [["living_area"]].

Both changes would help reduce user mistakes and align the API with XGBoost (and partly with Scikit-Learn's histogram gradient boosting implementation).
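
For illustration, a minimal sketch of how the two points would look from the user's side. Apart from "living_area", the feature names and the regression objective are made up for this example:

# today: only integer positions are accepted by the Python package
params = {
    "objective": "regression",
    "interaction_constraints": [[0], [1, 2]],
}

# point 1: also accept feature names, as XGBoost and the R-package already do
params_by_name = {
    "objective": "regression",
    "interaction_constraints": [["living_area"], ["lot_size", "year_built"]],
}

# point 2: features omitted from every group (e.g. a hypothetical "garage_cars")
# would form their own interaction group instead of being silently dropped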

I can make two PRs: one for the R-package, and one for the Python package.

@jameslamb @btrotta

mayer79 commented 3 months ago

@jameslamb could you also add a Python tag? On the Python side, we would need two things: translating feature names to integer positions, and adding the set of skipped features as an additional group.

mayer79 commented 3 months ago

Question for @jameslamb and others: in Python, the interaction constraints are transformed into a string here, just like any other list or list-like object:

https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py

def _param_dict_to_str(data: Optional[Dict[str, Any]]) -> str:
    """Convert Python dictionary to string, which is passed to C API."""
    if data is None or not data:
        return ""
    pairs = []
    for key, val in data.items():
        if isinstance(val, (list, tuple, set)) or _is_numpy_1d_array(val):
            pairs.append(f"{key}={','.join(map(_to_string, val))}")
        elif isinstance(val, (str, Path, _NUMERIC_TYPES)) or _is_numeric(val):
            pairs.append(f"{key}={val}")
        elif val is not None:
            raise TypeError(f"Unknown type of parameter:{key}, got:{type(val).__name__}")
    return " ".join(pairs)

Is it an option to add a snippet like this?

if key == "interaction_constraints":
  val = _prep_interaction_constraints(vals, feature_names)

_prep_interaction_constraints() would do two things:

  1. Translate feature names to integer positions (if given as strings)
  2. Add an additional group with the omitted features

Both steps need access to the list of feature names, so we would have to pass either a feature_names argument (defaulting to None) or the Dataset object to _param_dict_to_str(). I'm not sure if this is the right way to proceed?
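
A rough sketch of what I have in mind; the signature, and the choice to collect all omitted features into one extra group, are just an illustration, not a final implementation:

from typing import List, Sequence, Union

def _prep_interaction_constraints(
    constraints: List[List[Union[int, str]]],
    feature_names: Sequence[str],
) -> List[List[int]]:
    """Resolve feature names to positions and append omitted features as an extra group."""
    name_to_pos = {name: i for i, name in enumerate(feature_names)}
    # 1. translate feature names to integer positions (integers pass through unchanged)
    resolved = [
        [name_to_pos[f] if isinstance(f, str) else int(f) for f in group]
        for group in constraints
    ]
    # 2. collect every feature not mentioned in any group into one additional group,
    #    so it is no longer silently dropped from training
    mentioned = {pos for group in resolved for pos in group}
    omitted = [pos for pos in range(len(feature_names)) if pos not in mentioned]
    if omitted:
        resolved.append(omitted)
    return resolved

For example, with feature_names = ["living_area", "lot_size", "year_built"], the input [["living_area"]] becomes [[0], [1, 2]].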

jameslamb commented 3 months ago

could you also add a Python tag?

We don't actually have a python label but probably should. I've at least edited the issue title for now.

jameslamb commented 3 months ago

hmmmm @btrotta @guolinke if interaction_constraints is provided, should features that do not appear in any group be excluded from training?

At first I agreed with @mayer79's assessment that they should not be excluded, but seeing this docstring makes me think they're excluded intentionally:

https://github.com/microsoft/LightGBM/blob/28536a0a5d0c46598fbc930a63cec72399a969e4/include/LightGBM/config.h#L575

jameslamb commented 3 months ago

Is it an option to add a snippet like this?

I'd prefer that that be done outside of _param_dict_to_str(), similar to how categorical_feature is resolved:

https://github.com/microsoft/LightGBM/blob/28536a0a5d0c46598fbc930a63cec72399a969e4/python-package/lightgbm/basic.py#L2110-L2130

I'd support it being a private function, so it can be shared between cv() and train().
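
Something like this, just to show the call pattern with the _prep_interaction_constraints() sketched above (the names and the Dataset attribute here are placeholders, not a concrete proposal):

def train(params, train_set, num_boost_round=100):
    params = dict(params)  # don't mutate the caller's dict
    if "interaction_constraints" in params:
        params["interaction_constraints"] = _prep_interaction_constraints(
            params["interaction_constraints"],
            feature_names=train_set.feature_name,  # placeholder accessor
        )
    # ... continue with the existing training logic; cv() would call the same helper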

jameslamb commented 3 months ago

I'm generally supportive of everything you've proposed at the top of this issue, thanks for working on it and for adding that context about consistency with XGBoost and scikit-learn!

Let's just please wait to hear from @btrotta or @guolinke about whether omitting features not mentioned in interaction_constraints was intentional; there might be relevant context in #2884 and #3130.

guolinke commented 3 months ago

Sorry, I don't remember 🤣. I think it was not intentional. We can align with XGBoost.

jameslamb commented 3 months ago

😂 ok thank you, I don't remember either. Agreed then, let's move forward @mayer79

mayer79 commented 3 months ago

Is it an option to add a snippet like this?

I'd prefer that that be done outside of _param_dict_to_str(), similar to how categorical_feature is resolved:

https://github.com/microsoft/LightGBM/blob/28536a0a5d0c46598fbc930a63cec72399a969e4/python-package/lightgbm/basic.py#L2110-L2130

I'd support it being a private function, so it can be shared between cv() and train().

Thanks for the hint. I will draft the function soon, and then we can determine where to place it :-).