microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Random assignment for a subset of variables in predict. #5177

Open jaguerrerod opened 2 years ago

jaguerrerod commented 2 years ago

Summary

Add random assignment for a subset of variables in predict.

Motivation

Random assignment for a variable V in a tree is a way to drop the effect of that variable from the model. The procedure is: at every node that splits on V, assign the observation to a child at random; at all other nodes, descend as usual. Random assignment for a subset of variables W = {V1, V2, ..., Vi} is analogous: the assignment is random at every node that splits on any variable in W, and as usual everywhere else.
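To make the rule concrete, here is a minimal standalone sketch (toy types, not LightGBM code; ToyNode and PredictWithRandomAssignment are invented for illustration):

    #include <cstdio>
    #include <random>
    #include <unordered_set>
    #include <vector>

    // Toy node layout: feature == -1 marks a leaf that stores `value`.
    struct ToyNode {
      int feature;       // splitting feature; -1 for a leaf
      double threshold;  // go left when x[feature] <= threshold
      int left;          // index of the left child
      int right;         // index of the right child
      double value;      // prediction stored at a leaf
    };

    // Descend the tree; at any node whose split feature belongs to the
    // perturbed set W, pick a branch uniformly at random instead of
    // looking at the data. All other nodes split as usual.
    double PredictWithRandomAssignment(const std::vector<ToyNode>& tree,
                                       const std::vector<double>& x,
                                       const std::unordered_set<int>& W,
                                       std::mt19937* rng) {
      std::bernoulli_distribution coin(0.5);
      int node = 0;
      while (tree[node].feature >= 0) {
        bool go_left = (W.count(tree[node].feature) > 0)
                           ? coin(*rng)  // random assignment for perturbed splits
                           : (x[tree[node].feature] <= tree[node].threshold);
        node = go_left ? tree[node].left : tree[node].right;
      }
      return tree[node].value;
    }

    int main() {
      // A stump: x0 <= 0.5 -> leaf 1.0, else leaf 2.0.
      std::vector<ToyNode> tree = {{0, 0.5, 1, 2, 0.0},
                                   {-1, 0.0, -1, -1, 1.0},
                                   {-1, 0.0, -1, -1, 2.0}};
      std::mt19937 rng(42);
      std::vector<double> x = {0.9};
      // With feature 0 perturbed, each call returns 1.0 or 2.0 with prob. 1/2.
      std::printf("%f\n", PredictWithRandomAssignment(tree, x, {0}, &rng));
      return 0;
    }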

This is useful for computing honest variable importance over a new test set, as the difference between the prediction error with and without the perturbation. The same method can be used to compute the interaction of two subsets of variables V and W, defined as the difference between the paired importance (the importance when both V and W are assigned randomly) and the added importance (the sum of the importances when V and W are each assigned randomly on their own).
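In symbols (my notation, not from the paper): writing $\mathrm{Err}(f_S)$ for the prediction error on the test set when the variables in $S$ are randomly assigned, and $\mathrm{Err}(f)$ for the unperturbed error,

$$\mathrm{Imp}(S) = \mathrm{Err}(f_S) - \mathrm{Err}(f), \qquad \mathrm{Int}(V, W) = \mathrm{Imp}(V \cup W) - \bigl(\mathrm{Imp}(V) + \mathrm{Imp}(W)\bigr).$$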

References

The method is referenced in: Hemant Ishwaran. "Variable importance in binary regression trees and forests." Electron. J. Statist. 1 519 - 537, 2007. https://doi.org/10.1214/07-EJS039

suncanghuai commented 1 year ago

Call Path For Prediction

I have organized most of the call paths for LightGBM's prediction functions to figure out how to implement this feature. There may be some trivial mistakes because I'm new to LightGBM's source code, but these shouldn't be a problem :smirk: LightGBM_Predict_call_stack

Thoughts on implementation

It seems to me that this feature amounts to adding one or more input parameters to the APIs in the External Interface Layer and passing them down through a call path to control prediction behavior in the Model Layer. To add the input parameters, I think we could add two variables to the Config class, which serves as an expandable cross-language parameter set: one variable representing the type of prediction strategy, and another holding the strategy's control parameters, such as the "subset of variables" mentioned in this issue. We would also need to do similar work in the Predictor class.

  // this piece of code in src/c_api.cpp shows how we pass an expandable cross-language parameter set from Python/R to C/C++
  API_BEGIN();
  auto param = Config::Str2Map(parameter);
  Config config;
  config.Set(param);
  OMP_SET_NUM_THREADS(config.num_threads);
  Booster* ref_booster = reinterpret_cast<Booster*>(handle);
  ref_booster->Predict(start_iteration, num_iteration, predict_type, data_filename, data_has_header,
                       config, result_filename);
  API_END();
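For example, the new controls could ride on this same mechanism. In the snippet below, predict_strategy and random_assign_features are placeholder names for illustration, not existing LightGBM parameters:

    // Hypothetical parameter names, for illustration only; not part of the
    // current Config.
    auto param = Config::Str2Map(
        "predict_strategy=random_assignment random_assign_features=0,3,7");
    Config config;
    config.Set(param);
    // config would then carry the strategy type and the perturbed feature
    // subset down through Booster::Predict to the Model Layer.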

As to controlling prediction behavior in the Model Layer, I think there are two options:

  1. add a new function, analogous to predict/predictRaw/predictContrib, to Class Boosting and Class Tree
  2. introduce a new class named PredictStrategy (name to be determined), which is a wrapper class for Tree::Predict. We create a PredictStrategy object in the Booster Layer and pass it down to Class Tree:
    inline int Tree::GetLeafByMap(const std::unordered_map<int, double>& feature_values) const {
      int node = 0;
      // the strategy object supplies the decision functions used at every split
      auto decision_func = pred_strategy.getDecisionFunc(this);
      auto num_decision_func = pred_strategy.getNumericalDecisionFunc(this);
      if (num_cat_ > 0) {
        while (node >= 0) {
          node = decision_func(feature_values.count(split_feature_[node]) > 0 ? feature_values.at(split_feature_[node]) : 0.0f, node);
        }
      } else {
        while (node >= 0) {
          node = num_decision_func(feature_values.count(split_feature_[node]) > 0 ? feature_values.at(split_feature_[node]) : 0.0f, node);
        }
      }
      return ~node;
    }

    Considering that we may get similar feature requests in the future, choice 2 is the more scalable option. But choice 2 is also an obviously more complex modification that could introduce potential defects. In contrast, choice 1 is easier to understand and less coupled with other parts of the code. Stacking more similar functions is not an elegant design, but it is also not a problem. So I think choice 1 is the better option.
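For choice 1, the new entry point could sit next to GetLeafByMap shown above. A rough sketch, assuming Tree's existing split_feature_/left_child_/right_child_ members and Decision helper; the method name, the perturbed/rng parameters, and the exact signature are hypothetical:

    // Hypothetical addition to Class Tree (sketch only): same traversal as
    // GetLeafByMap, but splits on features in `perturbed` go to a random child.
    inline int Tree::GetLeafByMapRandomAssign(
        const std::unordered_map<int, double>& feature_values,
        const std::unordered_set<int>& perturbed, std::mt19937* rng) const {
      std::bernoulli_distribution coin(0.5);
      int node = 0;
      while (node >= 0) {
        if (perturbed.count(split_feature_[node]) > 0) {
          // random assignment: ignore the feature value at this split
          node = coin(*rng) ? left_child_[node] : right_child_[node];
        } else {
          double fval = feature_values.count(split_feature_[node]) > 0
                            ? feature_values.at(split_feature_[node]) : 0.0f;
          node = Decision(fval, node);
        }
      }
      return ~node;
    }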

It seems no one is working on this issue. I'd like to work on it and already have a preliminary scheme. May I have this issue assigned to me and get some suggestions on my preliminary scheme? @jameslamb

guolinke commented 1 year ago

Hi @suncanghuai , thank you very much for the detailed plan and careful thought! The scheme looks good to me. I think option 1 is better for now; option 2 could be left for the future, if there are indeed more similar functions.

Please go ahead, and ping us if you have any further questions.