Open jaguerrerod opened 2 years ago
I have traced most of the call paths for LightGBM's prediction functions to figure out how to implement this feature. There may be some trivial mistakes because I'm new to LightGBM's source code, but these shouldn't be problems :smirk:
It seems to me that this feature amounts to adding one or more input parameters to the APIs in the external interface layer and passing them down through the call path to control prediction behavior in the model layer. To add the new input parameters, I think we can add two variables to the Config class, which already works as an expandable cross-language parameter set: one variable for the type of prediction strategy, and another holding the control parameters of that strategy (the "subset of variables" mentioned in this issue). We would also need to do similar work on the Predictor class. A sketch of possible Config fields follows the snippet below.
// this piece of code in c_api.cpp shows the way we pass an expandable cross-language parameter set from Python/R to C/C++
API_BEGIN();
auto param = Config::Str2Map(parameter);
Config config;
config.Set(param);
OMP_SET_NUM_THREADS(config.num_threads);
Booster* ref_booster = reinterpret_cast<Booster*>(handle);
ref_booster->Predict(start_iteration, num_iteration, predict_type, data_filename, data_has_header,
                     config, result_filename);
API_END();
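To illustrate the Config idea above, the new fields could look roughly like this (a sketch only; the parameter names below are placeholders I made up, not existing LightGBM parameters):

// hypothetical additions to Config (sketch only, names are placeholders)
// which prediction strategy to use, e.g. "default" or "random_assignment"
std::string prediction_strategy = "default";
// indices of the features to assign randomly at predict time
std::vector<int> random_assignment_features;

A caller would then pass something like prediction_strategy=random_assignment and random_assignment_features=3,7,12 in the parameter string that reaches Config::Str2Map above.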
As to controlling prediction behavior in the model layer, I think there is a design choice to make here:
inline int Tree::GetLeafByMap(const std::unordered_map<int, double>& feature_values) const {
  int node = 0;
  // the prediction strategy supplies the decision functions used to traverse the tree
  auto decision_func = pred_strategy.getDecisionFunc(this);
  auto num_decision_func = pred_strategy.getNumericalDecisionFunc(this);
  if (num_cat_ > 0) {
    // tree contains categorical splits: use the general decision function
    while (node >= 0) {
      node = decision_func(feature_values.count(split_feature_[node]) > 0 ?
                               feature_values.at(split_feature_[node]) : 0.0f, node);
    }
  } else {
    // purely numerical splits
    while (node >= 0) {
      node = num_decision_func(feature_values.count(split_feature_[node]) > 0 ?
                                   feature_values.at(split_feature_[node]) : 0.0f, node);
    }
  }
  return ~node;
}
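To make this concrete, a "random assignment" strategy could supply a numerical decision function along these lines. This is a rough sketch with made-up names; the accessors and helpers used here may differ from the actual Tree interface, and the real patch would need to expose Tree's private decision logic to the strategy (e.g. via friendship):

#include <functional>
#include <random>
#include <unordered_set>
#include <utility>
#include <LightGBM/tree.h>

// sketch only: a strategy that sends observations left/right at random
// whenever a split uses one of the selected features
class RandomAssignmentPredStrategy {
 public:
  explicit RandomAssignmentPredStrategy(std::unordered_set<int> features)
      : features_(std::move(features)), rng_(std::random_device{}()) {}

  // returns a callable with the same (fval, node) signature used in GetLeafByMap above
  std::function<int(double, int)> getNumericalDecisionFunc(const LightGBM::Tree* tree) {
    return [this, tree](double fval, int node) {
      if (features_.count(tree->split_feature(node)) > 0) {
        // split uses a perturbed feature: choose a child at random
        return coin_(rng_) ? tree->left_child(node) : tree->right_child(node);
      }
      // otherwise defer to the usual numerical decision logic
      // (Tree::NumericalDecision is currently private, so it would have to be exposed)
      return tree->NumericalDecision(fval, node);
    };
  }

 private:
  std::unordered_set<int> features_;  // feature indices to assign randomly
  std::mt19937 rng_;
  std::bernoulli_distribution coin_{0.5};
};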
Considering that we may get similar feature requests in the future, choice 2 is the more scalable option. But choice 2 is obviously a more complex modification that could introduce potential defects. In contrast, choice 1 is easier to understand and less coupled with other parts of the design. Although stacking up more similar functions is not an elegant design, it is not really a problem either. So I think choice 1 is the better option.
It seems no one is working on this issue yet. I'd like to work on it and already have a preliminary scheme. May I have this issue assigned to me and get some suggestions on my preliminary scheme? @jameslamb
Hi @suncanghuai, thank you very much for the detailed plan and careful thought! The scheme looks good to me. I think option 1 is better for now; option 2 can be left for the future if there are indeed more similar functions.
Please go ahead, and ping us if you have any further questions.
Summary
Add random assignment for a subset of variables in predict.
Motivation
Random assignment for a variable V in a tree is a way to drop the effect of that variable from the model. The procedure is: in every node that uses V for splitting, assign the observation to a child at random; in all other nodes, proceed as usual. Random assignment of a subset of variables W = {V1, V2, ..., Vi} is analogous: the assignment is random in nodes whose split uses any variable in W, and as usual in the rest.
This is useful for computing honest variable importance over a new test set, as the difference between the prediction error with and without the perturbation. The method can also be used to compute the interaction between two subsets of variables V and W, defined as the difference between the paired importance (the importance when both V and W are assigned randomly) and the added importance (the sum of the importances when V and W are each assigned randomly on their own). The formulas below spell this out.
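In my notation (Err is the prediction error of the fitted model on the new test set, and "W random" means predicting with random assignment for the subset W):

Imp(W)    = Err(W random) - Err(unperturbed)
Int(V, W) = Imp(V ∪ W) - (Imp(V) + Imp(W))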
References
The method is referenced in: Hemant Ishwaran (2007). "Variable importance in binary regression trees and forests." Electronic Journal of Statistics, 1, 519–537. https://doi.org/10.1214/07-EJS039