Angela Bartz commented: FYI, the lambda_bias parameter is in the updated XGBoost topic in the User Guide.
JIRA Issue Migration Info
Jira Issue: PUBDEV-4795
Assignee: Erin LeDell
Reporter: Erin LeDell
State: Closed
Fix Version: 3.16.0.1
Attachments: N/A
Development PRs: Available
Linked PRs from JIRA
We may still be missing a few XGBoost parameters and should also add them to the User Guide. The current list:
Here are some [general parameters|https://xgboost.readthedocs.io/en/latest//parameter.html#general-parameters] that are missing, but we could ignore all of these since they are not relevant (or perhaps just implement "silent"):
[Tree Booster parameters|https://xgboost.readthedocs.io/en/latest//parameter.html#parameters-for-tree-booster] that are missing:
sketch_eps, [default=0.03] Only used by the approximate greedy algorithm. This roughly translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy. Usually the user does not have to tune it, but consider setting it to a lower number for more accurate enumeration. range: (0, 1)
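For reference, a minimal sketch of setting sketch_eps through the native XGBoost Python package (not the H2O wrapper); the synthetic data and parameter value are illustrative, and sketch_eps only takes effect with the approximate tree method as documented at the time of this issue:

```python
import numpy as np
import xgboost as xgb

# Synthetic data just for illustration.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "tree_method": "approx",  # sketch_eps only applies to the approximate algorithm
    "sketch_eps": 0.01,       # roughly O(1 / 0.01) = 100 bins; lower = more accurate, slower
}
bst = xgb.train(params, dtrain, num_boost_round=10)
```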
scale_pos_weight, [default=1] Controls the balance of positive and negative weights; useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases). See Parameters Tuning for more discussion. Also see the Higgs Kaggle competition demo for examples: R, py1, py2, py3
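A small sketch of that heuristic, again against the native Python package with made-up imbalanced data:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(2000, 5)
y = np.random.binomial(1, 0.1, size=2000)  # roughly 10% positives

# Suggested value: sum(negative cases) / sum(positive cases)
ratio = float(np.sum(y == 0)) / np.sum(y == 1)

params = {"objective": "binary:logistic", "scale_pos_weight": ratio}
bst = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=10)
```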
updater, [default=’grow_colmaker,prune’] A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees. This is an advanced parameter that is usually set automatically depending on other parameters, but it can also be set explicitly by the user. The following updater plugins exist:
- ‘grow_colmaker’: non-distributed column-based construction of trees.
- ‘distcol’: distributed tree construction with column-based data splitting mode.
- ‘grow_histmaker’: distributed tree construction with row-based data splitting based on a global proposal of histogram counting.
- ‘grow_local_histmaker’: based on local histogram counting.
- ‘grow_skmaker’: uses the approximate sketching algorithm.
- ‘sync’: synchronizes trees on all distributed nodes.
- ‘refresh’: refreshes the tree’s statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
- ‘prune’: prunes the splits where loss < min_split_loss (or gamma).
In a distributed setting, the implicit updater sequence is adjusted as follows:
- ‘grow_histmaker,prune’ when dsplit=’row’ (or default) and prob_buffer_row == 1 (or default), or when the data has multiple sparse pages
- ‘grow_histmaker,refresh,prune’ when dsplit=’row’ and prob_buffer_row < 1
- ‘distcol’ when dsplit=’col’
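A hedged sketch of overriding the updater sequence explicitly in the native Python package; the choice of updaters and the min_split_loss value are illustrative only, using plugin names as documented at the time of this issue:

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(500, 4), np.random.rand(500)

params = {
    "objective": "reg:linear",           # objective name as of this XGBoost version
    "updater": "grow_histmaker,prune",   # histogram-based growth, then pruning
    "min_split_loss": 0.1,               # 'prune' removes splits with loss < this (gamma)
}
bst = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=5)
```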
refresh_leaf, [default=1] This is a parameter of the ‘refresh’ updater plugin. When this flag is true, tree leaves as well as tree nodes’ stats are updated. When it is false, only node stats are updated.
process_type, [default=’default’] The type of boosting process to run. Choices: {‘default’, ‘update’}. ‘default’: the normal boosting process, which creates new trees. ‘update’: starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, the specified sequence of updater plugins is run for that tree, and a modified tree is added to the new model. The new model will have either the same or a smaller number of trees, depending on the number of boosting iterations performed. Currently, the following built-in updater plugins can be meaningfully used with this process type: ‘refresh’, ‘prune’. With ‘update’, one cannot use updater plugins that create new trees.
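A sketch of the ‘update’ process type together with the ‘refresh’ updater and refresh_leaf, using the native Python package with synthetic data:

```python
import numpy as np
import xgboost as xgb

# Train an initial model with 10 trees.
X, y = np.random.rand(500, 4), np.random.rand(500)
bst = xgb.train({"objective": "reg:linear"},
                xgb.DMatrix(X, label=y), num_boost_round=10)

# Re-run the 'refresh' updater over the 10 existing trees on new data;
# no new trees are grown, only stats/leaf values are updated.
X2, y2 = np.random.rand(500, 4), np.random.rand(500)
refreshed = xgb.train(
    {"objective": "reg:linear",
     "process_type": "update",
     "updater": "refresh",
     "refresh_leaf": 1},       # update leaf values as well as node stats
    xgb.DMatrix(X2, label=y2),
    num_boost_round=10,        # should not exceed the trees in the initial model
    xgb_model=bst,             # the existing model to start from
)
```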
predictor, [default=’cpu_predictor’] The type of predictor algorithm to use. Provides the same results but allows the use of GPU or CPU. ‘cpu_predictor’: multicore CPU prediction algorithm. ‘gpu_predictor’: prediction using GPU; the default for the ‘gpu_exact’ and ‘gpu_hist’ tree methods.
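For illustration, the predictor is just another entry in the parameter map; ‘gpu_predictor’ assumes a GPU-enabled XGBoost build:

```python
# Either value should yield the same predictions, per the docs above.
params = {
    "tree_method": "gpu_hist",
    "predictor": "gpu_predictor",  # or "cpu_predictor"
}
```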
[Parameters for Linear Booster|https://xgboost.readthedocs.io/en/latest//parameter.html#parameters-for-linear-booster]:
[Learning Task Parameters|https://xgboost.readthedocs.io/en/latest//parameter.html#learning-task-parameters]
objective [default=reg:linear]
- “reg:linear” – linear regression
- “reg:logistic” – logistic regression
- “binary:logistic” – logistic regression for binary classification, outputs probability
- “binary:logitraw” – logistic regression for binary classification, outputs the score before the logistic transformation
- “count:poisson” – Poisson regression for count data, outputs the mean of the Poisson distribution. max_delta_step is set to 0.7 by default in Poisson regression (used to safeguard optimization)
- “multi:softmax” – set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (number of classes)
- “multi:softprob” – same as softmax, but outputs a vector of ndata * nclass, which can be further reshaped to an ndata x nclass matrix. The result contains the predicted probability of each data point belonging to each class.
- “rank:pairwise” – set XGBoost to do a ranking task by minimizing the pairwise loss
- “reg:gamma” – gamma regression with log-link. Output is the mean of the gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed
- “reg:tweedie” – Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
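A sketch of the multi:softprob case, showing the num_class requirement and the ndata x nclass reshape (synthetic data; depending on the XGBoost version, predict may already return the 2-D shape):

```python
import numpy as np
import xgboost as xgb

num_class = 3
X = np.random.rand(300, 6)
y = np.random.randint(0, num_class, size=300)
dtrain = xgb.DMatrix(X, label=y)

# multi:softprob requires num_class to be set explicitly.
params = {"objective": "multi:softprob", "num_class": num_class}
bst = xgb.train(params, dtrain, num_boost_round=10)

# Reshape the ndata * nclass output to an (ndata, nclass) matrix;
# each row is a probability distribution over the classes.
probs = np.asarray(bst.predict(dtrain)).reshape(-1, num_class)
print(probs.shape)  # (300, 3)
```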
base_score [default=0.5] The initial prediction score of all instances (global bias). For a sufficient number of iterations, changing this value will not have much effect.
eval_metric [default according to objective] Evaluation metrics for validation data. A default metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). The user can add multiple evaluation metrics; Python users should remember to pass the metrics as a list of parameter pairs instead of a map, so that a later ‘eval_metric’ won’t override a previous one (see the sketch after this list). The choices are listed below:
- “rmse”: root mean square error
- “mae”: mean absolute error
- “logloss”: negative log-likelihood
- “error”: binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation regards instances with a prediction value larger than 0.5 as positive instances, and the others as negative instances.
- “error@t”: a binary classification threshold other than 0.5 can be specified by providing a numerical value through ‘t’.
- “merror”: multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
- “mlogloss”: multiclass logloss
- “auc”: area under the curve for ranking evaluation
- “ndcg”: Normalized Discounted Cumulative Gain
- “map”: Mean Average Precision
- “ndcg@n”, “map@n”: n can be assigned as an integer to cut off the top positions in the lists for evaluation.
- “ndcg-”, “map-”, “ndcg@n-”, “map@n-”: In XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. By adding “-” to the evaluation metric, XGBoost will evaluate these scores as 0, to be consistent under some conditions.
- “poisson-nloglik”: negative log-likelihood for Poisson regression
- “gamma-nloglik”: negative log-likelihood for gamma regression
- “gamma-deviance”: residual deviance for gamma regression
- “tweedie-nloglik”: negative log-likelihood for Tweedie regression (at a specified value of the tweedie_variance_power parameter)
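The sketch referenced above: passing params as a list of pairs in Python so that both ‘eval_metric’ entries are kept (a dict would silently keep only the last one); data is synthetic:

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(400, 5), np.random.randint(0, 2, size=400)
dtrain = xgb.DMatrix(X, label=y)

# List of pairs, not a dict: both metrics are reported each round.
params = [
    ("objective", "binary:logistic"),
    ("eval_metric", "logloss"),
    ("eval_metric", "auc"),
]
bst = xgb.train(params, dtrain, num_boost_round=5,
                evals=[(dtrain, "train")])
```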
Full list here: https://xgboost.readthedocs.io/en/latest//parameter.html