Dart - very poor accuracy

gugatr0n1c commented 7 years ago

When I use dart as a booster I always get very poor performance in term of l2 result for regression task. Even If I use small drop_rate = 0.01 or big like 0.3.

When I use dart in xgboost on same dataset, with similar setting (same learning rate, similiar num_trees) dart alwasy give me boost for accuracy (small but always).

But here accuracy is poor badly, like there is a bug, not just dart is not suitable for my task.

Can anyone confirm that dart is working for regression task in term of better accuracy?

My setting is as follows (part of the Python code for ramdom search of params):

lr = np.random.choice([0.01, 0.005, 0.0025]) list_count = np.random.choice([250, 500, 750, 1000]) min_in_leaf = np.random.choice([25, 50, 100]) subF = np.random.choice([0.15, 0.22, 0.3, 0.5, 0.66, 0.75]) subR = np.random.choice([0.66, 0.75, 0.83, 0.9]) max_depth = np.random.choice([9, 11, 15, 25, 45, 100, -1]) dart_rate = np.random.choice([0, 0, 0, 0.01, 0.03, 0.1]) max_bin = np.random.choice([63, 127, 255, 511]) lambda_l1 = np.random.choice([0, 1., 10., 100.]) lambda_l2 = np.random.choice([0, 1., 10., 100.])

iterace = 10000 if only_testing: min_in_leaf = 25 iterace = 10

boost_type = 'gbdt' if dart_rate > 0: boost_type = 'dart'

params = { 'task' : 'train', 'boosting_type' : boost_type, 'objective' : 'regression', 'metric' : 'l2', 'max_depth' : int(max_depth), 'num_leaves' : int(list_count), 'min_data_in_leaf' : int(min_in_leaf), 'learning_rate' : lr, 'feature_fraction' : subF,
'bagging_fraction' : subR, 'bagging_freq': 1, 'verbose' : 0, 'nthread' : nthread, 'drop_rate': dart_rate, 'max_bin': max_bin, 'lambda_l1' : lambda_l1, 'lambda_l2' : lambda_l2 }

model = lg.train( params, (matrix_learn, target_learn), num_boost_round = iterace, valid_datas = (matrix_test, target_test), early_stopping_rounds = 50 )

gugatr0n1c commented 7 years ago

What is working in XGB is probably not just drop_rate but also skip_drop. http://xgboost.readthedocs.io/en/latest/tutorials/dart.html

Can you please support this as well?

guolinke commented 7 years ago

@wxchan

https://github.com/dmlc/xgboost/blob/master/src/gbm/gbtree.cc#L547-L711

can you check the code of implementation in XGBoost, and figure out why?

wxchan commented 7 years ago

@guolinke I read these codes before. It's not hard to be added, but I have no idea where he came upon those parameters.

wxchan commented 7 years ago

@gugatr0n1c

Can you provide how you set up skip_rate etc. for xgboost?
Do you know where xgboost gets the ideas of skip_rate, sample_type and normalize_type?
Can you show accuracy comparisons between xgboost and lightgbm (both gbdt & dart) for your task?

guolinke commented 7 years ago

@wxchan, I am not familiar with XGBoost. Following is my guess:

weight_drop is the weights of all trees, not just dropped trees, right? If yes, I think the line you refer to is a bug. it should be weight_drop[idx_drop[i]] *= factor . If this is a bug, I am not sure why XGBoost can preform better....
It seems drop tree logic is based on sum_weight in XGBoost, but LightGBM is just random. Maybe we can adapt this.

wxchan commented 7 years ago

@guolinke I can add that option. I will do some investigations on this.

gugatr0n1c commented 7 years ago

@wxchan ad1] usually I run random search on parameters, and for xgboost on my datasets best setting is: skip_rate = 0.3 drop_rate = 0.2 I use default for next two dart parameters.

ad2] actually dart for xgboost was done with my issue here https://github.com/dmlc/xgboost/issues/809 And it seems to that marugari add skip_rate just like his idea. On other two setting I do not know the source.

ad3] my task is regression problem with 500k rows, 400 features, and RMSE on cross validation: randomforest from scikit: 5.46 deepnet (5 hidden layers, prelu, adam, xavier) from mxnet 5.43 xgboost without dart: 5.421 xgboost with dart: 5.419 lightgbm without dart: 5.418 lightgbm with dart: 5.5 - not a chance to beat randomforest

For all methods I did some random search of parameters and method should be comparable in the sence of RMSE. Speed is best for deepnet - but it is different algorithm (also depends on settings and hardware).

wxchan commented 7 years ago

@gugatr0n1c Can you test two things on your dataset:

For lightgbm dart, set drop_rate to a very small number, such as drop_rate=1/num_iter; because your num_iter is big, each trees may be dropped too many times;
For xgboost dart, set learning rate=1. For dart, learning rate is a different concept from gbdt. In original paper, it's fixed to 1.

marugari commented 7 years ago

It's not so difficult to add skip_rate and normalize_type. If there are users, I will PR.

gugatr0n1c commented 7 years ago

@wxchan ad1] I tried with 0.005, 0.0005, 0.0001 and got similar "bad" results... strange is that early stopping stop training very early (hundreds of iterations with compare two several thousands without dart)

ad2] I can do this tomorrow from work

wxchan commented 7 years ago

@marugari where do you get the idea about skip_rate and normalize_type?

another question: when sample_type = weighted in xgboost, dparam.rate_drop * weight_drop.size() * weight_drop[i] / sum_weight (from here) can be bigger than 1, the probability of 'this' tree being selected is calculated as some number bigger than 1 but actually 1. Will it cause the expectation of dropped trees number smaller than rate_drop?

marugari commented 7 years ago

@wxchan Regarding skip_rate, All we have to do is leaving idx_drop empty. In order to use learning_rate < 1 in dart, I propose the following:

/* dart.hpp L:103 */
if (normalize_type == 1) {
  shrinkage_rate_ = learning_rate_ / (1.0 + learning_rate_);
} else {
  shrinkage_rate_ = learning_rate_ / (drop_index_.size() + learning_rate_);
}
/* dart.hpp L:119 */
if (normalize_type == 1) {
  models_[curr_tree]->Shrinkage(-k / learning_rate_);
} else {
  models_[curr_tree]->Shrinkage(-1 / learning_rate_);
}

sample_type is very rough, since I think it's not involved in convergence.

wxchan commented 7 years ago

@marugari sorry, what I meant is, is there some paper regarding skip_rate and normalize_type, because they are not shown in dart original paper?

marugari commented 7 years ago

@wxchan sorry. They are my extensions, not published.

wxchan commented 7 years ago

@marugari I see, thanks.

wxchan commented 7 years ago

If you want to use skip_rate, you can arrange a callback with changing drop_rate.

Just my opinion on this: I think the parameters I add 'here' should be tested, I am not saying they are wrong, they can be wonderful ideas, but I want to be convinced before I add them 'here'.

skip_rate and sample_type added to my dart branch.

gugatr0n1c commented 7 years ago

@wxchan If I set in xgboost learning_rate to 1, accuracy is very bad as well, even worse than here...

wxchan commented 7 years ago

@gugatr0n1c in original paper, they only use shrinkage_rate=1/(1+num_drop_trees); as marugari said, he set shrinkagerate = learningrate / (dropindex.size() + learningrate); I think this is the main difference rather than skip_rate or sample_type.

But according to code in xgboost:

float factor = 1.0 * num_drop / (num_drop + lr);
for (size_t i = 0; i < idx_drop.size(); ++i) {
  weight_drop[i] *= factor;
}
for (size_t i = 0; i < size_new_trees; ++i) {
  weight_drop.push_back(1.0 / (num_drop + lr));
}

If num_drop is a reasonable integer, num_drop+lr should not have big differences when lr=1 or lr=0.1. It has a lot of math involved in deciding these weights, I actually don't fully get it.

Another difference: in original paper, their strategy is at least dropping one tree each round, which means num_drop >= 1, this is what I implemented too. xgboost don't have this 'at least one' part(in the mean time, they have skip_drop, with making num_drop=0 directly). It means, with a lr = 0.1 for example, weights of new trees can be very large (1/(0+0.1)=10). I think it's not correct.

gugatr0n1c commented 7 years ago

As I wrote I did not used 'sample_type' parameter. I only used 'drop_rate' and 'skip_rate' (together with learning_rate << 1) in xgboost. And it was working nicely. I can not say now the influence of 'skip_rate'. But when I used random search for hyperparameters tuning, it always wanted to set 'skip_rate' to something non zero. But, yes I agree that the main trick is to adapt shrinkage_rate as written.

But to be honest, if I can use below to simulate 'skip_rate', then it is not necessary to add this parameter, just maybe mention it as an example about callback? Actually it is even better solution, because I can then randomize drop_rate together with skip_rate (this is not even in xgboost) during learning process by some logic in generating drop_rate_list.

-->simulate 0.33 skip_rate drop_rate_list = np.random.choice([0., 0.3, 0.3], iteration_count).tolist() lg.train( params, data_train, num_boost_round = iteration_count, valid_sets = data_valid, early_stopping_rounds = 50, callbacks = [lg.reset_parameter(drop_rate = drop_rate_list)] )

So I believe there are nice things to do: 1] allow change drop_rate in callback, should work now right?, 2] change shrinkage_rate as marugari wrote.

marugari commented 7 years ago

@wxchan Although weight_drop of new trees can be large, leaf values of that are scaled by learning_rate. I think weight_drop * learning_rate = 1.0.

wxchan commented 7 years ago

@marugari oh sorry, I didn't see the if (num_drop == 0) branch. My fault. I also didn't know xgboost calculates learning_rate in another place, I thought it's calculated only in CommitModel.

Then I think it's reasonable to change shrinkage_rate to see if it works. It is actually a combination of shrinkage_rate in gbdt and normalization weight in dart.

guolinke commented 7 years ago

@gugatr0n1c can you test again with latest code?

gugatr0n1c commented 7 years ago

@guolinke I tried, but got error: Segmentation fault, I tried twice recompile from github

boostint_type = 'gbdt' is OK boostint_type = 'dart' return Segmentation fault

guolinke commented 7 years ago

@wxchan can you take a look about this?

wxchan commented 7 years ago

@guolinke it seems caused by c_api because cmd line version is fine. I take a look at segfault log: it happened in Dart::Init, seems gbdtconfig is still null after GBDT::ResetTrainingData.

guolinke commented 7 years ago

@gugatr0n1c @wxchan fixed, can you try it again?

gugatr0n1c commented 7 years ago

@guolinke training seems to be working, give you know about accuracy

gugatr0n1c commented 7 years ago

@guolinke seems to be not working properly. It almost not converge (even worse than before). Seems to me that first tree is build with very large learning_rate, than additional trees has huge problem to converge.

Tried with dart_rate: 0.3, 0.05, 0.001 and with learning_rate = 0.004 and 0.1 - almost same result on all settings.

guolinke commented 7 years ago

@gugatr0n1c OK, I see. @wxchan , I also try this, the training error is small. seems the learning rate is still wrong. I think the learning rate is still 1 at the first iteration..

wxchan commented 7 years ago

But all I change is to make learning rate a variable rather than fixed to 1. So if you set learning rate=1, it's same as before. The first tree is built with shrinkage_rate=1, it's also same as before.

I actually test this setting when I implemented dart and found it not work so well. So I didn't add it before.

Can you try to set learning rate to some number near 1 like 0.8 or 0.9? Because shrinkage_rate will divide num_drop in dart, I don't think it should set very small.

guolinke commented 7 years ago

@wxchan ,

when learning_rate = 1.0: weight of new tree: 1/(k+1) weight of drop tree: 1/(k'+1) * k/(k+1). sum of these weight is approx to 1.

current implementation: weight of new tree: lr/(k+lr) weight of drop tree: lr/(k+lr)* k/(k+lr) sum of these weights is still approx to 1.

However, I think if we add learning rate to DART, may be sum of these weights should approx to learning_rate.

So, maybe:

weight of new tree: lr/(k+1) weight of drop tree: lr/(k+1)* k/(k+1)

is better?

wxchan commented 7 years ago

@guolinke that is exactly what I have tried before and I think it's better, that's why I said it's a combination of gbdt shrinkage rate (lr term) and dart dropped weight (1/(k+1) term).

Current implementation is according to xgboost. All we know here is xgboost dart works better on some datasets, I think we still need better theory.

By the way, I just fixed that bug of xgboost dart I mentioned it earlier, check out here.

gugatr0n1c commented 7 years ago

I read the original paper and there is several thing to point out: 1] setting, where DART is beating MART is small number of trees (250), this is not typical in practice (maybe for randomforest, not for boosting on bigger datasets) - table 5., not sure if we can replicate results for 5k trees with small learning_rate (little bit questioning result from paper for bigger datasets).

2] 'core' idea is to built equivalently strong trees, this is interesting when lightgbm is building leaf_wise not level_wise - this can be different from xgboost - as in paper is shown different tree architecture figure 2.

3] Critical part for boosting is very small learning_rate, I do not believe this can substituted just by drop_rate for datasets where we need thousands of trees. Setting learning_rate to 1 kills convergence almost always after max hundreds of iterations.

4] What I believe is useful from the paper is idea of dropping trees. Because it works as next regularizer and this is what marugari did in xgboost implementation. But sure, there is no paper or something to backup this.

So bottom line: we need set learning_rate small and it is nice to have chance to drop trees as regularizer... but sure, it is little bit different against the paper. Up to you guys what to do.

guolinke commented 7 years ago

@wxchan I think xgboost using learning rate at here, not in GBTree.

So the learning rate of DRAT is not 1.0.

update:

sorry, I am wrong. xgboost did use lr/(k+lr)

guolinke commented 7 years ago

@gugatr0n1c can you try the xgboost with latest code? @wxchan just fix bug in xgboost.

wxchan commented 7 years ago

@guolinke but weight of the first tree is 1/(num_drop+lr) = 1/lr, it canceled out the lr.

guolinke commented 7 years ago

@wxchan , I see. Maybe we can do some experiments for these different settings. I think we can choose our own algorithms, not limit to the implementation in xgboost.

wxchan commented 7 years ago

I think:

weight of new tree: lr/(k+1)
weight of drop tree: lr/(k+1)* k/(k+1)

will be better. In this case, tree normalization is same as before, just new tree shrinks by lr.

guolinke commented 7 years ago

@wxchan I think this may can help

marugari commented 7 years ago

New trees and sum of dropped trees has similar leaf values. It seems that shrinking just new trees is improper.

guolinke commented 7 years ago

@marugari In you implementation. if not trees are dropped, add new_tree with weight lr. Total delta weight is lr.

if trees are dropped, add new_tree with lr/(k+lr), and shrink k trees by k/(k+lr). Total delta weight is lr/(k+lr) - k * lr/(k+lr)* w_j. w_j is the previous weight of drop tree j. w_j can be lr * (k/(k+lr))^i (if skipped) or lr/(k+lr)*(k/(k+lr))^i(not skip), i is drop count of this tree.

I think lr/(k+lr) - k * lr/(k+lr)* w_j is far smaller than lr, since lr/(k+lr) < lr and it minus a positive number. And when k==0(means not drop), it equal to 1, this is not consist with your operation when not trees are dropped.

BTW,

In original paper, delta weight is: 1/(k+1) - k/((k+1)*(k+1)) = 1/((k+1)(k+1))

Do you think let delta wight approx to lr/((k+1)*(k+1)) is better?

marugari commented 7 years ago

This is a little inaccurate. I'm Sorry...

New trees and sum of dropped trees scaled by previous weights has similar leaf values. Thus, the normalization factor is independent of previous weights.

If weights of dropped trees are small. new trees have also small leaf values. I don't think the delta weight is suitable characteristic.

guolinke commented 7 years ago

@marugari I think i understand your idea now. You let weight of new tree approx to the sum of dropped trees, and use skip_rate to add new tree and increase total weight?

It seems when skip_rate==0.5 has best accuracy.

My idea is let total weight to be increased at every iteration. I think this may save half of iterations while achieve same accuracy.

marugari commented 7 years ago

@guolinke You are right. In the case of small learning_rate, skipping iterations are main sources of loss reduction.

guolinke commented 7 years ago

I just push a commit at dart branch: https://github.com/Microsoft/LightGBM/commit/518bafd53570ebdf0db84aceefa989dd24a32394

It adds skip_drop and xgboost_dart_mode. I also try some experiments, and it seems xgboost_dart_mode works well in small dataset.

@gugatr0n1c can you try this with your data?

gugatr0n1c commented 7 years ago

love to test this, but have some xmas travelling now, so it will take some time..

guolinke commented 7 years ago

@gugatr0n1c take you time and merry xmas!

@wxchan Do you have any comments?

wxchan commented 7 years ago

It's better now. I think you can merge it to master and update docs for it.

guolinke commented 7 years ago

@wxchan , I am not sure should we keep xgboost_dart_mode? And should give it a better name?

gugatr0n1c commented 7 years ago

@guolinke merry xmas to you as well

I done some tests: for xDart (xgboost_dart_mode = True) I got great accuracy - this is great for my datasets, beating 'gbdt' for Dart (xgboost_dart_mode = False) still bad accuracy... (but done limited tests only with very small learning_rate)

But here both xDart and Dart have training very slow: 15mins (gdbt) vs. several hours for xDart, some iterations are very fast ~ 100ms, but some are about 15-20s (related to skip_drop?), DART in xgboost is not so slower to 'gbdt', is there space to speed it up?

microsoft / LightGBM

Dart - very poor accuracy #126