Closed nicolaskruchten closed 8 years ago
Fantastic results @nicolaskruchten, thanks for submitting (and congrats, @datacratic).
I added the code here: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/9a-datacratic.py
and the results here: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/README.md
I have some questions though:
- `validation_split = 0.5` seems to leave out part of the dataset. Can you re-run it with `validation_split = 0`? (should take 2x longer (?))
- `random_feature_propn` is usually the square root of the number of features, though it depends how you are treating categorical variables (one-hot encoding or something else). Can you play in a range and see how results change?

Re previous: or maybe `validation_split` is the proportion of data not used for each tree in the forest? (I guess I misunderstood initially as being some kind of holdout related to the "classifier.experiment")
validation_split=0.5 is a parameter for bagging, and causes each tree in the forest to run on a random 1/2 of the data. It's called validation_split because the other half of the data is available to the weak learner to use for early stopping; it's not used by the decision tree classifier but is important for when you use boosting as a weak learner. Were we to use validation_split=1.0, the only diversity would come from the selection of features in the weak learners.
In our experience, using bagging in this manner gives better diversity for the trees and a better AUC for held-out examples. We'll double-check on the exact effect and get back here, with an extra result if appropriate. I'm certain that it's not a holdout for the classifier experiment: all of the training data is seen by the classifier training and incorporated into the output.
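To make the described semantics concrete, here is a minimal sketch in scikit-learn (not MLDB's actual implementation; the dataset and hyperparameter values are made up for illustration): each bag trains a tree on a random half of the rows, and the held-out half would only matter to a weak learner that supports early stopping.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
rng = np.random.default_rng(0)

num_bags, validation_split = 10, 0.5
probs = np.zeros(len(y))
for bag in range(num_bags):
    # each bag trains on a random half of the rows; the other half would
    # only be used for early stopping by weak learners that support it
    train_mask = rng.random(len(y)) >= validation_split
    tree = DecisionTreeClassifier(max_depth=20, max_features=0.5,
                                  random_state=bag)
    tree.fit(X[train_mask], y[train_mask])
    probs += tree.predict_proba(X)[:, 1]
probs /= num_bags  # committee average over the bags, as in a random forest
```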
Categorical features are handled directly by the decision tree training, and aren't expanded as with a one-hot encoding (the decision tree training code does consider each categorical value separately, however, and so the effect is the same on the trained classifier). Thus, the number of features is the same as the number of columns in the dataset, and so sqrt(num features) will be 3 or so, which is too low (it would be faster, but accuracy would suffer, especially with such deep trees).
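For the feature-count arithmetic above (assuming roughly 10 columns, as in this benchmark's dataset):

```python
import math

num_features = 10  # columns in the dataset; categoricals are not one-hot encoded
features_per_split = round(math.sqrt(num_features))       # classic RF heuristic -> 3
random_feature_propn = features_per_split / num_features  # -> 0.3
```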
Stepping back, we are running bagged decision trees to get an effect as close as possible to classical random forests. This produces a committee of decision trees (like random forests) but has different hyperparameters and a different means of introducing entropy into the training set. Typically we would use bagged boosted decision trees, something like 20 bags of 20 rounds of boosting of depth 5 trees, for such a task but that would be hard to meaningfully compare with the other results in the benchmark.
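The "20 bags of 20 rounds of boosting of depth-5 trees" setup could be sketched roughly as follows, approximated with scikit-learn on toy data rather than MLDB/JML's trainer:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)

num_bags = 20
probs = np.zeros(len(y))
for bag in range(num_bags):
    # each bag sees a random half of the rows
    rows = rng.choice(len(y), size=len(y) // 2, replace=False)
    # 20 rounds of boosting of depth-5 trees, per the description above
    gbm = GradientBoostingClassifier(n_estimators=20, max_depth=5,
                                     random_state=bag)
    gbm.fit(X[rows], y[rows])
    probs += gbm.predict_proba(X)[:, 1]
probs /= num_bags
```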
Thanks @jeremybarnes for the comments and, I guess, for authoring JML https://github.com/jeremybarnes/jml - likely the main reason why MLDB is so fast.
Wrt `validation_split`: I misinterpreted it at first, but then soon realized it must be just the random subsampling of the data for each tree, as in the standard random forest algo. Thanks for the clarifications as well. So, to match the standard RF, `validation_split` ~ 0.34 (the default in MLDB) is perhaps best.
I see what you are saying on categorical data; that's what H2O does as well, and from what I can see it's a huge performance boost vs one-hot encoding. `sqrt(num features)` would be 2 or 3 depending on rounding, so `random_feature_propn` would be 1/2 or 1/3, right?
I think my understanding now fits what you are saying, except a bit your last paragraph. I have the impression now that with `validation_split` and `random_feature_propn` you can do exactly RF (or at least as close as other implementations); it's just that you call it bagging. If you want to simulate that with boosting, I guess you would need `min_iter = max_iter = 1` and somehow combine 100 of those.
I'm gonna try to run the code by @nicolaskruchten soon.
@jeremybarnes @datacratic @nicolaskruchten I'm trying to run your code. I get the same AUC even if I change the params, e.g. `num_bags`, `validation_split` etc. (I tried only a few values, but still). E.g. `num_bags = 100` or `num_bags = 200` give the same AUC: 0.7424885267.
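For reference, this is what varying the bag count should normally do, sketched outside MLDB with scikit-learn on synthetic data (the benchmark's actual numbers are not reproduced here): as long as each bag gets an independent random subsample, the committee's AUC should shift at least slightly as bags are added.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
Xtr, ytr, Xte, yte = X[:1000], y[:1000], X[1000:], y[1000:]
rng = np.random.default_rng(0)

def bagged_auc(num_bags):
    """AUC on held-out rows for a committee of num_bags subsampled trees."""
    probs = np.zeros(len(yte))
    for bag in range(num_bags):
        rows = rng.choice(len(ytr), size=500, replace=False)
        tree = DecisionTreeClassifier(max_depth=20, max_features=0.5,
                                      random_state=bag)
        tree.fit(Xtr[rows], ytr[rows])
        probs += tree.predict_proba(Xte)[:, 1]
    return roc_auc_score(yte, probs / num_bags)

auc_10, auc_100 = bagged_auc(10), bagged_auc(100)
# identical AUCs to 10 decimal places would suggest the subsampling
# (or the num_bags parameter) is being ignored
```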
That is strange. Can you look at the stdout/err of the docker container? You can see it finishing bags as it goes... The number should correspond to the parameter you set.
That leaves the possibility that there is a problem with the random selection and so each bag is the same. That's something that we can look into over the weekend.
Yes, I can see in the output bags going up to 100 and 200 respectively. It also takes more time to train 200, but AUC is the same. Also the same AUC if I change some of the other numeric params:
```json
"num_bags": 100,
"validation_split": 0.50,
"weak_learner": {
    "type": "decision_tree",
    "max_depth": 20,
    "random_feature_propn": 0.5
}
```
I'm sure you guys can figure out what's going on 100x faster than I ;)
I'm running this:

```
docker run --name=mldb --rm=true -p 127.0.0.1:80:80 quay.io/datacratic/mldb:latest
```
and then the code by @nicolaskruchten:
https://github.com/szilard/benchm-ml/blob/master/z-other-tools/9a-datacratic.py
Feel free to submit pull request for the above code if you guys make corrections.
It took a little longer than anticipated to look into, but here are the conclusions:

- Setting `validation_split` to 0 gives an AUC of 0.746 in about twice the time, for example. More on this below.
- … the `num_bags` hyperparameter. In any case, I was wrong about this being the only source of entropy in my comment above, and we should probably turn it off by setting `validation_split=0`; it does impact the accuracy of the trained classifier. This will be the default for the next MLDB release.
- I tried `random_feature_propn` values all the way up to 0.8. The algorithm is much faster at 0.3, and substantially slower at 0.8, but accuracy increases as it goes up. At 80% feature proportion and 0% validation split, the AUC is 0.753 but training is more like 40 seconds.

I would suggest that we re-submit with the parameters `validation_split=0` and `random_feature_propn=0.3` (≈sqrt(10) out of ~10 features) to provide the closest possible comparison with other systems for the purposes of this comparison. Does that make sense?
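Plugged into the config fragment quoted earlier in this thread, the suggestion would look something like the following (a sketch only; field names are assumed unchanged and the surrounding request is omitted):

```python
# Hypothetical re-submission parameters, mirroring the JSON fragment
# shown earlier in this thread.
suggested = {
    "num_bags": 100,
    "validation_split": 0.0,           # no per-bag holdout: every tree sees all rows
    "weak_learner": {
        "type": "decision_tree",
        "max_depth": 20,
        "random_feature_propn": 0.3,   # ~sqrt(10) of ~10 features per split
    },
}
```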
Thanks Jeremy for the clarifications (and yes, it absolutely makes sense).
It's amazing that a 15-year-old tool can keep up so well while new machine learning tools are being written every day. I've been saying for a while that machine learning looks more like an HPC problem to me than a "big data" one.
@nicolaskruchten (or @datacratic) can you guys run it and resubmit the results here with the settings @jeremybarnes suggested above? Also would be great if you can update the code https://github.com/szilard/benchm-ml/blob/master/z-other-tools/9a-datacratic.py and send a PR.
I will do an MLDB release and then resubmit the results and code :)
Awesome, thanks.
Thanks for new results. I'm gonna try to verify in a few days.
@nicolaskruchten I was able to verify your latest results. Also, it seems the "same AUC" problem has been fixed in the latest release. Thanks @datacratic @nicolaskruchten @jeremybarnes for contributing to the benchmark.
This code gives an AUC of 0.7417 in 12.1s for the 1M training set on an r3.8xlarge EC2 instance with the latest release of Datacratic's Machine Learning Database (MLDB), available at http://mldb.ai/