szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

Datacratic MLDB results #25

Closed: nicolaskruchten closed this issue 8 years ago

nicolaskruchten commented 8 years ago

This code gives an AUC of 0.7417 in 12.1s for the 1M training set on an r3.8xlarge EC2 instance with the latest release of Datacratic's Machine Learning Database (MLDB), available at http://mldb.ai/

from pymldb import Connection
mldb = Connection("http://localhost/")

mldb.v1.datasets("bench-train-1m").put({
    "type": "text.csv.tabular",
    "params": { "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/train-1m.csv" }
})

mldb.v1.datasets("bench-test").put({
    "type": "text.csv.tabular",
    "params": { "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/test.csv" }
})

mldb.v1.procedures("benchmark").put({
    "type": "classifier.experiment",
    "params": {
        "experimentName": "benchm_ml",
        "training_dataset": {"id": "bench-train-1m"},
        "testing_dataset": {"id": "bench-test"},
        "configuration": {
            "type": "bagging",
            "num_bags": 100,
            "validation_split": 0.50,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 20,
                "random_feature_propn": 0.5
            }
        },
        "modelFileUrlPattern": "file://tmp/models/benchml_$runid.cls",
        "label": "dep_delayed_15min = 'Y'",
        "select": "* EXCLUDING(dep_delayed_15min)",
        "mode": "boolean"
    }
})

import time

start_time = time.time()

result = mldb.v1.procedures("benchmark").runs.post({})

run_time = time.time() - start_time
auc = result.json()["status"]["folds"][0]["results"]["auc"]

print "\n\nAUC = %0.4f, time = %0.4f\n\n" % (auc, run_time)
szilard commented 8 years ago

Fantastic results @nicolaskruchten, thanks for submitting (and congrats, @datacratic).

I added the code here: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/9a-datacratic.py

and the results here: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/README.md

I have some questions though:

  1. validation_split = 0.5 seems to leave out part of the dataset. Can you re-run it with validation_split = 0? (It should take roughly 2x longer, right?)
  2. The usual default is to try sqrt(number of features) features at each split, so random_feature_propn should correspond to that, though it depends on how you are treating categorical variables (one-hot encoding or something else). Can you try a range of values and see how the results change? (See the sketch after this list.)
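A minimal sketch of the two re-runs being asked for, changing only keys that already appear in the submitted configuration; the candidate random_feature_propn values are illustrative assumptions, not numbers from this thread:

# Only the "configuration" block of the submitted procedure changes.
configuration = {
    "type": "bagging",
    "num_bags": 100,
    "validation_split": 0.0,           # question 1: each tree sees all of the data
    "weak_learner": {
        "type": "decision_tree",
        "max_depth": 20,
        "random_feature_propn": 0.3    # question 2: try e.g. 0.2, 0.3, 0.4, 0.5
    }
}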
szilard commented 8 years ago

Re my previous comment: or maybe validation_split is the proportion of the data not used for each tree in the forest? (I initially misunderstood it as some kind of holdout related to the "classifier.experiment".)

jeremybarnes commented 8 years ago

validation_split=0.5 is a parameter for bagging, and causes each tree in the forest to train on a random half of the data. It's called validation_split because the other half of the data is available to the weak learner for early stopping; that isn't used by the decision tree classifier, but it is important when you use boosting as a weak learner. Were we to use validation_split=0.0, the only diversity would come from the selection of features in the weak learners.

In our experience, using bagging in this manner gives better diversity among the trees and a better AUC on held-out examples. We'll double-check the exact effect and get back here, with an extra result if appropriate. I'm certain that it's not a holdout for the classifier experiment: all of the training data is seen by the classifier training and incorporated into the output.

Categorical features are handled directly by the decision tree training and aren't expanded as with a one-hot encoding (the decision tree training code does consider each categorical value separately, however, so the effect on the trained classifier is the same). Thus the number of features is the same as the number of columns in the dataset, and sqrt(num features) would be 3 or so, which is too low as a per-split feature count (it would be faster, but accuracy would suffer, especially with such deep trees).
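To make the column-count point concrete, here is a small illustration (pandas rather than MLDB; pandas is only an assumption for this example) comparing the raw predictor count with the one-hot-encoded count on the benchmark's test file:

# Illustrative only: with native categorical handling the feature count stays at
# the raw column count, while one-hot encoding expands it to many hundreds of columns.
import pandas as pd

df = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
X = df.drop(columns=["dep_delayed_15min"])
print(X.shape[1])                  # raw predictor columns (a handful)
print(pd.get_dummies(X).shape[1])  # one-hot-encoded columns (many hundreds)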

Stepping back, we are running bagged decision trees to get an effect as close as possible to classical random forests. This produces a committee of decision trees (like random forests) but has different hyperparameters and a different means of introducing entropy into the training set. Typically we would use bagged boosted decision trees, something like 20 bags of 20 rounds of boosting of depth-5 trees, for such a task, but that would be hard to compare meaningfully with the other results in the benchmark.
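As an illustration only, and in scikit-learn rather than MLDB (so none of the parameter names below reflect MLDB's actual configuration keys), the bagged-boosted-trees setup described above could be sketched roughly like this; X_train and y_train are assumed to be prepared separately:

# A rough sketch of bagged boosted trees: 20 bags, each bag a boosted ensemble
# of 20 depth-5 trees trained on a random half of the rows.
# Note: scikit-learn versions before 1.2 call the first argument base_estimator.
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

bagged_boosting = BaggingClassifier(
    estimator=GradientBoostingClassifier(n_estimators=20, max_depth=5),
    n_estimators=20,   # number of bags
    max_samples=0.5,   # each bag trains on a random half of the rows
    n_jobs=-1,
)
# bagged_boosting.fit(X_train, y_train)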

szilard commented 8 years ago

Thanks @jeremybarnes for the comments, and I guess for authoring JML (https://github.com/jeremybarnes/jml), which is likely the main reason why MLDB is so fast.

Wrt validation_split, I misinterpreted it at first, but then soon realized it must just be the random subsampling of the data for each tree, as in the standard random forest algorithm. Thanks for the clarifications as well. So, to match standard RF, validation_split ~ 0.34 (the default in MLDB) is perhaps the best choice.

I see what you are saying about categorical data; that's what H2O does as well, and from what I can see it's a huge performance boost vs one-hot encoding. sqrt(num features) would be 2 or 3 depending on rounding, so random_feature_propn would be around 1/2 or 1/3, right?
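For reference, a minimal back-of-the-envelope calculation of that proportion, assuming the dataset has 8 raw predictor columns (an assumption, not a number stated in this thread):

# Classic RF default: try sqrt(p) features per split, expressed here as a proportion.
import math

p = 8                      # assumed number of raw predictor columns
k = round(math.sqrt(p))    # -> 3
print(k, k / p)            # -> 3 0.375, i.e. a random_feature_propn of roughly 1/3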

I think my understanding now matches what you are saying, except perhaps for your last paragraph. My impression now is that with validation_split and random_feature_propn you can do exactly RF (or at least get as close as other implementations do); it's just that you call it bagging. If you wanted to simulate that with boosting, I guess you would need min_iter = max_iter = 1 and somehow combine 100 of those.

I'm gonna try to run the code by @nicolaskruchten soon.

szilard commented 8 years ago

@jeremybarnes @datacratic @nicolaskruchten I'm trying to run the code. I get the same AUC even if I change the params, e.g. num_bags, validation_split etc. (I only tried a few values, but still). E.g. num_bags = 100 and num_bags = 200 give the same AUC: 0.7424885267.
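A minimal sketch of the check being described, reusing the datasets and configuration from the submission above; the procedure and experiment names containing "bags" are hypothetical, as is the choice of values to sweep:

# Re-run the same classifier.experiment with different num_bags values and
# compare the resulting AUCs; everything else is kept as in the submission.
from pymldb import Connection

mldb = Connection("http://localhost/")

for num_bags in (100, 200):
    name = "benchmark_bags_%d" % num_bags
    mldb.v1.procedures(name).put({
        "type": "classifier.experiment",
        "params": {
            "experimentName": "benchm_ml_%s" % name,
            "training_dataset": {"id": "bench-train-1m"},
            "testing_dataset": {"id": "bench-test"},
            "configuration": {
                "type": "bagging",
                "num_bags": num_bags,
                "validation_split": 0.50,
                "weak_learner": {
                    "type": "decision_tree",
                    "max_depth": 20,
                    "random_feature_propn": 0.5
                }
            },
            "modelFileUrlPattern": "file://tmp/models/benchml_%s_$runid.cls" % name,
            "label": "dep_delayed_15min = 'Y'",
            "select": "* EXCLUDING(dep_delayed_15min)",
            "mode": "boolean"
        }
    })
    result = mldb.v1.procedures(name).runs.post({})
    auc = result.json()["status"]["folds"][0]["results"]["auc"]
    print("num_bags = %d -> AUC = %.10f" % (num_bags, auc))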

jeremybarnes commented 8 years ago

That is strange. Can you look at the stdout/stderr of the Docker container? You should be able to see it finishing bags as it goes, and the number should correspond to the parameter you set.

That leaves the possibility that there is a problem with the random selection and so each bag is the same. That's something that we can look into over the weekend.

szilard commented 8 years ago

Yes, I can see the bags in the output going up to 100 and 200, respectively. It also takes more time to train 200 bags, but the AUC is the same. I also get the same AUC if I change some of the other numeric params:

            "num_bags": 100,
            "validation_split": 0.50,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 20,
                "random_feature_propn": 0.5

I'm sure you guys can figure out what's going on 100x faster than I ;)

szilard commented 8 years ago

I'm running this: docker run --name=mldb --rm=true -p 127.0.0.1:80:80 quay.io/datacratic/mldb:latest and then the code by @nicolaskruchten: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/9a-datacratic.py

Feel free to submit a pull request for the above code if you guys make corrections.

jeremybarnes commented 8 years ago

It took a little longer than anticipated to look into, but here are the conclusions:

I would suggest that we re-submit with the parameters validation_split=0 and random_feature_propn=0.3 (roughly sqrt(10) features out of ~10) to provide the closest possible comparison with the other systems in the benchmark. Does that make sense?

szilard commented 8 years ago

Thanks Jeremy for the clarifications (and yes, that makes complete sense).

It's amazing that a 15-year-old tool can keep up so well while new machine learning tools are being written every day. I've been saying for a while that machine learning looks more like an HPC problem to me than a "big data" one.

@nicolaskruchten (or @datacratic) can you guys run it and resubmit the results here with the settings @jeremybarnes suggested above? It would also be great if you could update the code https://github.com/szilard/benchm-ml/blob/master/z-other-tools/9a-datacratic.py and send a PR.

nicolaskruchten commented 8 years ago

I will do an MLDB release and then resubmit the results and code :)

szilard commented 8 years ago

Awesome, thanks.

szilard commented 8 years ago

Thanks for the new results. I'm gonna try to verify them in a few days.

szilard commented 8 years ago

@nicolaskruchten I was able to verify your latest results. Also, it seems the "same AUC" problem has been fixed in the latest release. Thanks @datacratic @nicolaskruchten @jeremybarnes for contributing to the benchmark.