h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0
6.87k stars 2k forks

Setup test environment to make sure GBM reproducibility across same hardware setup using yarn #6517

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: MK has fixed the reproducibility issue with GBM across different hardware settings here: https://h2oai.atlassian.net/browse/PUBDEV-8425

We need to test it and make sure this parameter actually performs as we expect.

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: 3.35.0.5 or newer

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: Please check with [~accountid:5c355702a217aa69bce55831] on how to set up the environment and what tests to run, if needed.

Please check with [~accountid:5f8e6929461cc40075215ee0] on what tests to run.

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: (see attached image-20230208-222652.png)

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: Okay, I propose running this test to check for reproducibility across different hardware setups:

{noformat}
from __future__ import division
from builtins import range
import sys
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.tree import H2OTree
import tempfile

# helper function to copy into your notebook
def compare_frame_one_column(f1, f2, tol=1e-6):
    temp1 = f1.as_data_frame(use_pandas=False)
    temp2 = f2.as_data_frame(use_pandas=False)
    for rowInd in range(1, f1.nrow):
        v1 = float(temp1[rowInd][0])
        v2 = float(temp2[rowInd][0])
        diff = abs(v1 - v2) / max(1.0, abs(v1), abs(v2))
        assert diff <= tol, "Failed frame values check at row {2} and column {3}! frame1 value: {0}, column name: {4}." \
                            " frame2 value: {1}, column name: {5}".format(temp1[rowInd][0], temp2[rowInd][0],
                                                                          rowInd, 0, f1.names[0], f2.names[0])

# test starts
fr = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/covtype/covtype.full.csv")

# build the first model with one hardware configuration
m = H2OGradientBoostingEstimator(seed=1234, score_tree_interval=2)
m.train(x=list(range(0, 12)), y="Cover_Type", training_frame=fr)
pred = m.predict(fr)

# save the prediction result for comparison later; remember to give a
# different name to the runs from the other hardware setups
h2o.download_csv(pred, "/some/directory/pred.csv")

# save the model; it may need to be exported so it can be accessed from the
# other hardware environment
tmpdir = tempfile.mkdtemp()
m_path = m.download_model(tmpdir)

# to compare the predictions from different runs, do this:
pred = h2o.import_file("/path/to/pred.csv")
pred2 = h2o.import_file("/path/to/pred2.csv")
for index in range(1, pred.ncols):
    compare_frame_one_column(pred[index], pred2[index])

# load the model from the previous run:
m2 = h2o.load_model(m2_path)  # make sure m2_path is accessible in the current hardware environment

# compare the tree structures of both models to make sure they are the same
# (code by Adam Valenta; assert_list_equals comes from the h2o pyunit test
# utilities, and ntrees should be set to the ntrees of the trained models)
for ntree in range(ntrees):
    for output_class in ['class_1', 'class_2', 'class_3', 'class_4', 'class_5', 'class_6', 'class_7']:
        tree = H2OTree(model=m, tree_number=ntree, tree_class=output_class)
        tree2 = H2OTree(model=m2, tree_number=ntree, tree_class=output_class)
        assert_list_equals(tree.predictions, tree2.predictions)
        assert_list_equals(tree.thresholds, tree2.thresholds, delta=1e-50)  # need to specify delta to check nans
        assert_list_equals(tree.decision_paths, tree2.decision_paths)
        print("Tree", ntree, "class", output_class, "ok")
{noformat}
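The core of `compare_frame_one_column` above is a relative-tolerance check. A minimal standalone sketch (pure Python, no h2o dependency; the helper names here are hypothetical, not part of the h2o API):

```python
def rel_close(v1, v2, tol=1e-6):
    """True if v1 and v2 agree within a relative tolerance.

    The denominator is floored at 1.0 so values near zero are effectively
    compared with an absolute tolerance instead of blowing up the ratio.
    """
    diff = abs(v1 - v2) / max(1.0, abs(v1), abs(v2))
    return diff <= tol


def columns_match(col1, col2, tol=1e-6):
    """Compare two columns of numeric values element-wise."""
    if len(col1) != len(col2):
        return False
    return all(rel_close(float(a), float(b), tol) for a, b in zip(col1, col2))
```

This is the same comparison the assert in the helper performs, just returning a boolean instead of raising.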
exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: If you are checking for reproducibility on the same cluster, you can just do the following:

{noformat}
from __future__ import division
from builtins import range
import sys
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.tree import H2OTree
import tempfile

# helper function to copy into your notebook
def compare_frame_one_column(f1, f2, tol=1e-6):
    temp1 = f1.as_data_frame(use_pandas=False)
    temp2 = f2.as_data_frame(use_pandas=False)
    for rowInd in range(1, f1.nrow):
        v1 = float(temp1[rowInd][0])
        v2 = float(temp2[rowInd][0])
        diff = abs(v1 - v2) / max(1.0, abs(v1), abs(v2))
        assert diff <= tol, "Failed frame values check at row {2} and column {3}! frame1 value: {0}, column name: {4}." \
                            " frame2 value: {1}, column name: {5}".format(temp1[rowInd][0], temp2[rowInd][0],
                                                                          rowInd, 0, f1.names[0], f2.names[0])

fr = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/covtype/covtype.full.csv")

# build the first model
m = H2OGradientBoostingEstimator(seed=1234, score_tree_interval=2)
m.train(x=list(range(0, 12)), y="Cover_Type", training_frame=fr)
pred = m.predict(fr)

h2o.download_csv(pred, "/some/directory/pred.csv")

# build the second model with the same hardware configuration
m2 = H2OGradientBoostingEstimator(seed=1234, score_tree_interval=2)
m2.train(x=list(range(0, 12)), y="Cover_Type", training_frame=fr)
pred2 = m2.predict(fr)

for index in range(1, pred.ncols):
    compare_frame_one_column(pred[index], pred2[index])

# compare the tree structures of both models to make sure they are the same
# (code by Adam Valenta; assert_list_equals comes from the h2o pyunit test
# utilities, and ntrees should be set to the ntrees of the trained models)
for ntree in range(ntrees):
    for output_class in ['class_1', 'class_2', 'class_3', 'class_4', 'class_5', 'class_6', 'class_7']:
        tree = H2OTree(model=m, tree_number=ntree, tree_class=output_class)
        tree2 = H2OTree(model=m2, tree_number=ntree, tree_class=output_class)
        assert_list_equals(tree.predictions, tree2.predictions)
        assert_list_equals(tree.thresholds, tree2.thresholds, delta=1e-50)  # need to specify delta to check nans
        assert_list_equals(tree.decision_paths, tree2.decision_paths)
        print("Tree", ntree, "class", output_class, "ok")
{noformat}
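The `assert_list_equals` used in the tree comparison comes from h2o's pyunit test utilities; it is not defined in the snippet itself. A hypothetical NaN-aware sketch of what it needs to do (a plain `==` would reject identical trees because `NaN != NaN` in float comparison, which is why a `delta` must be passed for the thresholds):

```python
import math

def assert_list_equals(l1, l2, delta=0.0):
    """Sketch of a NaN-aware element-wise list comparison (hypothetical
    reimplementation of the h2o pyunit helper of the same name)."""
    assert len(l1) == len(l2), "length mismatch: %d vs %d" % (len(l1), len(l2))
    for i, (a, b) in enumerate(zip(l1, l2)):
        if isinstance(a, float) and isinstance(b, float):
            if math.isnan(a) and math.isnan(b):
                continue  # treat two NaNs (e.g. leaf-node thresholds) as equal
            assert abs(a - b) <= delta, "mismatch at index %d: %r vs %r" % (i, a, b)
        else:
            assert a == b, "mismatch at index %d: %r vs %r" % (i, a, b)
```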
exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: I set score_tree_interval=1, 2, or 8 in both tests and they still generate different outputs:

(see attached image-20230208-235411.png)

To clarify: I set both models to the same score_tree_interval value (1, 2, or 8) at the same time.

exalate-issue-sync[bot] commented 1 year ago

Adam Valenta commented: Since there is a known issue with variable importance, I ran tests, and it appears the issue affects only the variable importances, not the GBM training itself, so I would focus only on the prediction output. We can also use the H2OTree API to check the trees.

Here is a PR: https://github.com/h2oai/h2o-3/pull/6491

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-8979
Assignee: Arun Aryasomayajula
Reporter: Wendy Wong
State: Open
Fix Version: N/A
Attachments: Available (Count: 2)
Development PRs: N/A

h2o-ops commented 1 year ago

Attachments From Jira

Attachment Name: image-20230208-222652.png
Attached By: Wendy Wong
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8979/image-20230208-222652.png

Attachment Name: image-20230208-235411.png
Attached By: Wendy Wong
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8979/image-20230208-235411.png

wendycwong commented 11 months ago

@arunaryasomayajula : Any updates on this one?