h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Verify and show how activeProcessorCount can be used to ensure GBM reproducibility #6812

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: For H2O-3 versions 3.35.0.5 or newer, Michalk has completed this JIRA to ensure reproducibility: https://h2oai.atlassian.net/browse/PUBDEV-8425

We need to show how this works and add documentation on using it (https://h2oai.atlassian.net/browse/PUBDEV-8425) to ensure reproducibility.

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: Duplication here: https://h2oai.atlassian.net/browse/PUBDEV-8979

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: This basically means that we will run one test on one hardware setup with activeProcessorCount set to a number. Then, we will run the same test on another hardware setup with activeProcessorCount set to the same number. Python code for these runs is included in this comment.
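For illustration, here is a minimal sketch of how both runs could pin the processor count. It assumes that activeProcessorCount refers to the HotSpot JVM option `-XX:ActiveProcessorCount` and that the H2O cluster is started from Python through `h2o.init` with its `jvm_custom_args` argument; neither is confirmed in this issue. The helper names `processor_count_jvm_args` and `start_h2o_with_fixed_processors` are hypothetical.

```python
def processor_count_jvm_args(n_procs):
    """Build the JVM argument list that pins the processor count the JVM reports.

    Assumption: activeProcessorCount maps to the HotSpot option -XX:ActiveProcessorCount.
    """
    if not isinstance(n_procs, int) or n_procs < 1:
        raise ValueError("n_procs must be a positive integer")
    return ["-XX:ActiveProcessorCount={0}".format(n_procs)]


def start_h2o_with_fixed_processors(n_procs):
    """Start a local H2O instance with the pinned processor count (hypothetical helper).

    h2o is imported lazily so the argument builder above can be used without H2O installed.
    jvm_custom_args is ignored if an H2O instance is already running.
    """
    import h2o
    h2o.init(jvm_custom_args=processor_count_jvm_args(n_procs))
    return h2o
```

On each hardware setup, call `start_h2o_with_fixed_processors` with the same number (for example 4) before running the test, so that the only difference between the runs is the hardware.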

Okay, I will propose to run this test to check for reproducibility across different hardware setups:

{noformat}
from __future__ import division
from builtins import range
import sys
import h2o
import tempfile
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# helper functions to copy into your notebook

def extract_from_twoDimTable(metricOfInterest, fieldOfInterest, takeFirst=False):
    """
    Given a fieldOfInterest found in an H2O two-dimensional table (for example the scoring
    history or the variable importances of a model), extract the list of values of that field.

    :param metricOfInterest: H2O two-dimensional table to extract the field values from
    :param fieldOfInterest: string representing a field of interest.
    :param takeFirst: if True, only return the value from the first row
    :return: List of field values or None if it cannot be found
    """
    allFields = metricOfInterest._col_header
    # pass takeFirst through instead of always forcing it to False
    return extract_field_from_twoDimTable(allFields, metricOfInterest.cell_values, fieldOfInterest,
                                          takeFirst=takeFirst)

def extract_field_from_twoDimTable(allFields, cell_values, fieldOfInterest, takeFirst=False):
    if fieldOfInterest in allFields:
        cellValues = []
        fieldIndex = allFields.index(fieldOfInterest)
        for eachCell in cell_values:
            cellValues.append(eachCell[fieldIndex])
            if takeFirst:  # only grab the result from the first iteration.
                break
        return cellValues
    else:
        return None

def compare_frame_one_column(f1, f2, tol=1e-6):
    # as_data_frame(use_pandas=False) returns a list of rows; row 0 holds the column names
    temp1 = f1.as_data_frame(use_pandas=False)
    temp2 = f2.as_data_frame(use_pandas=False)

    # data rows are 1..nrow, so the upper bound is nrow + 1 to include the last row
    for rowInd in range(1, f1.nrow + 1):
        v1 = float(temp1[rowInd][0])
        v2 = float(temp2[rowInd][0])

        # relative difference, falling back to absolute difference for values below 1
        diff = abs(v1 - v2) / max(1.0, abs(v1), abs(v2))
        assert diff <= tol, "Failed frame values check at row {2} and column {3}! frame1 value: {0}, column name: {4}." \
                            " frame2 value: {1}, column name:{5}".format(temp1[rowInd][0], temp2[rowInd][0],
                                                                         rowInd, 0, f1.names[0], f2.names[0])

# test starts

fr = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/covtype/covtype.full.csv")

# build first model with one hardware configuration

m = H2OGradientBoostingEstimator(seed=1234, score_tree_interval=2)
m.train(x=list(range(0, 12)), y="Cover_Type", training_frame=fr)
pred = m.predict(fr)

# save the prediction result for comparison later; remember to use a different file name for runs on other hardware setups

h2o.download_csv(pred, "/some/directory/pred.csv")
relative_importance = extract_from_twoDimTable(m._model_json["output"]["variable_importances"],
                                               "relative_importance", takeFirst=False)

# just print out the relative importance and eyeball to see if they are the same

print(relative_importance)

# to compare the predictions from different runs, do this:

pred = h2o.import_file("/path/to/pred.csv")
pred2 = h2o.import_file("/path/to/pred2.csv")
for index in range(1, pred.ncols):  # column 0 is the predicted label; compare the probability columns
    compare_frame_one_column(pred[index], pred2[index])
{noformat}
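The saved prediction files can also be compared without a running H2O cluster. The sketch below is not part of the proposed test. It uses only the Python standard library and applies the same relative-difference check as `compare_frame_one_column`. It assumes each CSV has one header row, with the predicted label in the first column and numeric class probabilities in the remaining columns; the function names `relative_diff` and `compare_prediction_csvs` are hypothetical.

```python
import csv


def relative_diff(v1, v2):
    """Same metric as in compare_frame_one_column: relative for large values, absolute below 1."""
    return abs(v1 - v2) / max(1.0, abs(v1), abs(v2))


def compare_prediction_csvs(path1, path2, tol=1e-6, skip_columns=1):
    """Return a list of (row, column, value1, value2) tuples where the two files differ by more than tol.

    skip_columns=1 skips the predicted-label column, which is assumed to be non-numeric.
    """
    with open(path1, newline="") as f1, open(path2, newline="") as f2:
        rows1 = list(csv.reader(f1))
        rows2 = list(csv.reader(f2))
    if len(rows1) != len(rows2):
        raise ValueError("files have different numbers of rows")

    mismatches = []
    for row_ind in range(1, len(rows1)):  # row 0 is the header
        for col_ind in range(skip_columns, len(rows1[row_ind])):
            v1 = float(rows1[row_ind][col_ind])
            v2 = float(rows2[row_ind][col_ind])
            if relative_diff(v1, v2) > tol:
                mismatches.append((row_ind, col_ind, v1, v2))
    return mismatches
```

An empty list from `compare_prediction_csvs("/path/to/pred.csv", "/path/to/pred2.csv")` means the two runs agree within the tolerance. Returning the mismatches instead of asserting on the first one makes it easier to see how widespread any differences are.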

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-8988
Assignee: Arun Aryasomayajula
Reporter: Wendy Wong
State: Reopened
Fix Version: N/A
Attachments: N/A
Development PRs: N/A