h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Save a h2o.ai model to S3 bucket in python #9260

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I have been using the command below to save my H2O model to an S3 bucket from Python 3 (I am using Amazon EMR):

h2o.save_model(model=best_gbm1, path='s3://bucketname/folder1/folder2', force=False)

but I get the following error:

H2OServerError: HTTP 500 Server Error: Server error java.lang.RuntimeException: Error: Not implemented Request: None

Is it possible to save an H2O model directly to an S3 bucket?

exalate-issue-sync[bot] commented 1 year ago

Lauren DiPerna commented: This issue is also posted on Stack Overflow: https://stackoverflow.com/questions/55182284/save-a-h2o-ai-model-to-s3-bucket-in-python

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Currently, this is not supported.

PersistS3 Class, line 263.

{code:java}
// Store Value v to disk.
@Override
public void store(Value v) {
  if( !v._key.home() ) return;
  throw H2O.unimpl(); // VA only
}
{code}

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: S3A supports it.

PersistHDFS class:

{code:java}
@Override
public void store(Value v) {
  // Should be used only if ice goes to HDFS
  assert this == H2O.getPM().getIce();
  assert !v.isPersisted();

  byte[] m = v.memOrLoad();
  assert (m == null || m.length == v._max); // Assert not saving partial files
  store(new Path(_iceRoot, getIceName(v)), m);
}
{code}

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Reclassified as an improvement with minor priority. The preferred way is to use S3A/S3N (on EMR).

exalate-issue-sync[bot] commented 1 year ago

Prabhu Subramanian commented: Hi All,

Is this also applicable for the below export?

{code:python}h2o.export_file(data_frame, path='s3a://…..'){code}

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Yes, the same applies to all export functions: you need to use “s3a” for your exports.
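
For readers landing here, the advice above amounts to swapping the URI scheme before calling the save or export function. A minimal sketch, assuming a Hadoop-enabled H2O cluster (for example on EMR) where the S3A connector is available; `to_s3a` is a hypothetical helper, not part of the h2o package:

```python
def to_s3a(uri):
    """Return the same S3 location with the s3a:// scheme.

    Hypothetical helper: rewrites s3:// and s3n:// URIs and leaves
    anything else (local paths, hdfs://, s3a:// URIs) untouched.
    """
    for prefix in ("s3://", "s3n://"):
        if uri.startswith(prefix):
            return "s3a://" + uri[len(prefix):]
    return uri


# Usage against a running cluster (not executed here):
#   import h2o
#   h2o.save_model(model=best_gbm1,
#                  path=to_s3a("s3://bucketname/folder1/folder2"),
#                  force=False)
```

Whether the write then succeeds still depends on how the cluster's Hadoop libraries and credentials are set up.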

exalate-issue-sync[bot] commented 1 year ago

Prabhu Subramanian commented: Hi Michal,

I know this might not be directly related to this ticket, but I would really appreciate your help understanding the error below, which I get from the following export call:

{code:python}h2o.export_file(data_frame, path='s3a://bucket_name/path/dataset.csv'){code}

Error below:

{code:python}
H2OServerError: HTTP 500 Server Error:
Server error water.api.HDFSIOException:
  Error: HDFS IO Failure:
 accessed URI : s3://com.squarkai.seer.develop.project-8/test/Churn_Train.csv
 configuration: Configuration: core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, /Users/prabhusubramanian/Desktop/F Folder/RA Squark/h2o-3.32.0.2/core-site.xml
 org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 Error Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error Message: <?xml version="1.0" encoding="UTF-8"?>
InvalidAccessKeyId
The AWS Access Key Id you provided does not exist in our records.
{code}

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: This looks like you provided an invalid AWS access key ID. Can you make sure it is correct?

exalate-issue-sync[bot] commented 1 year ago

Prabhu Subramanian commented: Hi Michal,

The credentials provided through the XML file do work for {{h2o.import_file('s3://…')}}

but not for the export statements, even with {{s3a}} or {{s3n}}. I tried every combination without success. I am sure the credentials are right, because the import statements work with them while the export statements do not.
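
One thing that may be worth ruling out, offered as an assumption rather than a diagnosis of the error above: Hadoop's S3A connector reads its credentials from its own properties, `fs.s3a.access.key` and `fs.s3a.secret.key`, so a configuration file that only carries keys for another scheme may not cover `s3a://` exports. The sketch below renders such a file; `build_core_site` and the placeholder values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_core_site(properties):
    """Render a Hadoop-style core-site.xml document from a dict of properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values only; real credentials should be supplied out of band.
xml_text = build_core_site({
    "fs.s3a.access.key": "YOUR_ACCESS_KEY_ID",
    "fs.s3a.secret.key": "YOUR_SECRET_ACCESS_KEY",
})
print(xml_text)
```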

exalate-issue-sync[bot] commented 1 year ago

Kunal Mishra commented: I’ll throw a +1 in for implementing saving to S3 natively! As it is, I’ll probably save locally and use the R package {{aws.s3}} to work around the limitation, for anyone else looking for alternative solutions.

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Thanks for the input. I think it would be a great change to add.

exalate-issue-sync[bot] commented 1 year ago

Kunal Mishra commented: Yup. Leaving an implementation here for anybody who comes through looking for the same thing!

{code:r}save_h2o_model_to_s3 <- function(h2o_model, s3_path, save_type = 'model', local_save_dir = tempdir(), keep_local = FALSE, show_progress = TRUE, force = TRUE) {

#' @Description: Saves an H2O model to S3

#' @param h2o_model: a reference to the H2O model that needs to be saved
#' @param s3_path: a string containing the name the object should have in S3 (i.e., its "object key" or its intended S3 URI), as supplied to aws.s3::put_object()
#' @param save_type: a string, indicating which h2o.save function to use, between 'model', 'mojo', and 'model_details'
#' @param local_save_dir: An absolute path to the directory in which h2o_model will be saved
#' @param keep_local: A logical, indicating whether the local copy of the saved h2o_model should be kept after being pushed to S3
#' @param show_progress: A logical indicating whether to show a progress bar for uploads. Default is given by options("verbose").
#' @param force: A logical, indicating whether to overwrite files that already exist.
#' @Returns: The h2o_model, invisibly

if (save_type == 'model') {
    local_save_path <- h2o::h2o.saveModel(object = h2o_model, path = local_save_dir, force = force)
} else if (save_type == 'mojo') {
    local_save_path <- h2o::h2o.save_mojo(object = h2o_model, path = local_save_dir, force = force)
} else if (save_type == 'model_details') {
    local_save_path <- h2o::h2o.saveModelDetails(object = h2o_model, path = local_save_dir, force = force)
} else {
    assertthat::assert_that(FALSE, msg = 'Unsupported save_type passed to save_h2o_model_to_s3(). Supported types are limited to "model", "model_details", and "mojo"')
}

aws.s3::put_object(
    file = local_save_path,
    object = s3_path,
    multipart = TRUE,
    show_progress = show_progress
)

if (!keep_local) {
    suppressWarnings(file.remove(local_save_path))
}

return(invisible(h2o_model))

}{code}
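
Since the original question was about Python, the same save-locally-then-upload workaround can be sketched there as well. This is a minimal sketch under stated assumptions: H2O runs on the same machine as the Python client (so the path returned by `h2o.save_model` is readable locally), `boto3` is installed with credentials configured, and the helper names `split_s3_uri` and `save_h2o_model_to_s3` are hypothetical.

```python
import os
import tempfile
from urllib.parse import urlsplit


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parts = urlsplit(s3_uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"Not an s3:// URI: {s3_uri!r}")
    return parts.netloc, parts.path.lstrip("/")


def save_h2o_model_to_s3(model, s3_uri, local_dir=None, keep_local=False):
    """Save `model` locally with h2o.save_model, then upload it with boto3.

    `s3_uri` is treated as a folder; the object keeps the local file name.
    """
    # Imported lazily so split_s3_uri works without these packages installed.
    import boto3
    import h2o

    bucket, prefix = split_s3_uri(s3_uri)
    local_dir = local_dir or tempfile.mkdtemp()
    local_path = h2o.save_model(model=model, path=local_dir, force=True)

    # Build the object key as "<prefix>/<file name>", skipping an empty prefix.
    file_name = os.path.basename(local_path)
    key = "/".join(p for p in (prefix.rstrip("/"), file_name) if p)

    boto3.client("s3").upload_file(local_path, bucket, key)
    if not keep_local:
        os.remove(local_path)
    return f"s3://{bucket}/{key}"
```

A call would look like `save_h2o_model_to_s3(best_gbm1, "s3://bucketname/folder1/folder2")`, which returns the URI of the uploaded object.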

exalate-issue-sync[bot] commented 1 year ago

Prabhu Subramanian commented: Should we expect this fix in the upcoming version? Has it been fixed, or was it set aside?

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: The ticket is resolved as “fixed”, meaning the code change was implemented and the target release will have this feature working.

The fix version was set to 3.34.0.1, H2O’s next major release, which you can expect in 1-2 months.
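
A quick way to check whether an installation is at or past that fix version is to compare version tuples. This is a minimal sketch, assuming plain dotted numeric version strings; `at_least_fix_version` is a hypothetical helper, and for a remote cluster it is the cluster's version rather than the client's that is likely to matter.

```python
def at_least_fix_version(version, fixed_in=(3, 34, 0, 1)):
    """Return True if a dotted version string is at or after `fixed_in`."""
    parts = tuple(int(p) for p in version.split("."))
    return parts >= fixed_in


# Usage with the installed client (not executed here):
#   import h2o
#   at_least_fix_version(h2o.__version__)
```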

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: You are welcome to try this feature in our nightly builds:

http://h2o-release.s3.amazonaws.com/h2o/master/latest.html

Please keep in mind that I resolved the ticket only today, so the current nightly will not have it yet. It should appear there after a day or two.

exalate-issue-sync[bot] commented 1 year ago

Prabhu Subramanian commented: Thank you very much, Michal! Looking forward to it. Appreciate your updates.

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6364
Assignee: Michal Kurka
Reporter: Reyhaneh Esmaielbeiki
State: Resolved
Fix Version: 3.34.0.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/5423