h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Rapids: when appending a vec to an existing data frame, h2o creates a new data frame and sets the R pointer to it, while still keeping the original frame in memory #15096

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Append a vector/column to an existing data frame.

Expected: h2o returns the same frame with the appended column.

Actual: h2o returns a new frame with the appended column (the original frame still exists in memory).

    aa = h2o.uploadFile("/Users/nidhimehta/h2o/smalldata/logreg/prostate.csv", destination_frame = "prostate_data")
    dim(aa)
    [1] 380 9
    aa$New = floor(aa$PSA)
    dim(aa)
    [1] 380 10

But now aa points to a new frame, 'RTMP_3' (screenshot from Flow attached). Note that RTMP_3 is the appended frame, with 10 columns.

    h2o.ls()
                key
    1        RTMP_3
    2 prostate_data

exalate-issue-sync[bot] commented 1 year ago

Raymond Peck commented: Pretty sure this is By Design (tm).

exalate-issue-sync[bot] commented 1 year ago

Cliff Click commented: Not a bug. Proper behavior, working as designed.

exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: This was not a good user experience. I ran out of memory on a biggish dataset because I did not realize that h2o was making copies of my dataset and storing them while appearing to return the same pointer to me. That is, I assumed that 'aa' (from R) kept pointing to the same data frame, while internally it pointed to a different frame after each append and all of those frames were kept.

exalate-issue-sync[bot] commented 1 year ago

Spencer Aiello commented: This was never by design? R does the update in place (and we always used to do that as well...)

exalate-issue-sync[bot] commented 1 year ago

Cliff Click commented: Well, no, R does NOT update in place. At least not always (i.e., exiting a scope). And the different places where this happens, or not, are only visible in the R client - not the H2O server - and mostly only after-the-fact.

For running out of memory on the cluster, try using R's gc() call and see what gets reclaimed.

Cliff

exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: When I run R's gc() it doesn't reclaim anything. I run out of memory because when I run an append, H2O creates a copy of my whole dataset (puts it into the DKV) and keeps the original as well.

    aa = h2o.uploadFile("/Users/nidhimehta/h2o/smalldata/logreg/prostate.csv", destination_frame = "prostate_data")
    dim(aa)
    [1] 380 9
    aa$New = floor(aa$PSA)
    dim(aa)
    [1] 380 10
    gc()
             used (Mb) gc trigger (Mb) max used (Mb)
    Ncells 458321 24.5     940480 50.3   750400 40.1
    Vcells 663618  5.1    1308461 10.0   925201  7.1
    h2o.ls()
                key
    1        RTMP_3
    2 prostate_data
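Editor's sketch: one possible reading of why gc() reclaims nothing in the session above is that the client variable still references the temp frame, and the named frame is never collected automatically. The toy model below is written in Python so it runs without a cluster. `ToyStore`, `ToyHandle`, and the rule "only temp keys are dropped when the client handle dies" are hypothetical illustrations, not H2O's actual DKV or client code.

```python
import gc
import weakref


class ToyStore:
    """Stand-in for a server-side key/value store (hypothetical, not H2O's DKV)."""

    def __init__(self):
        self.frames = {}
        self._next_tmp = 1

    def put_named(self, key, columns):
        # User-named keys live until explicitly removed.
        self.frames[key] = dict(columns)
        return key

    def put_temp(self, columns):
        # Temp keys get generated names such as RTMP_1, RTMP_2, ...
        key = f"RTMP_{self._next_tmp}"
        self._next_tmp += 1
        self.frames[key] = dict(columns)
        return key

    def ls(self):
        return sorted(self.frames)


class ToyHandle:
    """Client-side pointer to a stored frame."""

    def __init__(self, store, key, temp):
        self.store = store
        self.key = key
        if temp:
            # Assumption: only temp keys are removed when the handle is collected.
            weakref.finalize(self, store.frames.pop, key, None)

    def with_column(self, name, values):
        # Appending builds a new frame under a new temp key.
        columns = dict(self.store.frames[self.key])
        columns[name] = list(values)
        return ToyHandle(self.store, self.store.put_temp(columns), temp=True)


store = ToyStore()
aa = ToyHandle(store, store.put_named("prostate_data", {"PSA": [1.4, 2.9]}), temp=False)
aa = aa.with_column("New", [1, 2])  # aa now points at a temp key

gc.collect()
print(store.ls())  # both keys remain: aa still references the temp frame

aa = None  # drop the last client reference to the temp frame
gc.collect()
print(store.ls())  # only the named key remains
```

Under these assumptions, garbage collection on the client can only free temp frames that nothing references any more; the named frame stays until the user removes it.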

exalate-issue-sync[bot] commented 1 year ago

Cliff Click commented: Per Tom's request, "prostate_data" is never deleted automatically, only on request by the user. Nor is it ever modified, unless you assign to it directly. Per R's spec, "aa$New = ..." preserves the entirety of "prostate_data" as-is. No sharing of columns, lest you later assign via "aa"... and inadvertently modify the shared columns (thus modifying a conceptually unrelated dataset). Also, last.value is probably keeping at least 1 thing alive.

So both keys are alive, as expected.

Cliff
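Editor's sketch: the "no sharing of columns" argument in the comment above can be illustrated with plain Python dictionaries standing in for frames. `append_shared` and `append_copied` are hypothetical helpers invented for this illustration; they do not correspond to any H2O function, and the real server-side copy mechanics may differ.

```python
import copy


def append_shared(frame, name, values):
    """Append a column but reuse the existing column objects (shallow copy)."""
    new_frame = dict(frame)
    new_frame[name] = list(values)
    return new_frame


def append_copied(frame, name, values):
    """Append a column after copying every existing column (deep copy)."""
    new_frame = copy.deepcopy(frame)
    new_frame[name] = list(values)
    return new_frame


# Shared columns: a later in-place write through aa leaks into the named frame.
prostate_data = {"PSA": [1.4, 2.9, 5.1]}
aa = append_shared(prostate_data, "New", [1, 2, 5])
aa["PSA"][0] = 99.0
print(prostate_data["PSA"][0])  # 99.0: the named frame changed as well

# Copied columns: the named frame is preserved, at the cost of a second copy.
prostate_data = {"PSA": [1.4, 2.9, 5.1]}
aa = append_copied(prostate_data, "New", [1, 2, 5])
aa["PSA"][0] = 99.0
print(prostate_data["PSA"][0])  # 1.4: the named frame is untouched
```

The trade-off is the one debated in this thread: copying protects the named frame from accidental modification, but every append holds a second full copy of the data in memory.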

exalate-issue-sync[bot] commented 1 year ago

Cliff Click commented: I'm open to other behaviors, but these behaviors have been strongly encouraged by e.g. Tom, and declared sane by Hank & Mark.

I'm not sure everybody understands the consequences however...

Cliff

exalate-issue-sync[bot] commented 1 year ago

Spencer Aiello commented: I was always under the impression that slot assignment (which this is technically...) is update-in-place....

for the same exact reason that

    a[1] = a[1].asfactor()

doesn't create a whole new frame, keeps the same pointer, etc. Even though the old!=new. So it appears inconsistent with itself, and inconsistent with R behavior regarding appending columns via $<-

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-2185
Assignee: Cliff Click
Reporter: Nidhi Mehta
State: Resolved
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: append_vec.png
Attached By: Nidhi Mehta
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-2185/append_vec.png