h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Support nested structures of parquet data #12568

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

H2O cannot handle nested structures in Parquet data, such as an array/list in a column.

To support this case, the structure should be flattened during the parsing phase. For example, a column "step" with a list as its value should be converted to step_1, step_2, ... step_n, where n is the size of the longest list in that column in the parsed dataset. For lists shorter than n elements, the missing values can be filled with NaN (null should not be used to distinguish "empty" values). A nested structure of simple fields should be parsed with a defined separator, e.g. "-", so that {cup: {color, shape}} becomes cup-color, cup-shape.
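The flattening rule described above can be sketched in plain Python. This is only an illustration of the proposed naming scheme, not H2O parser code: the function names `flatten_record` and `flatten_records` are hypothetical, rows are assumed to be already decoded into Python dicts, and lists are assumed to contain scalar values only.

```python
import math
from typing import Any, Dict, List


def flatten_record(record: Dict[str, Any], sep: str = "-", prefix: str = "") -> Dict[str, Any]:
    """Flatten one row: nested dicts become parent<sep>child, lists become name_1..name_k."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Nested structure: recurse and join names with the separator.
            flat.update(flatten_record(value, sep, name))
        elif isinstance(value, list):
            # List column: one output column per position, numbered from 1.
            for i, item in enumerate(value, start=1):
                flat[f"{name}_{i}"] = item
        else:
            flat[name] = value
    return flat


def flatten_records(records: List[Dict[str, Any]], sep: str = "-") -> List[Dict[str, Any]]:
    """Flatten all rows and pad columns missing from a row with NaN."""
    flat_rows = [flatten_record(r, sep) for r in records]
    # Collect every column name seen in any row, in first-seen order.
    columns: List[str] = []
    for row in flat_rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    # Rows with shorter lists get NaN in the columns they lack.
    return [{name: row.get(name, math.nan) for name in columns} for row in flat_rows]


rows = flatten_records([
    {"step": [1, 2], "cup": {"color": "red", "shape": "round"}},
    {"step": [3, 4, 5], "cup": {"color": "blue", "shape": "square"}},
])
```

Here the longest "step" list has three elements, so both rows end up with step_1 through step_3, and the first row holds NaN in step_3. A real implementation would need to do this per chunk during parsing and would have to decide how to treat lists of structs, which this sketch leaves unflattened.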

A flat structure is valid for all algorithms provided by H2O and allows nested Parquet data to be read without information loss.

Another solution is to support Apache Arrow, as Ruslan suggests here: https://0xdata.atlassian.net/browse/SW-556?focusedCommentId=51240&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-51240

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5712
Assignee: New H2O Bugs
Reporter: windyk
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A