H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
H2O cannot handle nested structures in Parquet data, such as a column whose values are arrays/lists.
To support this case, the structure should be flattened during the parsing phase. For example, a column "step" with a list as its value should be converted to step_1, step_2, ... step_n, where n is the size of the longest list in that column in the parsed dataset. For lists shorter than n elements, the missing values can be filled with NaN (null should not be used for these "empty" values). A nested struct of simple fields should be flattened using a defined separator, e.g. "-": {cup: {color, shape}} becomes cup-color, cup-shape.
A flat structure is valid input for all algorithms provided by H2O and allows nested Parquet data to be read without information loss.
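The flattening rule described above can be sketched in plain Python. This is only an illustration of the proposed behaviour, not H2O's parser: the function names `flatten_value` and `flatten_rows`, the row-of-dicts input, and the handling of lists nested inside structs are all assumptions made for the example.

```python
import math
from typing import Any, Dict, List

NAN = float("nan")


def flatten_value(name: str, value: Any, sep: str = "-") -> Dict[str, Any]:
    """Flatten one cell into {flat_column_name: scalar} (hypothetical helper)."""
    out: Dict[str, Any] = {}
    if isinstance(value, dict):
        # Struct: join parent and field names with the separator, e.g. cup-color.
        for key, child in value.items():
            out.update(flatten_value(f"{name}{sep}{key}", child, sep))
    elif isinstance(value, (list, tuple)):
        # List: one column per position, numbered from 1, e.g. step_1, step_2.
        for i, child in enumerate(value, start=1):
            out.update(flatten_value(f"{name}_{i}", child, sep))
    else:
        out[name] = value
    return out


def flatten_rows(rows: List[Dict[str, Any]], sep: str = "-") -> List[Dict[str, Any]]:
    """Flatten every row, then pad columns a row lacks with NaN."""
    flat_rows = []
    for row in rows:
        flat: Dict[str, Any] = {}
        for name, value in row.items():
            flat.update(flatten_value(name, value, sep))
        flat_rows.append(flat)

    # Union of all flat column names, in first-seen order. The longest list
    # in a column therefore determines how many step_i columns exist.
    columns: List[str] = []
    seen = set()
    for flat in flat_rows:
        for col in flat:
            if col not in seen:
                seen.add(col)
                columns.append(col)

    return [{col: flat.get(col, NAN) for col in columns} for flat in flat_rows]


example_rows = [
    {"step": [0.5, 0.7], "cup": {"color": "red", "shape": "round"}},
    {"step": [0.1, 0.2, 0.3], "cup": {"color": "blue", "shape": "square"}},
]
example_flat = flatten_rows(example_rows)
print(sorted(example_flat[0]))
```

In this sketch the first row receives NaN for step_3 because its list is shorter than the longest one. A real implementation would need to determine n from the data while parsing, since the Parquet schema does not record list lengths.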
Another solution is to support Apache Arrow, as Ruslan suggests here: https://0xdata.atlassian.net/browse/SW-556?focusedCommentId=51240&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-51240