h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Import from SQL table is not guaranteed to work when importing in chunks and/or over many nodes #15503

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

As it stands today, H2O uses SQL SELECT with OFFSET/LIMIT logic (or an equivalent, depending on the database), which doesn't guarantee consistency between consecutive calls because row order is unspecified without an ORDER BY. The problem is even more pronounced in a distributed environment, where multiple processes/nodes connect to the database in parallel to ingest (import) data from a table by dividing it into chunks. This applies equally to PostgreSQL, Teradata and other databases.

Short of implementing a SQL CURSOR (not feasible), one option is to add a new parameter (key or order by?) that orders the rows and so guarantees such consistency when dividing table rows into chunks. The SQL ORDER BY clause would have to be applied together with SELECT and the OFFSET/LIMIT logic, in accordance with the logic implemented for each database.
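To illustrate what "in accordance with the logic implemented for each database" could look like, here is a minimal Python sketch that builds an ordered chunk query per dialect. The function name `paged_select` and the dialect keys are hypothetical, and the templates only show commonly used pagination syntax for each database; this is not H2O's actual query generation.

```python
def paged_select(dialect, table, order_by, offset, limit):
    """Build one ordered chunk query (illustrative only, hypothetical helper).

    `order_by` is a comma-separated column list supplied by the user.
    """
    if dialect == "postgresql":
        # PostgreSQL: LIMIT/OFFSET after ORDER BY
        return (f"SELECT * FROM {table} ORDER BY {order_by} "
                f"LIMIT {limit} OFFSET {offset}")
    if dialect == "sqlserver":
        # SQL Server 2012+: OFFSET ... FETCH requires an ORDER BY
        return (f"SELECT * FROM {table} ORDER BY {order_by} "
                f"OFFSET {offset} ROWS FETCH NEXT {limit} ROWS ONLY")
    if dialect == "teradata":
        # Teradata: filter on a window row number with QUALIFY (1-based, inclusive)
        first, last = offset + 1, offset + limit
        return (f"SELECT * FROM {table} "
                f"QUALIFY ROW_NUMBER() OVER (ORDER BY {order_by}) "
                f"BETWEEN {first} AND {last}")
    raise ValueError(f"unsupported dialect: {dialect}")


print(paged_select("teradata", "my_table", "id", 10, 5))
```

In every variant the ordering columns drive the chunk boundaries, so the result is only stable if those columns identify each row uniquely.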

The new parameter could simply be a character string containing one or more comma-separated column names to use with ORDER BY. For backward compatibility, make it optional and fall back to the current implementation when it is missing. It will be the user's responsibility to supply a key (one or more columns) that uniquely orders the table rows. Using such a parameter (correctly) will likely affect performance but will guarantee correctness.
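A minimal sketch of the proposed behavior, using Python's built-in `sqlite3` purely for demonstration. The helper `chunk_queries` and its `order_by` argument are hypothetical names standing in for the proposed parameter; when `order_by` is omitted the queries keep the current unordered form.

```python
import sqlite3


def chunk_queries(table, total_rows, chunk_size, order_by=None):
    """Return one SELECT per chunk.

    `order_by` is the hypothetical optional parameter: a comma-separated
    list of column names. When it is None, no ORDER BY is emitted, which
    mirrors the current (unordered) behavior.
    """
    order_clause = f" ORDER BY {order_by}" if order_by else ""
    return [
        f"SELECT * FROM {table}{order_clause} LIMIT {chunk_size} OFFSET {offset}"
        for offset in range(0, total_rows, chunk_size)
    ]


# Small in-memory table with a unique key column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, f"row{i}") for i in range(10)])

# Fetch the chunks independently, as separate nodes would, then combine them.
ids = []
for query in chunk_queries("t", total_rows=10, chunk_size=4, order_by="id"):
    ids.extend(row[0] for row in conn.execute(query).fetchall())

# With a unique ordering key, every row appears exactly once.
assert sorted(ids) == list(range(10)) and len(ids) == len(set(ids))
```

The column names are interpolated directly into the SQL text here, so a real implementation would need to validate or quote them as identifiers.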

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Thank you for reporting!

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5821
Assignee: New H2O Bugs
Reporter: Gregory Kanevsky
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A

wendycwong commented 1 year ago

https://h2oai.atlassian.net/browse/PUBDEV-5821