h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0
6.89k stars 2k forks source link

h2o.import_sql_select() does not respect ORDER BY clause #7030

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

If the {{select_query}} argument to {{h2o.import_sql_select()}} has an {{ORDER BY}} clause, then the rows of the returned data frame are not ordered in the way specified by the {{ORDER BY}} clause.

From what I can gather from reading {{h2o-3-master/h2o-core/src/main/java/water/jdbc/SQLManager.java}}, the reason for this is that {{select_query}} is either wrapped into a subquery or used to create a temporary table. In the first case, the {{ORDER BY}} clause is not used in the outer query, so ordering is not preserved. In the second case, the {{ORDER BY}} clause is not used when reading from the temporary table, so ordering is again not preserved.

I noticed this behavior because I was writing an application that uses {{h2o.import_sql_select()}} with an {{ORDER BY}} clause. My assumption that the returned data frame would be ordered as expected ultimately caused my application to experience large fluctuations in prediction accuracy that were difficult to diagnose.

Would it be possible to ensure that the {{ORDER BY}} clause is respected in {{h2o.import_sql_select()}}? If not, this limitation should at least be documented.

exalate-issue-sync[bot] commented 1 year ago

Zuzana Olajcová commented: Hi, [~accountid:61fa94800d3777006a915ab8] can you provide reproducible example or at least specify the db system you are using, please? I’ve tried multiple queries on PostgreSQL and so far ORDER BY works for me. Thank you!

exalate-issue-sync[bot] commented 1 year ago

Aohan Dang commented: [~accountid:5db1981f654fb20d96acbd52] I’ve been using a proprietary database system that is not in the list of supported databases for the {{h2o.import_sql_select()}} function. However, the issue I described is possible in the supported databases as well.

I’m going to focus on the case where {{fetch_mode='SINGLE'}}.

When we specify {{use_temp_table=False}}, {{h2o.import_sql_select()}} wraps {{select_query}} into a subquery, as indicated by the {{SQLManager.java}} source file. For example, {{SELECT FROM mydata ORDER BY id}} is wrapped into {{SELECT FROM (SELECT * FROM mydata ORDER BY id) sub_h2o_import}}. One issue with this approach is that some database systems such as SQL Server don't allow {{ORDER BY}} in a subquery unless {{TOP}} is also present (see [https://docs.microsoft.com/en-us/sql/relational-databases/performance/subqueries?view=sql-server-ver15#:~:text=The%20SELECT%20query%20of%20a%20subquery%20is%20always%20enclosed%20in%20parentheses.%20It%20cannot%20include%20a%20COMPUTE%20or%20FOR%20BROWSE%20clause%2C%20and%20may%20only%20include%20an%20ORDER%20BY%20clause%20when%20a%20TOP%20clause%20is%20also%20specified.|https://docs.microsoft.com/en-us/sql/relational-databases/performance/subqueries?view=sql-server-ver15#:~:text=The%20SELECT%20query%20of%20a%20subquery%20is%20always%20enclosed%20in%20parentheses.%20It%20cannot%20include%20a%20COMPUTE%20or%20FOR%20BROWSE%20clause%2C%20and%20may%20only%20include%20an%20ORDER%20BY%20clause%20when%20a%20TOP%20clause%20is%20also%20specified.|smart-link] .) Thus, the wrapped query is not syntactically valid, and the database reports a SQL error. MariaDB is different in that it allows {{ORDER BY}} in a subquery but ignores it ([https://mariadb.com/kb/en/why-is-order-by-in-a-from-subquery-ignored/|https://mariadb.com/kb/en/why-is-order-by-in-a-from-subquery-ignored/|smart-link] ). It is up to the database implementation to choose how to order the results when the outer query does not have an {{ORDER BY}} clause. Sometimes, the order just happens to match the order that was expected, which makes providing a reproducible example a challenge.

When we specify {{use_temp_table=False}}, {{h2o.import_sql_select()}} creates a temporary table named {{table_for_h2o_import}} and then populates it using {{select_query}}. Afterward, it executes {{SELECT * FROM table_for_h2o_import}} to create the H2O data frame. Since this query has no {{ORDER BY}} clause, there is no guarantee that the order of results will match the original order specified by {{select_query}}. Again, a challenge with providing a reproducible example is that it is up to the database implementation to choose an order, which sometimes happens to match the order that was expected.

Let me know whether this is enough information to continue working on this problem. I can try to come up with a reproducible example if that is desired.

exalate-issue-sync[bot] commented 1 year ago

Zuzana Olajcová commented: Thanks [~accountid:61fa94800d3777006a915ab8] , I am able to reproduce on both: SQL Server and MariaDB.

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-8660 Assignee: Zuzana Olajcová Reporter: Aohan Dang State: In Progress Fix Version: N/A Attachments: N/A Development PRs: Available

h2o-ops commented 1 year ago

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/6174