ianmcook / tidyquery

Query R data frames with SQL
Apache License 2.0
167 stars 12 forks source link

Support SQL-92-style join queries #7

Closed ianmcook closed 4 years ago

ianmcook commented 4 years ago

queryparser now supports queries that use SQL-92-style (explicit) join syntax (https://github.com/ianmcook/queryparser/issues/20). Add support for this to tidyquery.

Limit the initial implementation in tidyquery to two-table joins; implement joins of three or more tables later. With three or more tables, there are major challenges caused by differences in how SQL engines and dplyr perform joins: SQL engines process joins by looking at all the source tables at once, whereas dplyr can only process one join at a time, and in each successive join, the left table is all the tables previously joined. dplyr does not have any built-in mechanism for tracking which columns in a join result came from which tables, and in joins of three or more tables, tidyquery would need to keep track of that to ensure that column references in the join conditions and in other clauses point to the correct columns in the left table.

The dplyr join functions coalesce the join key columns from the left and right tables and can return only these coalesced values, not the separate join key columns from the left and right tables. For the initial implementation in tidyquery, return only the coalesced join key column(s) and disallow qualified references to the join key column(s) from the left and/or right table in outer joins, to prevent users from thinking they're running a query to check for unmatched rows. In a later version of tidyquery, before the join, use mutate() to add new columns to the left and right tables containing the join keys, then join by those columns and remove them after the join; this would preserve the separate join key columns from the left and right tables.