mithrandie / csvq

SQL-like query language for csv
https://mithrandie.github.io/csvq
MIT License
1.5k stars 65 forks source link

csvq complains about ambiguous column name in LATERAL JOIN query. #30

Closed derekmahar closed 4 years ago

derekmahar commented 4 years ago

As I stated very briefly at the start of my LATERAL JOIN test, my test slightly modified the example queries in PostgreSQL’s LATERAL JOIN. Here, I repeat my test using the original and unaltered queries:

$ cat orders.csv | csvq "SELECT * FROM STDIN"
+----+---------+----------------------------+
| id | user_id |         created_at         |
+----+---------+----------------------------+
| 1  | 1       | 2017-06-20 04:35:03.582895 |
| 2  | 2       | 2017-06-20 04:35:07.564973 |
| 3  | 3       | 2017-06-20 04:35:10.986712 |
| 4  | 1       | 2017-06-20 04:58:10.137503 |
| 5  | 3       | 2017-06-20 04:58:17.905277 |
| 6  | 3       | 2017-06-20 04:58:25.289122 |
+----+---------+----------------------------+
-- lateral_query_original1.sql
SELECT user_id, first_order_time, next_order_time, id FROM
  (SELECT user_id, min(created_at) AS first_order_time FROM orders GROUP BY user_id) o1
  LEFT JOIN LATERAL
  (SELECT id, created_at AS next_order_time
   FROM orders
   WHERE user_id = o1.user_id AND created_at > o1.first_order_time
   ORDER BY created_at ASC LIMIT 1)
   o2 ON true;
-- lateral_query_original2.sql
SELECT user_id, first_order_time, next_order_time, id FROM
  (SELECT user_id, min(created_at) AS first_order_time FROM orders GROUP BY user_id) o1
  INNER JOIN LATERAL
  (SELECT id, created_at AS next_order_time
   FROM orders
   WHERE user_id = o1.user_id AND created_at > o1.first_order_time
   ORDER BY created_at ASC LIMIT 1)
   o2 ON true;
$ cat orders.csv | csvq --source lateral_query_original1.sql
lateral_query_original1.sql [L:6 C:10] field user_id is ambiguous
$ cat orders.csv | csvq --source lateral_query_original2.sql
lateral_query_original2.sql [L:6 C:10] field user_id is ambiguous

In both cases, csvq complains that the non-scoped or unaliased reference to user_id in the second or lateral subquery in the FROM clause is ambiguous. Assuming, possibly incorrectly, that the PostgreSQL SQL interpreter follows the SQL standard and the csvq error is incorrect, in the presence of the same column reference in multiple adjacent subqueries in the same FROM clause, when resolving the non-scoped reference to user_id in the second subquery, csvq should choose the instance from within the immediate enclosing scope.

mithrandie commented 4 years ago

Yes, I’m aware of the difference in behavior between csvq and postgresql.

Regardless of what the specification allows, I think that table name modifiers should be used when specifying column names, and there are many minor differences between csvq and standard sql or other DBMS. This is one of them.

mithrandie commented 4 years ago

Wait, I may now have found an easy way to implement it. If it's feasible, I'll do it later.

derekmahar commented 4 years ago

Regardless of what the specification allows, I think that table name modifiers should be used when specifying column names, and there are many minor differences between csvq and standard sql or other DBMS.

I agree with you that it is a good practise to specify table name modifiers explicitly in order to avoid column name ambiguity. I presume the SQL standard allows queries to omit table name modifiers on ambiguous column name references for the benefit of query concision.

mithrandie commented 4 years ago

Fixed the scope in LATERAL subquery and released in version 1.13.1.

derekmahar commented 4 years ago

csvq 1.13.1 didn't report any error when I repeated the test:

$ csvq --version
csvq version 1.13.1
$ csvq --source lateral_query_original1.sql
+---------+----------------------------+-----------------------------+------+
| user_id |      first_order_time      |       next_order_time       |  id  |
+---------+----------------------------+-----------------------------+------+
| 1       | 2017-06-20 04:35:03.582895 | 2017-06-20 04:58:10.137503  | 4    |
| 2       | 2017-06-20 04:35:07.564973 |            NULL             | NULL |
| 3       | 2017-06-20 04:35:10.986712 | 2017-06-20 04:58:17.905277  | 5    |
+---------+----------------------------+-----------------------------+------+
$ csvq --source lateral_query_original2.sql
+---------+----------------------------+-----------------------------+----+
| user_id |      first_order_time      |       next_order_time       | id |
+---------+----------------------------+-----------------------------+----+
| 1       | 2017-06-20 04:35:03.582895 | 2017-06-20 04:58:10.137503  | 4  |
| 3       | 2017-06-20 04:35:10.986712 | 2017-06-20 04:58:17.905277  | 5  |
+---------+----------------------------+-----------------------------+----+
derekmahar commented 4 years ago

Great work!