_wrap_select_list_columns makes for incorrect query when GROUP BY is used

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

select continent, count(*) from world.Country group by continent;
+---------------+----------+
| continent     | count(*) |
+---------------+----------+
| Asia          |       51 |
| Europe        |       46 |
| North America |       37 |
| Africa        |       58 |
| Oceania       |       28 |
| Antarctica    |        5 |
| South America |       14 |
+---------------+----------+

set @q := 'select continent, count(*) from world.Country group by continent';
call common_schema._wrap_select_list_columns(@q, 9, @err);
select @q;
+-------------------------------------------------------------------------------
--------------------------------------------------------------------------------
------------------------------------------+
| @q                                                                            

                                          |
+-------------------------------------------------------------------------------
--------------------------------------------------------------------------------
------------------------------------------+
| select (select  continent) as col1, (select  count(*) ) as col2, null as 
col3, null as col4, null as col5, null as col6, null as col7, null as col8, 
null as col9 from world.Country group by continent |
+-------------------------------------------------------------------------------
--------------------------------------------------------------------------------
------------------------------------------+
select (select  continent) as col1, (select  count(*) ) as col2, null as col3, 
null as col4, null as col5, null as col6, null as col7, null as col8, null as 
col9 from world.Country group by continent ;
+---------------+------+------+------+------+------+------+------+------+
| col1          | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 |
+---------------+------+------+------+------+------+------+------+------+
| Asia          |    1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| Europe        |    1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| North America |    1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| Africa        |    1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| Oceania       |    1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| Antarctica    |    1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| South America |    1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+---------------+------+------+------+------+------+------+------+------+

The idea of wrapping columns as "(select __original_column_definition__) as 
col1" appears to be completely wrong when aggregation is used.

This makes for a severe problem. Any ideas out of it?

Original issue reported on code.google.com by shlomi.n...@gmail.com on 6 Nov 2011 at 6:32

GoogleCodeExporter commented 9 years ago

I'm seeing several possible solutions:

1. Back to square one: do better work at parsing a query, plant the "col1", 
"col2" aliases directly inside the query, without adding the wrapping "(select 
...) col1".
This means identifying and replacing existing aliases, with or without the "AS" 
keyword.
I think this will be very complicated.

2. Back to square zero: create a temporary table into which the results of the 
user's query are inserted. We use the   _wrap_select_list_columns to identify 
the number of columns in the original query.
- We may choose to pad the original query with NULLs so as to reach a 
pre-defined number of columns
- Or we may just detect the number of columns, thereby writing the correct 
INSERT INTO my_tmp (col1, ..., coln) SELECT... 
- Or we may yet modify the original query by wrapping an outer query around it, 
making for a temporary table, with our own desired aliases:
  SELECT col1, ..., coln FROM (__original_query__) select_original

Thoughts?

Original comment by shlomi.n...@gmail.com on 6 Nov 2011 at 6:45

GoogleCodeExporter commented 9 years ago

Last thought, is there a way to somehow fix the incorrectness of the GROUP BY 
problem?

Original comment by shlomi.n...@gmail.com on 6 Nov 2011 at 6:46

GoogleCodeExporter commented 9 years ago

Shlomi, 

I think we should aim for option 1 - parse and insert aliases propery, without 
wrapping columns in (...).

I recall there was something complicated about that approach, but it escapes me 
what that was exactly.

Alternatively, we could rewrite queries to look like this:

SELECT NULL AS _dummy, @1:=continent, @2:=count(*), ..

where _dummy is a generated column so you have a known column to fetch in the 
cursor loop, and @1, @2 etc are variables to capture the actual values. 

This should be relatively simple (inject dummy column immediately after the 
top-level SELECT, and inject variable after each top-level comma). The bonus to 
this approach is that you get unlimited number of column references - simply 
replace every variable reference with its corresponding user-defined variable. 
The downside of that approach is that it won't work with UNION in the top-level 
query - those queries would have to be wrapped in a SELECT of their own.  

I'll try and investigate how hard it would be to go for solution 1, ok?

Original comment by roland.bouman on 6 Nov 2011 at 9:26

GoogleCodeExporter commented 9 years ago

Roland,

To answer your question, the problem was that it was very difficult to analyze 
the aliases for some columns; sometimes with AS, sometimes without.

I think the solution with user variables is equally complicated, since you 
would have to wrap the exact definition in order to avoid priority issues:
SELECT @col1 := val + bonus AS amount
vs
SELECT (@col1 := (val + bonus)) AS amount
Need to verify with myself that this is actually required.

If it does: if you can do the latter properly, that also solves your first 
problem.

Original comment by shlomi.n...@gmail.com on 7 Nov 2011 at 10:21

GoogleCodeExporter commented 9 years ago

Shlomi, good point - it looks like whatever solution we take, at the core we 
simply need to properly parse column expressions and their aliases.

I think it should be possible to use a rather simple model for this. I will get 
back at this later this evening.

Original comment by roland.bouman on 7 Nov 2011 at 11:09

GoogleCodeExporter commented 9 years ago

Roland,

The solution of using temporary tables can make for an easier workaround, I 
think. I'll give it further thought as well.

Original comment by shlomi.n...@gmail.com on 7 Nov 2011 at 11:17

GoogleCodeExporter commented 9 years ago

Fixed in revision 149
Solution based on parsing+modifying query statement

Original comment by shlomi.n...@gmail.com on 10 Nov 2011 at 11:04

Changed state: Verified

mesaque / common-schema

_wrap_select_list_columns makes for incorrect query when GROUP BY is used #18