ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.13k stars 590 forks source link

bug(sqlite): bug with topk in SQLite #2612

Closed datapythonista closed 2 years ago

datapythonista commented 3 years ago

Based on the examples of our tutorial:

import os
import ibis

ibis.options.interactive = True

connection = ibis.sqlite.connect(os.path.join('data', 'geography.db'))
countries = connection.table('countries')
>>> countries
    iso_alpha2 iso_alpha3  iso_numeric fips                  name  \
0           AD        AND           20   AN               Andorra   
1           AE        ARE          784   AE  United Arab Emirates   
2           AF        AFG            4   AF           Afghanistan   
3           AG        ATG           28   AC   Antigua and Barbuda   
4           AI        AIA          660   AV              Anguilla   
..         ...        ...          ...  ...                   ...   
247         YE        YEM          887   YM                 Yemen   
248         YT        MYT          175   MF               Mayotte   
249         ZA        ZAF          710   SF          South Africa   
250         ZM        ZMB          894   ZA                Zambia   
251         ZW        ZWE          716   ZI              Zimbabwe   

              capital   area_km2  population continent  
0    Andorra la Vella      468.0       84000        EU  
1           Abu Dhabi    82880.0     4975593        AS  
2               Kabul   647500.0    29121286        AS  
3           St. Johns      443.0       86754        NA  
4          The Valley      102.0       13254        NA  
..                ...        ...         ...       ...  
247             Sanaa   527970.0    23495361        AS  
248         Mamoudzou      374.0      159042        AF  
249          Pretoria  1219912.0    49000000        AF  
250            Lusaka   752614.0    13460305        AF  
251            Harare   390580.0    13061000        AF

[252 rows x 9 columns]

Note the 252 rows.

Then, filtering by the 3 continents with more countries on them (Africa, Europe and Asia), we should get around half the rows:

>>> countries[countries.continent.topk(3)]
     iso_alpha2 iso_alpha3  iso_numeric fips                  name  \
0            AD        AND           20   AN               Andorra   
1            AE        ARE          784   AE  United Arab Emirates   
2            AF        AFG            4   AF           Afghanistan   
3            AG        ATG           28   AC   Antigua and Barbuda   
4            AI        AIA          660   AV              Anguilla   
...         ...        ...          ...  ...                   ...   
9995         NL        NLD          528   NL           Netherlands   
9996         NO        NOR          578   NO                Norway   
9997         NP        NPL          524   NP                 Nepal   
9998         NR        NRU          520   NR                 Nauru   
9999         NU        NIU          570   NE                  Niue   

               capital  area_km2  population continent  
0     Andorra la Vella     468.0       84000        EU  
1            Abu Dhabi   82880.0     4975593        AS  
2                Kabul  647500.0    29121286        AS  
3            St. Johns     443.0       86754        NA  
4           The Valley     102.0       13254        NA  
...                ...       ...         ...       ...  
9995         Amsterdam   41526.0    16645000        EU  
9996              Oslo  324220.0     5009150        EU  
9997         Kathmandu  140800.0    28951852        AS  
9998             Yaren      21.0       10065        OC  
9999             Alofi     260.0        2166        OC  

[10000 rows x 9 columns]

Looks like it's not filtering, but also duplicating rows massively.

This is the query being generated:

SELECT t0.name,
       t0.continent 
FROM (SELECT t1.iso_alpha2 AS iso_alpha2,
             t1.iso_alpha3 AS iso_alpha3,
             t1.iso_numeric AS iso_numeric,
             t1.fips AS fips,
             t1.name AS name,
             t1.capital AS capital,
             t1.area_km2 AS area_km2,
             t1.population AS population,
             t1.continent AS continent 
      FROM base.countries AS t1,
           (SELECT t1.iso_alpha2 AS iso_alpha2,
                   t1.iso_alpha3 AS iso_alpha3,
                   t1.iso_numeric AS iso_numeric,
                   t1.fips AS fips,
                   t1.name AS name,
                   t1.capital AS capital,
                   t1.area_km2 AS area_km2,
                   t1.population AS population,
                   t1.continent AS continent 
            FROM base.countries AS t1 
            WHERE EXISTS (SELECT 1 
                          FROM (SELECT t3.continent AS continent,
                                t3.count AS count 
                                FROM (SELECT t1.continent AS continent,
                                             count(t1.continent) AS count 
                                      FROM base.countries AS t1
                                      GROUP BY t1.continent) AS t3
                                ORDER BY t3.count DESC
                                LIMIT ?
                                OFFSET ?) AS t2 
                          WHERE t1.continent = t2.continent
                          )
            )
      ) AS t0
LIMIT ?
OFFSET ?
datapythonista commented 3 years ago

I tried to test if the error also happens in PostgreSQL, and looks like it's raising an exception:

ProgrammingError: (psycopg2.errors.SyntaxError) subquery in FROM must have an alias
LINE 2: FROM countries AS t0, (SELECT t0.iso_alpha2 AS iso_alpha2, t...
                              ^
HINT:  For example, FROM (SELECT ...) [AS] foo.

[SQL: SELECT t0.iso_alpha2, t0.iso_alpha3, t0.iso_numeric, t0.fips, t0.name, t0.capital, t0.area_km2, t0.population, t0.continent 
FROM countries AS t0, (SELECT t0.iso_alpha2 AS iso_alpha2, t0.iso_alpha3 AS iso_alpha3, t0.iso_numeric AS iso_numeric, t0.fips AS fips, t0.name AS name, t0.capital AS capital, t0.area_km2 AS area_km2, t0.population AS population, t0.continent AS continent 
FROM countries AS t0 
WHERE EXISTS (SELECT 1 
FROM (SELECT t2.continent AS continent, t2.count AS count 
FROM (SELECT t0.continent AS continent, count(t0.continent) AS count 
FROM countries AS t0 GROUP BY t0.continent) AS t2 ORDER BY t2.count DESC 
 LIMIT %(param_1)s) AS t1 
WHERE t0.continent = t1.continent)) 
 LIMIT %(param_2)s]
[parameters: {'param_1': 3, 'param_2': 10000}]
(Background on this error at: http://sqlalche.me/e/13/f405)

This is the generated SQL which happens to be invalid:

SELECT t0.iso_alpha2, t0.iso_alpha3, t0.iso_numeric, t0.fips, t0.name, t0.capital, t0.area_km2, t0.population, t0.continent 
FROM countries AS t0, (SELECT t0.iso_alpha2 AS iso_alpha2, t0.iso_alpha3 AS iso_alpha3, t0.iso_numeric AS iso_numeric, t0.fips AS fips, t0.name AS name, t0.capital AS capital, t0.area_km2 AS area_km2, t0.population AS population, t0.continent AS continent 
FROM countries AS t0 
WHERE EXISTS (SELECT 1 
FROM (SELECT t2.continent AS continent, t2.count AS count 
FROM (SELECT t0.continent AS continent, count(t0.continent) AS count 
FROM countries AS t0 GROUP BY t0.continent) AS t2 ORDER BY t2.count DESC 
 LIMIT %(param_1)s) AS t1 
WHERE t0.continent = t1.continent)) 
 LIMIT %(param_2)s
datapythonista commented 3 years ago

@jreback I'm thinking on moving the topk tutorial to the Impala backend too, since it looks like at least in SQLite and Postgres is not well supported. Does it sound good?