jkkummerfeld / text2sql-data

A collection of datasets that pair questions with SQL queries.
http://jkk.name/text2sql-data/
Other
538 stars 106 forks source link

imdb has a single example which is not valid SQL #18

Closed DeNeutoy closed 6 years ago

DeNeutoy commented 6 years ago

https://github.com/jkkummerfeld/text2sql-data/blob/master/data/imdb.json#L2442 I think:

SELECT
  MAX(*) # I'm not sure if this is valid?
FROM
  (
    SELECT
      COUNT(DISTINCT (MOVIEalias0.TITLE))
    FROM
      ACTOR AS ACTORalias0,
      CAST AS CASTalias0,
      MOVIE AS MOVIEalias0
    WHERE
      ACTORalias0.NAME = "actor_name0" # Note no AND here
      CASTalias0.AID = ACTORalias0.AID
      AND MOVIEalias0.MID = CASTalias0.MSID 
      GROUP BY MOVIEalias0.RELEASE_YEAR ) AS DERIVED_TABLEalias0 ;

Should be:

SELECT
 # I'm not sure about this part, maybe the original is just correct
 # if you don't specify an alias for the COUNT column
  MAX(DERIVED_TABLEalias0.COUNT_alias0)
FROM
  (
    SELECT
      COUNT(DISTINCT (MOVIEalias0.TITLE)) AS COUNT_alias0
    FROM
      ACTOR AS ACTORalias0,
      CAST AS CASTalias0,
      MOVIE AS MOVIEalias0
    WHERE
      ACTORalias0.NAME = "actor_name0" `AND`
      CASTalias0.AID = ACTORalias0.AID AND
      MOVIEalias0.MID = CASTalias0.MSID
    GROUP BY MOVIEalias0.RELEASE_YEAR ) 
AS DERIVED_TABLEalias0 ;
DeNeutoy commented 6 years ago

FYI The reason i'm finding these tiny things is because i'm trying to write a grammar which parses all of the datasets - this PR should be fairly self explanatory:

https://github.com/allenai/allennlp/pull/1750/files

Currently my coverage stats look like: academic: 100% advising : 57.8% atis : 93% geography : 99.5% imdb: 99.2% restaurants : 100% yelp : 100% scholar: 98.7%

jkkummerfeld commented 6 years ago

Ah, I see. That could be really useful, e.g. for one project we discussed but didn't have time for in which structured output is produced by a decoder constrained by an SQL grammar. Also, looking at your code I see the note that JOINs may be hard, which I agree with :)

Looks like this problem stems from bugs in the original data:

SELECT max(*)
FROM
  ( SELECT count(distinct(movie_0.title))
   FROM actor AS actor_0,
        CAST AS cast_0,
        movie AS movie_0
   WHERE actor_0.name = " Brad Pitt "
      actor_0.aid = cast_0.aid
     AND cast_0.msid = movie_0.mid
   GROUP BY movie_0.release_year ) AS subquery ;

I've applied your fix here https://github.com/jkkummerfeld/text2sql-data/commit/97fbaef5d5e27085b635bfb880578b5862622b10

DeNeutoy commented 6 years ago

Right, that's what we're planning on doing with it 👍. We're definitely planning on making several of these text2sql datasets part of a "semantic parsing framework" in allennlp, which will be released sometime before the end of the year. So thank you very much for being so open and communicative!

jkkummerfeld commented 6 years ago

Great! Please feel free to pull data from the data-development branch for now and when your release date is approaching let me know and we will merge data-development into master as a new release.