jkkummerfeld / text2sql-data

A collection of datasets that pair questions with SQL queries.
http://jkk.name/text2sql-data/

How should "sql-only" variables in the Advising dataset be handled? #12

Closed DeNeutoy closed 5 years ago

DeNeutoy commented 6 years ago

In the advising dataset, there are a few examples which raise some questions.

  1. Some of the questions paired with a single SQL query do not have equivalent semantics (no problem if this is just part and parcel of building a dataset; I can imagine there is some noise). An example is the first two questions below: you would expect one to return a list of classes which don't have lab sessions, whereas you would expect the other to return a boolean value. Is this a broad distinction that is not drawn in the standardisation of the datasets, or is this just an error?

  2. The query below ends with a dangling "AND", which I don't think is valid SQL.

  3. What is the significance of "sql-only" variables? Are these variables for which there is only one possible value, or for which there is a default value?

  4. There exist some variables in the query which appear to be default values, but which differ in value and are not extracted as sql variables (perhaps this is just an issue with the automated extraction and there's nothing you can do about it). For instance, if you search for "2016" and "2017" in the dataset, they are used interchangeably for queries containing "next year/next semester". Do you have a recommendation for how to treat these?

{
   "query-split":"dev",
   "sentences":[
      {
         "question-split":"train",
         "text":"What classes do n't have lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"Are there any classes that do n't have lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"As far as labs go , are there any classes without them ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"test",
         "text":"Do any classes not have lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"For classes , which ones do n't have labs ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"If I do n't want to have a lab session , which classes should I take ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"test",
         "text":"Is there a list of classes that do n't have lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"List classes without lab sessions .",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"test",
         "text":"Of these classes , which do n't have lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"What classes do not have sessions in the lab ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"test",
         "text":"What classes have no lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"What courses are not in the lab ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"Which classes do n't have any labs ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"Which classes do not have lab components ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"Which classes do not require a lab session ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"Which of the classes do n't have lab sessions ?",
         "variables":{
            "department0":""
         }
      }
   ],
   "sql":[
      "SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 WHERE COURSEalias0.DEPARTMENT = \"department0\" AND COURSEalias0.HAS_LAB = \"N\" AND ;"
   ],
   "variables":[
      {
         "example":"EECS",
         "location":"sql-only",
         "name":"department0",
         "type":"department"
      }
   ]
}

jkkummerfeld commented 6 years ago

1) "which ..." vs. "are there ..." - this was something we wrestled with in the dataset creation. In the end we settled on the idea that if someone asked us "are there any classes ...", we would not simply answer 'yes', we would say 'yes, there is ...', making these equivalent. The argument could definitely be made the other way though. The data should be consistent on this interpretation.

2) Dangling AND - good catch, we did test all of the queries at some point, so either we broke this after that :( or SQL didn't complain. I'll make a fix.

3) sql-only - these can be thought of as default values. The long-term vision was that we would have profiles associated with questions ("this is a question from a 1st year student in 2018") that give context that is necessary for correct SQL generation, but we didn't get to it.

4) 2016 vs 2017 - Hm, I thought we had caught this. The intention was to have the date set at a fixed point in time, with everything being consistent relative to that. I'll add this to the list of known issues and try to get to it. I'll leave this issue open too in the meantime.
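
To make point 3 concrete, here is a minimal sketch of filling in variables when a question leaves one empty, as the "sql-only" `department0` does in the entry above. The `fill_variables` helper is hypothetical and is not part of the repository's tooling. Using the `example` field as the default is an assumption made for illustration, and the sample query is the one from the issue with the trailing AND dropped.

```python
def fill_variables(entry, sentence):
    """Replace variable placeholders in one question and its first SQL query."""
    text = sentence["text"]
    sql = entry["sql"][0]
    # Longest names first, so a short placeholder name cannot clobber a
    # longer name that happens to contain it.
    for var in sorted(entry["variables"], key=lambda v: -len(v["name"])):
        name = var["name"]
        value = sentence["variables"].get(name, "")
        if not value:
            # Assumption: treat the "example" field as the default value.
            value = var["example"]
        text = text.replace(name, value)
        sql = sql.replace(name, value)
    return text, sql


# A cut-down copy of the entry quoted in this issue.
entry = {
    "sentences": [
        {
            "text": "List classes without lab sessions .",
            "variables": {"department0": ""},
        }
    ],
    "sql": [
        'SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER '
        'FROM COURSE AS COURSEalias0 '
        'WHERE COURSEalias0.DEPARTMENT = "department0" '
        'AND COURSEalias0.HAS_LAB = "N" ;'
    ],
    "variables": [
        {
            "example": "EECS",
            "location": "sql-only",
            "name": "department0",
            "type": "department",
        }
    ],
}

text, sql = fill_variables(entry, entry["sentences"][0])
print(text)
print(sql)
```

The question text comes back unchanged because `department0` never appears in it, while the SQL gets the default department filled in. A per-question profile, as described in point 3, would replace the fallback branch.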

DeNeutoy commented 6 years ago

Thanks for the comprehensive answer (and sorry for all these issues I'm raising!).

jkkummerfeld commented 6 years ago

Quite the contrary - thank you for bringing things to our attention!

One of my hopes for this dataset is that it is not static the way many in NLP are. I suspect many other people came across some of the same bugs we saw in GeoQuery, ATIS, etc., but fixing corpus bugs is not a standard part of the academic process, so they didn't get fixed, which is a shame.

jkkummerfeld commented 5 years ago

I've now fixed the dangling AND and looked into the 2016 vs. 2017 question. All the 2017 cases are when the query asks about "next Winter", which means Winter 2017. It's not immediately clear that this is the case because 'Winter' is listed as a variable in the questions (so it shows up as "next semester0").
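
For anyone who wants to re-check these two points on their own copy of the data, a rough sketch of a scan over SQL strings follows. The helper names are hypothetical. The regular expressions are a heuristic rather than a SQL parser, so an operator inside a string literal could produce a false positive. The first query is the one quoted in this issue; the other two are made-up strings for illustration only.

```python
import re
from collections import Counter

# A boolean operator followed directly by ';', ')' or the end of the string.
DANGLING = re.compile(r"\b(?:AND|OR)\s*(?:;|\)|$)", re.IGNORECASE)
# Four-digit year literals of the form 20xx.
YEAR = re.compile(r"\b20\d{2}\b")


def has_dangling_operator(sql):
    """Return True if an AND/OR has nothing after it."""
    return DANGLING.search(sql.strip()) is not None


def year_counts(queries):
    """Count how often each 20xx literal appears across the queries."""
    counts = Counter()
    for sql in queries:
        counts.update(YEAR.findall(sql))
    return counts


queries = [
    # Quoted from the issue (before the fix).
    'SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER '
    'FROM COURSE AS COURSEalias0 '
    'WHERE COURSEalias0.DEPARTMENT = "department0" '
    'AND COURSEalias0.HAS_LAB = "N" AND ;',
    # Made-up strings, only to exercise the year count.
    'SELECT 1 WHERE YEAR = 2016 AND SEMESTER = "semester0" ;',
    'SELECT 1 WHERE YEAR = 2017 AND SEMESTER = "semester0" ;',
]

for sql in queries:
    print(has_dangling_operator(sql), sql[-30:])
print(year_counts(queries))
```

Queries with a flagged operator or an unexpected year can then be inspected by hand, keeping in mind that 2017 is expected wherever the question refers to the following Winter.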