Closed DeNeutoy closed 5 years ago
1) "which ..." vs. "are there..." - this was something we wrestled with during dataset creation. In the end we settled on the idea that if someone asked us "are there any classes...", we would not simply answer "yes"; we would say "yes, there is ...", making the two phrasings equivalent. The argument could definitely be made the other way, though. The data should be consistent with this interpretation.
2) Dangling AND - good catch. We did test all of the queries at some point, so either we broke this afterwards :( or SQL didn't complain. I'll make a fix.
3) sql-only - these can be thought of as default values. The long-term vision was that we would have profiles associated with questions ("this is a question from a 1st year student in 2018") that give context that is necessary for correct SQL generation, but we didn't get to it.
4) 2016 vs 2017 - Hm, I thought we had caught this. The intention was to have the date set at a fixed point in time, with everything being consistent relative to that. I'll add this to the list of known issues and try to get to it. I'll leave this issue open too in the meantime.
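For anyone who wants to screen queries for problems like the dangling AND in point 2, here is a minimal sketch that asks SQLite to parse each query against a schema. The table and column names are made up for illustration and are not the dataset's schema, and SQLite's dialect differs from other engines, so treat this as a rough check rather than the test we actually ran.

```python
import sqlite3


def is_parseable(query: str, schema: str) -> bool:
    """Return True if SQLite can prepare `query` against `schema`."""
    conn = sqlite3.connect(":memory:")
    try:
        # Build the (hypothetical) tables so that name resolution succeeds.
        conn.executescript(schema)
        # EXPLAIN compiles the statement without running it.
        conn.execute("EXPLAIN " + query)
        return True
    except sqlite3.Error:
        # Syntax errors, unknown tables/columns, or a bad schema all land here.
        return False
    finally:
        conn.close()


# Hypothetical schema and queries, for illustration only.
SCHEMA = "CREATE TABLE course (course_id INTEGER, has_lab TEXT);"
good_query = "SELECT course_id FROM course WHERE has_lab = 'N'"
dangling_query = "SELECT course_id FROM course WHERE has_lab = 'N' AND"

print(is_parseable(good_query, SCHEMA))
print(is_parseable(dangling_query, SCHEMA))
```

A parser-based check like this catches more than a regex for a trailing AND would, at the cost of needing a schema that matches the queries.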
Thanks for the comprehensive answer (and sorry for all these issues I'm raising!).
Quite the contrary - thank you for bringing things to our attention!
One of my hopes for this dataset is that it is not static the way many datasets in NLP are. I suspect many other people came across some of the same bugs we saw in GeoQuery, ATIS, etc., but fixing corpus bugs is not a standard part of the academic process, so they didn't get fixed, which is a shame.
I've now fixed the dangling AND and looked into the 2016 vs 2017 question. All the 2017 cases are ones where the question asks about "next Winter", which means Winter 2017. This is not immediately obvious because 'Winter' is listed as a variable in the questions (so it shows up as "next semester0").
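To make the "next semester0" point concrete, here is a small sketch of filling the placeholders back in. The question text, the SQL, the table name, and the `year0` variable are invented for illustration (only `semester0` appears above), and the plain string replacement is an assumption about how one might do this, not the dataset's own tooling.

```python
from typing import Dict


def fill_template(template: str, variables: Dict[str, str]) -> str:
    """Replace each placeholder name in `template` with its value."""
    # Substitute longer names first so that a name such as "semester10"
    # is not partially overwritten by "semester1".
    for name in sorted(variables, key=len, reverse=True):
        template = template.replace(name, variables[name])
    return template


# Hypothetical example; "year0" and "course_offering" are made-up names.
question = "What classes are offered next semester0 ?"
sql = 'SELECT COUNT(*) FROM course_offering WHERE semester = "semester0" AND year = year0'
values = {"semester0": "Winter", "year0": "2017"}

print(fill_template(question, values))
print(fill_template(sql, values))
```

Once the variables are filled in, "next Winter" in the question lines up with 2017 in the query, which is why those cases differ from the 2016 ones.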
In the advising dataset, there are a few examples which raise some questions.
Some of the questions paired with a single SQL query do not have equivalent semantics (no problem if this is just part and parcel of building a dataset; I can imagine there is some noise). One example is the first two questions below: you would expect one to return a list of classes which don't have lab sessions, whereas you would expect the other to return a boolean value. Is this a broad distinction that is not drawn in the standardisation of the datasets, or is this just an error?
This query has a dangling "AND" statement, which I don't think is valid SQL?
What is the significance of "sql-only" variables? Are these variables for which there is only one possible value, or for which there is a default value?
There are some values in the queries which appear to be defaults, but which differ between queries and are not extracted as SQL variables (perhaps this is just an issue with the automated extraction and there's nothing you can do about it). For instance, if you search for "2016" and "2017" in the dataset, they are used interchangeably in queries for questions containing "next year/next semester". Do you have a recommendation for how to treat these?
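For reference, the search I mean is roughly the following sketch, which counts hard-coded year literals across a list of SQL strings. The queries and table name here are invented for illustration; loading the actual dataset file is left out because I don't want to assume its exact layout.

```python
import re
from collections import Counter
from typing import Iterable

# Match 2016 or 2017 only as standalone tokens, not inside longer numbers.
YEAR_PATTERN = re.compile(r"\b(2016|2017)\b")


def count_year_literals(queries: Iterable[str]) -> Counter:
    """Count how often each hard-coded year appears across the queries."""
    counts: Counter = Counter()
    for query in queries:
        counts.update(YEAR_PATTERN.findall(query))
    return counts


# Hypothetical queries, for illustration only.
sample_queries = [
    "SELECT * FROM semester WHERE year = 2016",
    "SELECT * FROM semester WHERE year = 2017",
    "SELECT * FROM semester WHERE year = 2016 AND id = 20170",
]

print(count_year_literals(sample_queries))
```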