jkkummerfeld / text2sql-data

A collection of datasets that pair questions with SQL queries.
http://jkk.name/text2sql-data/
Other
534 stars 105 forks source link

Adding data from Spider #19

Closed jkkummerfeld closed 5 years ago

jkkummerfeld commented 5 years ago

Modifying data from https://yale-lily.github.io/spider to be in our format.

nweir127 commented 5 years ago

Any updates on this thread?

jkkummerfeld commented 5 years ago

Not yet. We won't be able to get to it until after the NAACL deadline (December 10th) at least - too many other projects keeping us busy!

Note also that we will not be able to release their test set (since it is kept private), so it will remain necessary to run using their evaluation system.

jkkummerfeld commented 5 years ago

This is now in the data development branch:

https://github.com/jkkummerfeld/text2sql-data/tree/data-development

Please let us know if you spot any issues with the conversion!

nweir127 commented 5 years ago

do you mind uploading the modified canonicaliser for spider?

jkkummerfeld commented 5 years ago

Done, it's in the tools directory. The changes are somewhat hacky, but get the job done.

nweir127 commented 5 years ago

looks like a lot of desc limit 1s got anonymized to desc limit var0. Is this by design to account for other limit numbers? It only seems to have happened for the spider dataset.

jkkummerfeld commented 5 years ago

Thanks! That was an oversight in variable identification. For spider I did it automatically, making all numbers variables, whereas in the others I did it partially manually (since there were a small enough number of queries to make checking feasible).

I've pushed a commit that fixes this.

nweir127 commented 5 years ago

Another quick question-- in your canonicalisation, if a column name matches its table name, should it also receive an alias?

e.g. SELECT institution , LOCATION FROM institution WHERE founded > 1990 AND TYPE = 'Private' became SELECT INSTITUTIONalias1.INSTITUTION AS INSTITUTIONalias0 , INSTITUTIONalias1.LOCATION FROM INSTITUTION AS INSTITUTIONalias1 WHERE INSTITUTIONalias1.FOUNDED > var0 AND INSTITUTIONalias1.TYPE = \"var1\" ;

jkkummerfeld commented 5 years ago

No, that's also a bug, now fixed.

Looking at the previous datasets, it was extremely rare to have a table and field with the same name, and they were often in rarely used parts of the schema.

nweir127 commented 5 years ago

also looks like the wrong aliases are chosen at times-- table names are identified as other tables' columns with the same name.

SELECT name FROM physician EXCEPT SELECT T2.name FROM appointment AS T1 JOIN physician AS T2 ON T1.Physician = T2.EmployeeID

SELECT PHYSICIANalias0.NAME FROM APPOINTMENTalias0.PHYSICIAN AS PHYSICIANalias0 SELECT PHYSICIANalias1.NAME FROM APPOINTMENT AS APPOINTMENTalias0 JOIN PHYSICIAN AS PHYSICIANalias1 ON PHYSICIANalias1.EMPLOYEEID = APPOINTMENTalias0.PHYSICIAN ;

jkkummerfeld commented 5 years ago

Hm, that's going to be trickier to resolve. I'll try to investigate it sometime this week.

jkkummerfeld commented 5 years ago

@nweir127 it took a little longer to get to than expected, but I think I've resolved this (committed to data-development in e76b98b1a79b7a28ec5df0e895e4c9e6ffe9cb84 ).

The problem was that we didn't handle EXCEPT SELECT and INTERSECT SELECT correctly (not something that has come up in other datasets).

jkkummerfeld commented 5 years ago

Closing as we have now merged in the branch that contains Spider.