jkkummerfeld / text2sql-data

A collection of datasets that pair questions with SQL queries.
http://jkk.name/text2sql-data/

The number of data does not match that in the paper #51

Closed LIANGQINGYUAN closed 3 years ago

LIANGQINGYUAN commented 3 years ago

The paper at https://www.aclweb.org/anthology/P17-1089.pdf reports 880 questions for GeoQuery, 5418 for ATIS and 816 for Scholar.

But the counts for your datasets are all slightly different. See the Question Count column of Table 2 in "Improving Text-to-SQL Evaluation Methodology": https://www.aclweb.org/anthology/P18-1033.pdf

Why is that? Thank you!

LIANGQINGYUAN commented 3 years ago

Oh, I see you applied a deduplication process.

But why does the number of questions in Scholar increase from 816 to 817?

jkkummerfeld commented 3 years ago

That's right, deduplication accounts for most of the differences. As for 816 vs. 817, I think that is a mistake in the original paper. Their data is here:

Dev (100) - https://github.com/sriniiyer/nl2sql/blob/master/data/scholar/scholar_dev.nl
Test (218) - https://github.com/sriniiyer/nl2sql/blob/master/data/scholar/scholar_test.nl
Train (499) - https://github.com/sriniiyer/nl2sql/blob/master/data/scholar/scholar_train.nl

Which adds up to 817.

It's also possible that something changed between when they wrote the paper and when they finalised the data.
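For anyone who wants to re-check the split sizes quoted above, here is a minimal sketch. It assumes the three `.nl` files have been downloaded into a local directory and that each holds one question per line; neither assumption is confirmed in this thread. The helper names `count_questions` and `count_splits` are hypothetical, not part of either repository.

```python
from pathlib import Path

# Split sizes as quoted in this thread for the Scholar data.
REPORTED_SPLITS = {"dev": 100, "test": 218, "train": 499}


def count_questions(path):
    """Count non-empty lines (assumption: one question per line)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def count_splits(data_dir, prefix="scholar"):
    """Return {split: count} for files named <prefix>_<split>.nl in data_dir."""
    counts = {}
    for split in REPORTED_SPLITS:
        path = Path(data_dir) / f"{prefix}_{split}.nl"
        counts[split] = count_questions(path)
    return counts


# The arithmetic from the thread: 100 + 218 + 499.
reported_total = sum(REPORTED_SPLITS.values())
print(reported_total)  # 817
```

Calling `count_splits("path/to/scholar")` on a local copy and comparing the result with `REPORTED_SPLITS` would show whether the files still match these numbers.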

LIANGQINGYUAN commented 3 years ago

OK, I got it. Thank you for your reply!