jkkummerfeld / text2sql-data

A collection of datasets that pair questions with SQL queries.
http://jkk.name/text2sql-data/

Anonymised Variables should have consistent naming corresponding to their column #25

Open DeNeutoy opened 5 years ago

DeNeutoy commented 5 years ago

It's a little annoying that the anonymised variable names sometimes, but not always, correspond to the table/column name they come from. E.g., in some datasets, like academic, the variable name is derived from the column name:

        "sql": [
            "SELECT JOURNALalias0.HOMEPAGE FROM JOURNAL AS JOURNALalias0 WHERE JOURNALalias0.NAME = \"journal_name0\" ;"
        ],
        "variables": [
            {
                "example": "PVLDB",
                "location": "both",
                "name": "journal_name0",
                "type": "journal_name"
            }
        ]

whereas in geography, variables have generic names such as var0, and the column they come from cannot be inferred directly from either the name or the type key.

        "sql": [
            "SELECT CITYalias0.CITY_NAME FROM CITY AS CITYalias0 WHERE CITYalias0.POPULATION = ( SELECT MAX( CITYalias1.POPULATION ) FROM CITY AS CITYalias1 WHERE CITYalias1.STATE_NAME = \"var0\" ) AND CITYalias0.STATE_NAME = \"var0\" ;"
        ],
        "variables": [
            {
                "example": "arizona",
                "location": "both",
                "name": "var0",
                "type": "state"
            }
        ]
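To make the inconsistency easy to spot, here is a minimal sketch that flags variables whose name is not simply their type followed by a numeric index. The naming convention being checked (`<type><index>`) is an assumption based on the academic example above, and `name_matches_type` and `inconsistent_names` are hypothetical helper names, not part of the repository.

```python
import re

# The two variable entries quoted in this issue.
academic_var = {
    "example": "PVLDB",
    "location": "both",
    "name": "journal_name0",
    "type": "journal_name",
}
geography_var = {
    "example": "arizona",
    "location": "both",
    "name": "var0",
    "type": "state",
}


def name_matches_type(variable):
    """Return True if the name is the type followed by a numeric index.

    Assumed convention: name == type + digits, e.g. journal_name0.
    """
    pattern = re.escape(variable["type"]) + r"\d+"
    return re.fullmatch(pattern, variable["name"]) is not None


def inconsistent_names(variables):
    """List the names of variables that do not follow the assumed convention."""
    return [v["name"] for v in variables if not name_matches_type(v)]


flagged = inconsistent_names([academic_var, geography_var])
```

Running this over the `variables` list of every entry in a dataset file would give a quick list of the names that would need changing.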

jkkummerfeld commented 5 years ago

I've worked on addressing this in #27; have a look and let me know what you think!

jkkummerfeld commented 5 years ago

Hm, though having something that directly maps to the DB for all cases is trickier. More thought required.

DeNeutoy commented 5 years ago

Hmm, yeah, I also found this once I dug into it more. E.g., the limit0 variables in the scholar dataset are really a function of the query rather than particular to the database. What you've done for geography looks like an improvement though!
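For illustration, a rough sketch of one way such a renaming could treat the two kinds of variable differently. It is not necessarily what #27 does. `rename_variables` and its `column_types` parameter are hypothetical: the caller supplies the set of types that correspond to database columns, and anything else (for example a query-level variable like limit0) is left untouched. Only the `sql` and `variables` keys shown earlier in this thread are handled. Any other place the names appear would need the same substitution, and the sketch does not guard against a new name colliding with an existing one.

```python
import re


def rename_variables(entry, column_types):
    """Rename variables to <type><index> when their type is in column_types.

    Variables whose type is not listed are assumed to be query-level
    and keep their current name.
    """
    counters = {}
    mapping = {}
    for var in entry["variables"]:
        if var["type"] not in column_types:
            continue  # not tied to a column; leave unchanged
        index = counters.get(var["type"], 0)
        counters[var["type"]] = index + 1
        mapping[var["name"]] = f'{var["type"]}{index}'
    if not mapping:
        return entry

    # Match whole tokens only, so that var1 is never replaced inside var10.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    new_sql = [pattern.sub(lambda m: mapping[m.group(1)], q) for q in entry["sql"]]
    new_vars = [dict(v, name=mapping.get(v["name"], v["name"])) for v in entry["variables"]]
    return dict(entry, sql=new_sql, variables=new_vars)


# The geography example quoted at the top of this issue.
geography_entry = {
    "sql": [
        'SELECT CITYalias0.CITY_NAME FROM CITY AS CITYalias0 WHERE CITYalias0.POPULATION = '
        '( SELECT MAX( CITYalias1.POPULATION ) FROM CITY AS CITYalias1 '
        'WHERE CITYalias1.STATE_NAME = "var0" ) AND CITYalias0.STATE_NAME = "var0" ;'
    ],
    "variables": [
        {"example": "arizona", "location": "both", "name": "var0", "type": "state"}
    ],
}

renamed = rename_variables(geography_entry, column_types={"state"})
```

Passing the set of column-derived types explicitly keeps the decision about what counts as a database value outside the renaming logic, which seems to be the tricky part discussed here.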

jkkummerfeld commented 5 years ago

I've merged that for now, but will keep this open as a reminder that this issue requires more work. My thinking is that I could do the following:

That would be an improvement over the current state, though it would also be a fair amount of work.