jkkummerfeld / text2sql-data

A collection of datasets that pair questions with SQL queries.
http://jkk.name/text2sql-data/

Test leakage in datasets which use cross-validation and `query-split=True` #10

Closed DeNeutoy closed 6 years ago

DeNeutoy commented 6 years ago

I believe there is test set leakage in the case that the dataset requires the use of cross-validation (the Academic, IMDB, Restaurant, and Yelp datasets) and the `args.query_split` flag is set to `True`.

This bug affects the baseline script. The function below is copied verbatim from https://github.com/jkkummerfeld/text2sql-data/blob/master/systems/baseline-template/text2sql-template-baseline.py#L114, with comments added to demonstrate the problem:

def get_tagged_data_for_query(data):

    # In the case that we are using a dataset which uses cross validation,
    # `dataset` will be a numeric value.
    dataset = data['query-split']
    for sent_info in data['sentences']:
        if not args.query_split:
            dataset = sent_info['question-split']

        # args.split specifies what split we should use for cross validation.
        if args.split is not None:
            # Here is the problematic section - on the first iteration,
            # this check does the correct thing, because both "args.split" and
            # `dataset` are numeric values. However, on following iterations,
            # dataset becomes equal to either `train` or `test` (because it is set
            # to either of these values below in the first loop), meaning for
            # all but the first sentence in the query split, it will be incorrect
            # (i.e. the split the datapoint gets assigned to will be "train").
            if str(args.split) == str(dataset):
                dataset = "test"
            else:
                dataset = "train"

        for sql in data['sql']:
            sql_vars = {}
            for sql_var in data['variables']:
                sql_vars[sql_var['name']] = sql_var['example']
            text = sent_info['text']
            text_vars = sent_info['variables']

            yield (dataset, insert_variables(sql, sql_vars, text, text_vars))

            if not args.use_all_sql:
                break

Note that this bug doesn't affect the results for the question split, because in that case `dataset` is re-read from `sent_info['question-split']` at the top of every loop iteration, so it is reset correctly each time.
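
To see why the question-split path is safe, here is a minimal sketch (the function name and harness are invented for illustration; the real script also templates the SQL) showing that re-reading `question-split` from each sentence prevents any carry-over:

```python
def question_split_assignments(data, split):
    assigned = []
    for sent_info in data['sentences']:
        # `dataset` is reset from the sentence on every iteration, so the
        # previous iteration's "train"/"test" value never leaks forward.
        dataset = sent_info['question-split']
        if str(split) == str(dataset):
            assigned.append("test")
        else:
            assigned.append("train")
    return assigned

data = {"sentences": [{"question-split": "0"},
                      {"question-split": "6"},
                      {"question-split": "6"}]}
print(question_split_assignments(data, 6))  # ['train', 'test', 'test']
```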

As an example, the following JSON blob (with `--query-split=True` and `--split=1`) will be split across the test set (first sentence) and the training set (second and third sentences), when it should be entirely in the test set. This is a particularly tricky problem, because some of the sentences that share the same SQL query are literally identical, meaning that the test split essentially contains duplicates of training examples.

    {
        "comments": [],
        "old-name": "",
        "query-split": "1",
        "sentences": [
            {
                "question-split": "0",
                "text": "how many name0 are there in city_name0 ?",
                "variables": {
                    "city_name0": "san francisco",
                    "name0": "buttercup kitchen"
                }
            },
            {
                "question-split": "6",
                "text": "how many name0 are there in city_name0 ?",
                "variables": {
                    "city_name0": "san francisco",
                    "name0": "buttercup kitchen"
                }
            },
            {
                "question-split": "6",
                "text": "how many name0 are there in city_name0 ?",
                "variables": {
                    "city_name0": "san francisco",
                    "name0": "buttercup kitchen"
                }
            }
        ],
        "sql": [
            "SELECT COUNT( * ) FROM LOCATION AS LOCATIONalias0 , RESTAURANT AS RESTAURANTalias0 WHERE LOCATIONalias0.CITY_NAME = \"city_name0\" AND RESTAURANTalias0.ID = LOCATIONalias0.RESTAURANT_ID AND RESTAURANTalias0.NAME = \"name0\" ;"
        ],
        "variables": [
            {
                "example": "san francisco",
                "location": "unk",
                "name": "city_name0",
                "type": "city_name"
            },
            {
                "example": "buttercup kitchen",
                "location": "unk",
                "name": "name0",
                "type": "name"
            }
        ]
    },
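
Running the buggy logic over a blob like the one above makes the leakage concrete. This is a hypothetical reduction of the loop (SQL templating dropped, function name invented), keeping only the split assignment:

```python
def buggy_splits(data, split):
    assigned = []
    # In query-split mode, `dataset` is read once, outside the sentence loop...
    dataset = data['query-split']
    for sent_info in data['sentences']:
        # ...and then overwritten with "test"/"train", so from the second
        # sentence onwards the fold comparison is against the wrong value.
        if str(split) == str(dataset):
            dataset = "test"
        else:
            dataset = "train"
        assigned.append(dataset)
    return assigned

data = {"query-split": "1", "sentences": [{}, {}, {}]}
print(buggy_splits(data, 1))  # ['test', 'train', 'train'] -- leakage
```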

Fortunately, I think this bug actually reinforces the point of the paper: the validation scores will only drop for the datasets this affects, which are, as you note in section 5.4 of your paper:

"For the three datasets developed by the databases community, the effect of question-query split is far less pronounced."

Possibly, this bug is an underlying cause of that observation, given that these three datasets are all ones that use cross-validation.

This is a great resource, by the way - we are currently working on adding it to allennlp to do some type-constrained decoding based on a SQL grammar. If you are interested, we'd be happy to accept PRs to get your baseline into allennlp, where it would be tested, reviewed, etc.

jkkummerfeld commented 6 years ago

Thanks for catching this! I'll definitely work on a fix and follow up (but it might take a few days).

On adding it to allennlp, is there a place I should look for guidance on structuring a contribution? (coding style, etc)

jkkummerfeld commented 6 years ago

Oh, and one quick note - the evaluation for the non-baseline systems was totally independent of this, so those numbers should not be affected by this bug.

DeNeutoy commented 6 years ago

Great! This should be easy to fix - this snippet should do the trick:

        if args.split is not None:
            # No mutation of `dataset` fixes things
            if str(args.split) == str(dataset):
                dataset_to_use = "test"
            else:
                dataset_to_use = "train"
        else:
            dataset_to_use = dataset

        ...
        yield (dataset_to_use, insert_variables(sql, sql_vars, text, text_vars))
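
As a sanity check, the same reduced harness with this fix applied (again a hypothetical sketch, not the real script) keeps every sentence of the query in one fold:

```python
def fixed_splits(data, split):
    assigned = []
    dataset = data['query-split']
    for sent_info in data['sentences']:
        # Writing to a separate variable leaves `dataset` untouched, so
        # every comparison is against the original fold number.
        if str(split) == str(dataset):
            dataset_to_use = "test"
        else:
            dataset_to_use = "train"
        assigned.append(dataset_to_use)
    return assigned

data = {"query-split": "1", "sentences": [{}, {}, {}]}
print(fixed_splits(data, 1))  # ['test', 'test', 'test'] -- no leakage
```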

Re: contributing to allennlp - I will send you details once I have merged some utility functions that will help with reading the data.

jkkummerfeld commented 6 years ago

Thanks, I ended up going with a simpler fix (see PR #11), and it looks like there were no substantial changes in the results.