All the models were trained on the standard splits and only on train. For the datasets that provide multiple folds (SQA, WTQ) we used the first one. The respective dev sets were used for hyper-parameter selection / tuning and the test sets were only used to report the numbers in the paper.
It's also surprising that the models would be good at retrieval since they have been trained with the QA objective.
Ah, well, the QA models themselves are not good at retrieval. I took the base model, put a retrieval objective on top instead of QA, and fine-tuned the retriever from the SQA base.
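To make "a retrieval objective on top" a bit more concrete, here is a minimal sketch of that kind of head: a dual encoder trained with in-batch negatives. The layer below is illustrative only (TF2/Keras, taking the shared encoder's pooled outputs as inputs), not the exact code in my branch of the TaPas codebase.

```python
import tensorflow as tf

class RetrievalHead(tf.keras.layers.Layer):
    """In-batch-negatives retrieval objective on top of a shared encoder.

    The encoder's pooled ([CLS]) outputs for the question and the table are
    projected into a common space; each question is scored against every
    table in the batch and trained to rank its own table first.
    """

    def __init__(self, dim=256):
        super().__init__()
        self.proj = tf.keras.layers.Dense(dim, name="retrieval_projection")

    def call(self, question_repr, table_repr):
        # question_repr / table_repr: [batch, hidden] pooled encoder outputs.
        q_emb = tf.math.l2_normalize(self.proj(question_repr), axis=-1)
        t_emb = tf.math.l2_normalize(self.proj(table_repr), axis=-1)

        # Similarity of every question to every table in the batch.
        logits = tf.matmul(q_emb, t_emb, transpose_b=True)  # [batch, batch]

        # The matching table for question i sits at column i.
        labels = tf.range(tf.shape(logits)[0])
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                labels, logits, from_logits=True))
        return logits, loss
```

In-batch softmax is just the simplest choice of negatives here; the gradients also flow back into the shared encoder, which is what fine-tuning the retriever from the SQA base amounts to.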
I have checked quite a few points on my end that I thought could be a problem and made sure there is no chance of test data coming in. The only thing still up in the air is the masked-LM pre-training: could it have seen and memorized question/table pairs from the test set? Or is it possible that even just the test-set tables made their way into the pre-training cycle?
Anyway, I see that you've released the pre-training data so I'll download that and do some checks on my end.
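A simple version of that check would be to normalize and hash the cell contents of every pre-training table, do the same for the WTQ test tables, and intersect the fingerprints. A rough sketch (parsing the released pre-training format and the WTQ tables into lists of rows is left out here and would need to be filled in):

```python
import hashlib

def table_fingerprint(table):
    """Hash the normalized cell contents of a table.

    `table` is assumed to be a list of rows, each a list of cell strings;
    converting the released pre-training data and the WTQ tables into this
    shape is a separate step.
    """
    cells = [" ".join(cell.lower().split()) for row in table for cell in row]
    return hashlib.sha1("|".join(cells).encode("utf-8")).hexdigest()

def shared_tables(pretrain_tables, test_tables):
    """Return the fingerprints that occur in both collections."""
    pretrain = {table_fingerprint(t) for t in pretrain_tables}
    return {table_fingerprint(t) for t in test_tables} & pretrain
```

Exact-match hashing will miss near-duplicates (re-crawled or slightly re-formatted versions of the same table), so a fuzzier comparison on headers plus a few sample rows may be worth adding on top.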
Thanks again for the prompt responses!
During LM pre-training we don't use the SQA question so there is no risk of over-fitting / leaking there. (We use text snippets occurring on the page as a placeholder for the question.)
There is a good chance that the test tables will be seen at pre-training time since we train on all Wikipedia tables. I think that shouldn't be an issue, though.
Hey @nmallinar, would you mind explaining how you implemented the retriever in more detail? Thanks!
Hi!
I have worked a bit on building a retrieval setup into the TaPas codebase. Starting from the SQA base, I fine-tuned a model on WikiTableQuestions random-split-1-train to perform retrieval, and I evaluate on random-split-1-dev and on the pristine-unseen-tables test set, retrieving from the set of all tables.
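For the evaluation itself, retrieval over the full table pool boils down to ranking every table for each question and reporting recall@k. A minimal sketch of that metric (the question-by-table score matrix and the ID lists are placeholders for however the scores are produced and logged):

```python
import numpy as np

def recall_at_k(scores, gold_table_ids, table_ids, k=1):
    """Fraction of questions whose gold table is ranked in the top k.

    scores: [num_questions, num_tables] similarity matrix.
    gold_table_ids: the correct table id for each question.
    table_ids: the table id corresponding to each column of `scores`.
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = [
        gold in {table_ids[j] for j in row}
        for gold, row in zip(gold_table_ids, top_k)
    ]
    return float(np.mean(hits))
```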
The results are quite good, but when I started analyzing specific questions where a standard BERT model failed to retrieve the right table while TaPas succeeded, I found that TaPas almost always finds the correct table even for incredibly vague test-set questions (e.g. "Who came first?", which should plausibly match any table about scoring or sports with high confidence, yet somehow lands on exactly the right table).
What I also notice is that the score distribution across all tables is extremely sharp: TaPas finds the right table with very high confidence almost every time (see the sketch below for one way to quantify this). I logged all of the query IDs and table IDs in the new *.tfrecord files that I created to train my retrieval model, and I am positive there is no test-set data leak on my end. I also checked that the SQA datasets respect the data splits of the original WTQ dataset. That leaves two questions:
1) Was the SQA train + test data mixed and used in re-training the final SQA Base model that is released?
2) Is it possible that query-table pairs from the pristine-unseen-tables test set found their way into the pre-training cycle?
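For reference, a quick way to quantify the sharpness mentioned above, assuming the logged scores can be loaded into a question-by-table matrix:

```python
import numpy as np

def score_sharpness(scores):
    """Mean top-1 softmax probability and mean entropy per question.

    scores: [num_questions, num_tables] retrieval scores. A sharp
    distribution shows up as top-1 probability near 1 and entropy near 0.
    """
    shifted = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    top1 = probs.max(axis=1).mean()
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1).mean()
    return top1, entropy
```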
Thanks a lot for your help!