Closed yairf11 closed 5 years ago
Thank you for pointing this out! We just took a look at the shortest questions and indeed found the same broken questions you mentioned.
For now, I suggest sorting the examples by question length and removing the shortest 100 before training your model. We will release the next version of the training set very soon, with these examples excluded.
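A minimal sketch of that workaround, assuming the training set is a JSON list of example dicts each with a `question` field (the tiny inline sample here is illustrative; in practice you would load the full training file):

```python
def drop_shortest_questions(examples, n=100):
    # Sort examples by question length; the broken, near-empty
    # questions cluster at the short end, so drop the n shortest.
    return sorted(examples, key=lambda ex: len(ex["question"]))[n:]

# Tiny illustration (the real input is the full training set list):
sample = [
    {"question": "w"},
    {"question": "DRM"},
    {"question": "Which teams did Jimmy Butler play for?"},
]
print(len(drop_shortest_questions(sample, n=2)))  # → 1
```

The cutoff of 100 is a heuristic from the comment above, not a guarantee that exactly the broken examples are removed.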
We just released v1.1 here: http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json
It removes 117 questions of this kind from training set v1.
I also found an example in the dev set:
{
  "_id": "5ae61bfd5542992663a4f261",
  "answer": "swingman",
  "question": "Which teams did Jimmy Butler play and what role did he play on these teams?",
  "supporting_facts": [
    ["Shooting guard", 4],
    ["Shooting guard", 5],
    ["Jimmy Butler (basketball)", 0],
    ["Jimmy Butler (basketball)", 902]
  ]
}
Note that 902 is an out-of-range sentence index: the document has no sentence at that position.
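Out-of-range supporting facts like this one can be detected automatically. A minimal sketch, assuming the standard HotpotQA layout where `context` is a list of `[title, sentences]` pairs and `supporting_facts` is a list of `[title, sentence_index]` pairs:

```python
def find_invalid_supporting_facts(example):
    # Map each context title to its number of sentences, then flag
    # any supporting fact whose title is missing or whose sentence
    # index falls outside the document.
    sent_counts = {title: len(sents) for title, sents in example["context"]}
    return [
        (title, idx)
        for title, idx in example["supporting_facts"]
        if title not in sent_counts or idx >= sent_counts[title]
    ]

# Minimal illustration mirroring the dev example above
# (the two placeholder sentences stand in for the real article text):
ex = {
    "context": [["Jimmy Butler (basketball)", ["s0", "s1"]]],
    "supporting_facts": [
        ["Jimmy Butler (basketball)", 0],
        ["Jimmy Butler (basketball)", 902],
    ],
}
print(find_invalid_supporting_facts(ex))  # → [('Jimmy Butler (basketball)', 902)]
```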
Hi,
I have been looking through your datasets and found something odd: the training set contains questions that seem broken or missing. For example, sample id
5a775ea9554299373536024d
holds the question 'w', and sample id
5a81265c5542995ce29dcbca
holds the question 'DRM'. There are several more. The easiest way to find these examples is to sort the questions in the training set by length and look at the shortest ones. A simple workaround could be to discard all questions with no question mark, but this eliminates 2322 samples, some of them perfectly good questions.
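That question-mark filter can be sketched as follows, assuming each example is a dict with `_id` and `question` keys (the second id below is a made-up placeholder for a valid example):

```python
def questions_without_qmark(examples):
    # Crude heuristic: flag questions lacking a question mark.
    # This catches the broken entries, but as noted above it also
    # flags some perfectly good imperative-style questions.
    return [ex for ex in examples if "?" not in ex["question"]]

sample = [
    {"_id": "5a775ea9554299373536024d", "question": "w"},
    {"_id": "placeholder-id", "question": "Who directed the film?"},
]
print([ex["_id"] for ex in questions_without_qmark(sample)])
# → ['5a775ea9554299373536024d']
```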
Are you aware of this? Thanks!