Hi,
According to the paper, the dataset should contain 140,461 questions. However, the cleaned dataset contains 500 more than that. Why is that? Is there any further cleaning that should be applied to the dataset? This is the code I'm using to arrive at that number:
questions = []
for _f in ['train.txt', 'test.txt', 'val.txt']:
    with open(_f) as fh:
        for line in fh:
            p_line = line.strip().split("|||")
            question = p_line[1].strip()
            questions.append(question)
print(len(questions))
>> 140961
Using a set instead of a list (i.e., removing duplicate questions), the count is 140,416.
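For reference, this is a minimal sketch of the set-based deduplication I mean, assuming the `|||`-delimited format above with the question in the second field (the sample lines here are made up for illustration):

```python
def count_unique_questions(lines):
    """Count distinct questions in "|||"-delimited lines (question is field 2)."""
    questions = set()
    for line in lines:
        parts = line.strip().split("|||")
        if len(parts) > 1:
            questions.add(parts[1].strip())
    return len(questions)

# In-memory lines standing in for the contents of the split files:
sample = [
    "id1 ||| What is X? ||| answer a",
    "id2 ||| What is X? ||| answer b",  # duplicate question, counted once
    "id3 ||| What is Y? ||| answer c",
]
print(count_unique_questions(sample))  # prints 2
```

Running the same logic over train/test/val is what yields 140,416 on my copy of the data.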