nyu-dl / dl4ir-searchQA

BSD 3-Clause "New" or "Revised" License

Extra questions on the cleaned dataset #3

Open ArthurCamara opened 5 years ago

ArthurCamara commented 5 years ago

Hi all, according to the paper, the dataset should contain 140,461 questions. However, the cleaned dataset contains 140,961, which is 500 more than reported. Why is that? Is there any further cleaning that needs to be done on the dataset? This is the code I'm using to get this number:

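questions = []
for _f in ['train.txt', 'test.txt', 'val.txt']:
    # Use a context manager so each file is closed after reading.
    with open(_f) as f:
        for line in f:
            # Fields are separated by "|||"; the question is the second field.
            p_line = line.strip().split("|||")
            question = p_line[1].strip()
            questions.append(question)
print(len(questions))
>> 140961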

Using a set instead of a list (which removes duplicate questions), the number is 140,416, which is 45 fewer than the 140,461 reported in the paper.
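To make the comparison easier to reproduce, here is a minimal sketch that reports the total count, the unique count, and the repeated questions in one pass. The helper name `count_questions` and its `sep` and `field` parameters are my own and not part of the repository. It assumes the files are UTF-8 and that the question is the second `|||`-separated field, as in the snippet above. Unlike that snippet, it skips lines with too few fields instead of raising an error.

```python
from collections import Counter


def count_questions(paths, sep="|||", field=1):
    """Return (total, unique, duplicates) for the question field across files.

    Hypothetical helper: `sep` and `field` mirror the format assumed in the
    original snippet (question is the second "|||"-separated field).
    """
    counts = Counter()
    for path in paths:
        # Assumption: the dataset files are UTF-8 encoded.
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split(sep)
                if len(parts) <= field:
                    # Skip empty or malformed lines rather than failing.
                    continue
                counts[parts[field].strip()] += 1

    total = sum(counts.values())   # same as len() of the list version
    unique = len(counts)           # same as len() of the set version
    # Questions that occur more than once, with their occurrence counts.
    duplicates = {q: n for q, n in counts.items() if n > 1}
    return total, unique, duplicates
```

Calling `count_questions(['train.txt', 'test.txt', 'val.txt'])` and inspecting the returned `duplicates` dictionary should show which questions are repeated and how often, which may help explain the gap between the list and set counts.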