Extra questions on the cleaned dataset

Hi all, according to the paper, the dataset should contain 140,461 questions. However, the cleaned dataset contains 140,961, which is 500 more than reported. Why is that? Does the dataset need any further cleaning? This is the code I used to get this number:

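questions = []
for _f in ['train.txt', 'test.txt', 'val.txt']:
    # Open each split with a context manager so the file is closed after reading
    with open(_f) as fh:
        for line in fh:
            # Fields are separated by "|||"; the question is the second field
            p_line = line.strip().split("|||")
            question = p_line[1].strip()
            questions.append(question)
print(len(questions))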
>> 140961

Using a set instead of a list (which removes duplicate questions) gives 140,416, which still does not match the paper's figure.
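To make the comparison easier to reproduce, here is a minimal sketch that reports the total count, the unique count, and which questions repeat, all in one pass. It assumes the same layout as my snippet above (fields separated by "|||", question in the second field); the function name `count_questions` and its `sep` and `field` parameters are my own, not something from the repository. Unlike the snippet above, it skips lines that lack the question field instead of raising an error.

```python
from collections import Counter


def count_questions(paths, sep="|||", field=1):
    """Return (total, unique, duplicates) for the question field across files.

    Assumption: each line holds sep-delimited fields and the question is at
    index `field`, as in the snippet in this post.
    """
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                parts = line.strip().split(sep)
                if len(parts) <= field:
                    continue  # blank or malformed line: no question field
                counts[parts[field].strip()] += 1
    total = sum(counts.values())
    # Questions that occur more than once, mapped to their occurrence count
    duplicates = {q: n for q, n in counts.items() if n > 1}
    return total, len(counts), duplicates


# Example call (file names as in the post):
# total, unique, dups = count_questions(["train.txt", "test.txt", "val.txt"])
# print(total, unique, len(dups))
```

Running it on each split separately as well as on all three together should show whether the repeated questions sit inside one split or are shared between train, test, and val.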

nyu-dl / dl4ir-searchQA
