Start_index - Githubissues

Marwa199527 commented 3 years ago

Thank you @kushalj001 for your great work, it really helped me. I have a few questions and I really hope you can reply as soon as possible. I am using an Arabic dataset in SQuAD format, and almost 2300 answers were dropped because they have a wrong start index, which is a lot. 1) How can I fix the start index in my data? 2) I did not understand how your function found these errors in the start index. 3) I know that you explained why this problem happens, but I did not really understand the cause of it. Please answer the above questions, I would be really grateful.

kushalj001 commented 3 years ago

I'll show the cause of the error with an example. Note that this is specific to the English dataset and the tokenizer that I am using in my notebooks. I am using the spaCy tokenizer, which does not do anything fancy. Take an erroneous example from the English dataset that was consequently dropped before training. If you look carefully, the answer to the question given the context is 1854. The ground truth character span captures that perfectly, i.e. context[92:96] = 1854. However, after the tokenization process, the calculated tokens do not capture 1854 as a separate token because it is immediately followed by a hyphen. Hence the end index of our calculated span does not match the end of the ground truth span. Most of the errors are due to issues like these.
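A minimal sketch of this kind of mismatch, in case it helps. It is not the code from the notebooks: the context sentence is made up, and a plain whitespace tokenizer stands in for spaCy (both are assumptions for illustration). The check passes only when the answer's character start and end coincide with token boundaries.

```python
import re

def tokenize_with_offsets(text):
    """Whitespace tokenizer (a stand-in, not spaCy): returns (token, start, end)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(context, answer_start, answer_text):
    """Return (first_token_idx, last_token_idx) if the character span lines up
    exactly with token boundaries, otherwise None."""
    answer_end = answer_start + len(answer_text)
    tokens = tokenize_with_offsets(context)
    start_tok = next((i for i, (_, s, _) in enumerate(tokens) if s == answer_start), None)
    end_tok = next((i for i, (_, _, e) in enumerate(tokens) if e == answer_end), None)
    if start_tok is None or end_tok is None:
        return None  # span cuts through a token, so the example would be dropped
    return start_tok, end_tok

# Hypothetical context, not taken from SQuAD.
context = "The school opened in 1854-55 near the river."
answer_text = "1854"
answer_start = context.index(answer_text)

# "1854-55" is a single token here, so no token ends where the answer ends.
aligned = char_span_to_token_span(context, answer_start, answer_text)

# With the hyphen separated by spaces, "1854" becomes its own token.
spaced = "The school opened in 1854 - 55 near the river."
aligned_spaced = char_span_to_token_span(spaced, spaced.index(answer_text), answer_text)

print(aligned, aligned_spaced)
```

Running the same kind of check over your own dataset with your own tokenizer should show which answers fail and what the surrounding text looks like.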
You can look for your erroneous examples manually and try to fix them. More often than not, a large number of these examples follow a similar pattern and can be fixed by some manipulation of the context. But at the same time, remember that if you make changes to your context text, you will also have to update the ground truth answer spans accordingly.
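As a rough illustration of changing the context and updating the span together, here is a sketch that puts spaces around hyphens between digits and shifts the answer start by the number of characters inserted before it. The function name `pad_hyphens` and the sample text are hypothetical, and the sketch assumes the answer itself does not contain or begin with such a hyphen. The pattern to fix in an Arabic dataset may well be different.

```python
import re

# Hyphen with a digit on both sides, e.g. the "-" in "1854-55".
DIGIT_HYPHEN = re.compile(r"(?<=\d)-(?=\d)")

def pad_hyphens(context, answer_start):
    """Replace digit-hyphen-digit with 'digit - digit' and return the new
    context together with the shifted answer start."""
    new_context = DIGIT_HYPHEN.sub(" - ", context)
    # Every padded hyphen that occurs before the answer adds two characters.
    hyphens_before = sum(1 for m in DIGIT_HYPHEN.finditer(context)
                         if m.start() < answer_start)
    return new_context, answer_start + 2 * hyphens_before

# Hypothetical example, not taken from any dataset.
context = "Enrolment grew in 1842-43 and the school opened in 1854-55."
answer_text = "1854"
answer_start = context.index(answer_text)

new_context, new_start = pad_hyphens(context, answer_start)

# The updated span must still point at the same answer text.
print(new_context[new_start:new_start + len(answer_text)])
```

Whatever manipulation you choose, it is worth asserting for every example that the context sliced at the updated span still equals the answer text.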

Marwa199527 commented 3 years ago

Thank you very much @kushalj001

kushalj001 / pytorch-question-answering

Start_index #8