Fix QA task preprocessing

RKorzeniowski commented 3 years ago

Hi, very cool lib. Just wanted to say that the pre_process_squad function does not work correctly when following the docs. There are two problems when the nlp package is used like this: nlp.load_dataset('squad_v2').

1. Column names differ, to be exact "answers" and "answer_text".
2. Answers are given in dict(list(str)) format, but the tokenization that sets the start and end token targets works as if they were dict(str). This ends up setting all targets to (0, 0).

I had to fix this for my use case, so if you want I can make a PR with the fixes. Let me know if there is anything I should do first, like running tests.
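
To illustrate the second problem, here is a minimal sketch. It assumes the squad_v2 layout in which the "answers" column is a dict of lists. The record below is made up for illustration, and first_answer is a hypothetical helper, not blurr's actual code or the fix that was eventually merged.

```python
# Made-up record mimicking the squad_v2 layout: each value under
# "answers" is a list, not a single string or integer.
record = {
    "context": "Blurr integrates Hugging Face transformers with fastai.",
    "question": "What does Blurr integrate with fastai?",
    "answers": {"text": ["Hugging Face transformers"], "answer_start": [17]},
}


def first_answer(example):
    """Return (text, start_char, end_char) for the first answer.

    Hypothetical helper: falls back to ("", 0, 0) when the question has
    no answer, which is how unanswerable squad_v2 rows appear (empty lists).
    """
    texts = example["answers"]["text"]
    starts = example["answers"]["answer_start"]
    if not texts:
        return "", 0, 0
    text, start = texts[0], starts[0]
    return text, start, start + len(text)


text, start, end = first_answer(record)
# The character span should reproduce the answer text exactly.
assert record["context"][start:end] == text
```

Code that treats example["answers"]["text"] as a plain string would compare a list against the context and never find a match, which is consistent with every target collapsing to (0, 0).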

ohmeow commented 3 years ago

Yah if you want to make a PR go for it.

The project is built on nbdev, so the process for developing and submitting PRs is the same as for libraries like fastai. See https://docs.fast.ai/dev-setup.

In particular, make sure you run nbdev_install_git_hooks right after you git clone the library. If you want to add some tests that would be great too. Check out the nbdev docs for how to do that and work on any project based on it: https://nbdev.fast.ai/.

Thanks and lmk if you have any questions.

-wg

On Sun, Nov 8, 2020 at 12:59 AM RKorzeniowski notifications@github.com wrote:

Hi, very cool lib. Just wanted to say that the pre_process_squad https://github.com/ohmeow/blurr/blob/master/blurr/data/question_answering.py function does not work correctly when following the docs https://ohmeow.github.io/blurr/modeling-question-answering/. There are two problems when huggingface datasets (the updated nlp package) is used like this: nlp.load_dataset('squad_v2') https://huggingface.co/docs/datasets/package_reference/loading_methods.html .

1. Column names differ, to be exact "answers" and "answer_text".

2. Answers are given in dict(list(str)) format, but the tokenization that sets the start and end token targets works as if they were dict(str). This ends up setting all targets to (0, 0).

I had to fix this for my use case, so if you want I can make a PR with the fixes. Let me know if there is anything I should do first, like running tests.

ohmeow commented 3 years ago

I think this is fixed now so I'm closing it out. If you're still seeing issues, feel free to reopen.

ohmeow / blurr

Fix QA task preprocessing #19