Data format - Githubissues

s-ankur / hindi_grammar_correction

Hindi Grammar Correction

Data format #2

Closed saramoeini20 closed 1 year ago

saramoeini20 commented 1 year ago

Hi, I'm working on GEC for a low-resource language and want to create the datasets myself. I have some questions and would be thankful if you could answer them. 1) Regarding the training dataset, I saw that you created the artificial data yourself. Is it in a parallel file format? If so, does that mean that whatever training approach we select, we need this format and not the M2 format?

2) For evaluation you used the M2 format, but I couldn't tell whether you used ERRANT and adapted it to your language or wrote something like ERRANT from scratch. I ask because I don't think ERRANT supports my language.

3) What approach do you suggest for training a model for a low-resource language? Can I get help from your model?

s-ankur commented 1 year ago

Hi, yes, the artificial data is in a parallel format and not M2. It is stored as src and trg files (i.e. the 1st line of src is parallel to the 1st line of trg). We also have a tgt file, which is just a copy of the trg file. Typically, for training any kind of model, all you need is src and trg.
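As an illustration of the line-aligned layout described above, here is a minimal sketch of reading a src/trg pair into sentence pairs. The function name `read_parallel` and the file names in the usage note are hypothetical, not taken from the repository.

```python
from pathlib import Path


def read_parallel(src_path, trg_path, encoding="utf-8"):
    """Return a list of (source, target) sentence pairs, aligned by line number."""
    src_lines = Path(src_path).read_text(encoding=encoding).splitlines()
    trg_lines = Path(trg_path).read_text(encoding=encoding).splitlines()
    # Line i of src must correspond to line i of trg, so the counts must match.
    if len(src_lines) != len(trg_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} src vs {len(trg_lines)} trg"
        )
    return list(zip(src_lines, trg_lines))
```

With files named, say, `train.src` and `train.trg`, calling `read_parallel("train.src", "train.trg")` would give the (erroneous, corrected) pairs that most sequence-to-sequence training setups expect.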

M2 format is used for eval. The reason it's different is that the M2 file is created manually, by looking at the errors in the test set and hand-annotating them to create the ground truth. It's not necessary to have M2 files for eval, but they give a better estimate of the score. You can look up the m2scorer documentation, but it's completely optional.
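For readers who have not seen an M2 file: each sentence is an `S` line of tokens followed by zero or more `A` lines, one per annotated edit, with fields separated by `|||`. The sketch below parses that layout. The English sample sentence and the function name `parse_m2` are illustrative assumptions, not data or code from this repository.

```python
# One annotated sentence: replace token span [1, 2) ("are") with "is".
SAMPLE_M2 = """S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
"""


def parse_m2(text):
    """Parse M2-formatted text into a list of {"tokens": [...], "edits": [...]}."""
    sentences = []
    # Sentence blocks are separated by blank lines.
    for block in text.strip().split("\n\n"):
        entry = None
        for line in block.splitlines():
            if line.startswith("S "):
                entry = {"tokens": line[2:].split(), "edits": []}
            elif line.startswith("A ") and entry is not None:
                fields = line[2:].split("|||")
                start, end = map(int, fields[0].split())
                entry["edits"].append({
                    "start": start,          # token offset where the edit begins
                    "end": end,              # token offset where the edit ends (exclusive)
                    "type": fields[1],       # error type label
                    "correction": fields[2], # replacement text
                    "annotator": int(fields[-1]),
                })
        if entry is not None:
            sentences.append(entry)
    return sentences
```

Writing such a file by hand means going through the test sentences one by one and recording each edit span and correction, which is why it is more work than keeping a plain target file.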

I would suggest starting with error-creating augmentation and using the multilingual BERT (mBERT) pretrained model.
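Error-creating augmentation means taking clean sentences and injecting synthetic errors, so that the noisy sentence becomes the source and the clean one the target. The following is a rough sketch of that idea, not the method used in this repository: the confusion set, the probabilities, and the names `corrupt` and `make_pairs` are all assumptions made for illustration.

```python
import random


def corrupt(tokens, confusion, rng, p=0.15):
    """Return a noisy copy of `tokens` with substitution, deletion and swap errors."""
    noisy = []
    for tok in tokens:
        r = rng.random()
        if tok in confusion and r < p:
            # Substitute a confusable form (e.g. a wrong inflection).
            noisy.append(rng.choice(confusion[tok]))
        elif r < p / 3:
            # Drop the token entirely.
            continue
        else:
            noisy.append(tok)
    # Occasionally swap one pair of adjacent tokens.
    if len(noisy) > 1 and rng.random() < p:
        i = rng.randrange(len(noisy) - 1)
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy


def make_pairs(clean_sentences, confusion, seed=0, p=0.15):
    """Build (noisy source, clean target) pairs from clean sentences."""
    rng = random.Random(seed)  # fixed seed so the generated data is reproducible
    pairs = []
    for sent in clean_sentences:
        noisy = corrupt(sent.split(), confusion, rng, p)
        pairs.append((" ".join(noisy), sent))
    return pairs


# Hypothetical confusion set: each word maps to forms it is easily confused with.
EXAMPLE_CONFUSION = {"जाता": ["जाती", "जाते"], "है": ["हैं"]}
```

Calling `make_pairs(["लड़का स्कूल जाता है"], EXAMPLE_CONFUSION)` yields one pair per input sentence; the first element can be written to the src file and the second to the trg file.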

s-ankur commented 1 year ago

https://huggingface.co/bert-base-multilingual-cased

Click on "Use in Transformers" to get a code sample. As long as your language has a relative in mBERT, you should get decent performance.
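For reference, a minimal sketch of loading this checkpoint through the Hugging Face `transformers` fill-mask pipeline. It assumes `transformers` and a backend such as PyTorch are installed and that the weights can be downloaded; the helper name `top_fill` is hypothetical.

```python
MODEL_NAME = "bert-base-multilingual-cased"


def top_fill(text, k=5):
    """Return the k most likely fillers for the [MASK] token in `text`."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import pipeline

    # Downloads the model on first use.
    fill = pipeline("fill-mask", model=MODEL_NAME)
    return [candidate["token_str"] for candidate in fill(text, top_k=k)]
```

A quick sanity check for a new language is to mask one word in a simple sentence, for example `top_fill("Paris is the capital of [MASK].")`, and see whether the suggestions are plausible.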

saramoeini20 commented 1 year ago

By saying "It's not necessary to have m2 files for eval", did you mean that I can later do the evaluation with a simple format like source/target files? And how exactly can I make an M2 file manually, as you mentioned? Did you mean I should annotate each sentence one by one? Also, can you please share a good link on error-creating augmentation? I couldn't understand exactly what it is.

s-ankur commented 1 year ago

Yes, you can; M2 files are not essential.
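One way to evaluate with only source, system output, and target files is to extract edits automatically and compare them. The sketch below does this with Python's `difflib`; it is a rough approximation under that assumption, not the m2scorer or ERRANT algorithm, and the names `extract_edits` and `f_beta` are hypothetical.

```python
from difflib import SequenceMatcher


def extract_edits(src_tokens, out_tokens):
    """Return the set of (start, end, replacement) edits turning src into out."""
    matcher = SequenceMatcher(a=src_tokens, b=out_tokens, autojunk=False)
    return {
        (i1, i2, tuple(out_tokens[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def f_beta(sources, hypotheses, references, beta=0.5):
    """Edit-level precision, recall and F-beta over line-aligned sentences."""
    tp = fp = fn = 0
    for src, hyp, ref in zip(sources, hypotheses, references):
        src_tokens = src.split()
        hyp_edits = extract_edits(src_tokens, hyp.split())
        ref_edits = extract_edits(src_tokens, ref.split())
        tp += len(hyp_edits & ref_edits)  # edits the system got right
        fp += len(hyp_edits - ref_edits)  # edits the system should not have made
        fn += len(ref_edits - hyp_edits)  # edits the system missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f
```

Because the edits here come from an automatic alignment against a single target, the score can differ from one computed against hand-annotated M2 edits, which is the sense in which M2 gives a better estimate.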

You can take a look at the m2scorer documentation, but I would recommend skipping it altogether.

My paper does have a few citations covering the many ways we can do error-creating augmentation.

If you would like to talk further, you can send me a Google Meet invite at s.ankursonawane@gmail.com sometime this week.

saramoeini20 commented 1 year ago

I will reach out to you if I have any questions; for now, your explanation helped. Thank you so much.