google-research-datasets / clang8

cLang-8 is a dataset for grammatical error correction.

Data Format For GEC #13

Open saramoeini20 opened 1 year ago

saramoeini20 commented 1 year ago

Hi, I'm working on GEC for a low-resource language and want to create the datasets myself. I have a few questions and would be grateful for any answers.

1) I see that the training data is in a parallel file format. Should the evaluation data be in M2 format? And is M2 format used only for evaluation in GEC?

2) If I want to generate feedback on errors or show the location of an error in GEC, is the parallel file format still usable, or should I change the format?

3) What approach do you suggest for training a model for a low-resource language? Can I build on your model from the paper "A Simple Recipe for Multilingual Grammatical Error Correction"?
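
For reference, here is a minimal sketch of how the two formats mentioned in questions 1 and 2 relate. It assumes the commonly used M2 layout (an `S` line with the tokenized source, followed by `A start end|||type|||correction|||REQUIRED|||-NONE-|||annotator` lines). The sample sentence and the helper names `parse_m2_block` and `apply_edits` are made up for illustration and are not part of cLang-8 or any library.

```python
# Hypothetical M2 block: one tokenized source sentence and two edits.
M2_EXAMPLE = """S I has a apple .
A 1 2|||R:VERB:SVA|||have|||REQUIRED|||-NONE-|||0
A 2 3|||R:DET|||an|||REQUIRED|||-NONE-|||0"""


def parse_m2_block(block):
    """Split one M2 block into source tokens and a list of edits."""
    lines = block.strip().split("\n")
    source = lines[0][2:].split()  # drop the leading "S "
    edits = []
    for line in lines[1:]:
        if not line.startswith("A "):
            continue
        fields = line[2:].split("|||")
        start, end = (int(x) for x in fields[0].split())
        edits.append({
            "start": start,            # token offset where the error begins
            "end": end,                # token offset where the error ends
            "type": fields[1],         # error type label
            "correction": fields[2],   # replacement text
            "annotator": int(fields[5]),
        })
    return source, edits


def apply_edits(source, edits, annotator=0):
    """Apply one annotator's edits to get the corrected sentence."""
    tokens = list(source)
    offset = 0  # tracks how earlier edits shifted token positions
    for e in sorted(edits, key=lambda e: (e["start"], e["end"])):
        if e["annotator"] != annotator or e["type"] == "noop":
            continue
        corr = e["correction"]
        repl = [] if corr == "-NONE-" else corr.split()
        tokens[e["start"] + offset:e["end"] + offset] = repl
        offset += len(repl) - (e["end"] - e["start"])
    return " ".join(tokens)


source, edits = parse_m2_block(M2_EXAMPLE)
# One row of a parallel file: source sentence, tab, corrected sentence.
parallel_line = " ".join(source) + "\t" + apply_edits(source, edits)
print(parallel_line)
```

Under these assumptions, the M2 edits carry the token span and error type that a feedback feature would need, while the parallel line keeps only the source and corrected text.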