soyoung97 / Standard_Korean_GEC


Data Format #3

Closed saramoeini20 closed 1 year ago

saramoeini20 commented 1 year ago

Hi, I have a question about training and test data. I have seen both the M2 format and the parallel file format used for GEC tasks. Could you please explain which format is used in which situation?

soyoung97 commented 1 year ago

Hi! The parallel file format is used to train models. The M2 format is needed to evaluate the output of systems and models. Please refer to https://github.com/nusnlp/m2scorer for details on the M2 format!
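To make the relationship between the two formats concrete, here is a minimal sketch in Python. It assumes the usual M2 layout (an `S` line with the tokenized source, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`). The function name `apply_m2_edits` and the example sentence are made up for illustration and are not part of this repository or of m2scorer. The sketch applies the edits of one annotator to the source tokens, which recovers the corrected side of a parallel pair.

```python
def apply_m2_edits(m2_block: str, annotator_id: int = 0) -> str:
    """Rebuild the corrected sentence from one M2 block (hypothetical helper)."""
    lines = [ln for ln in m2_block.strip().splitlines() if ln.strip()]
    # The "S" line holds the whitespace-tokenized source sentence.
    tokens = lines[0][2:].split()
    edits = []
    for line in lines[1:]:
        if not line.startswith("A "):
            continue
        fields = line[2:].split("|||")
        start, end = map(int, fields[0].split())
        error_type, correction = fields[1], fields[2]
        annotator = int(fields[-1])
        # Skip other annotators and "noop" edits (sentence already correct).
        if annotator != annotator_id or error_type == "noop":
            continue
        edits.append((start, end, correction))
    # Apply edits from right to left so earlier token offsets stay valid.
    for start, end, correction in sorted(edits, reverse=True):
        tokens[start:end] = correction.split()
    return " ".join(tokens)


example = """S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
"""
corrected = apply_m2_edits(example)
print(corrected)
```

In other words, an M2 file carries the source sentence plus explicit, typed edits, while a parallel file only carries the source and the corrected text side by side.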

saramoeini20 commented 1 year ago

So to train the model, regardless of which approach we select, we need the parallel file format? And does "human-annotated data" mean data in the M2 format?

saramoeini20 commented 1 year ago

And because I want to do GEC for a low-resource language, I have to create the dataset myself. What should I do to produce something like the M2 format? I saw ERRANT, but it is for English. How did you do this for your language? Should I modify ERRANT?

soyoung97 commented 1 year ago

I'm not quite sure I understood what you mean. Is it correct that you want the following?

  1. You want to create a GEC dataset for a low-resource language.
  2. You want to create an M2 file from that dataset.

If this is true, here are my answers:

  1. You should create something that has (1) text with grammatical errors and (2) text that fixes these errors. This pairing is why it is called a "parallel" dataset.
  2. Producing a correct M2 file that reflects the linguistics of a non-English language is not trivial, and you would need to make some modifications. Please have a look at papers for other languages that come with open-sourced code, for example papers citing ERRANT (https://www.semanticscholar.org/paper/Automatic-Annotation-and-Evaluation-of-Error-Types-Bryant-Felice/4cac1e1eb876ffdcfda5a62d5237f942b519a502), and at the ERRANT code itself. In my case, the system that converts parallel corpora into the M2 file format and assigns the correct error types is called "KAGAS". The conversion code is mainly in two files: (1) parallel_to_m2 and (2) align_text_korean.

Naively, it would be possible to use ERRANT directly without any modifications (I don't know whether the spaCy tokenizer used in ERRANT correctly tokenizes other languages, though). However, it will classify tokens and assign scores to them according to the rules of English (it uses the English lemmatizer and so on), so you would need to modify ERRANT to fit your language if you want an accurate M2 file (that is, if you are going to use it for evaluation). I hope this information helps!

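As a rough illustration of what a language-agnostic starting point could look like, here is a minimal Python sketch. It is not KAGAS and not ERRANT: it assumes whitespace tokenization, aligns the two sides with the standard library's `difflib.SequenceMatcher`, and labels every edit with the placeholder type `UNK`, because assigning real error types is exactly the language-specific part that needs custom rules. The function name `parallel_to_m2` here is a hypothetical helper written for this example.

```python
from difflib import SequenceMatcher


def parallel_to_m2(source: str, target: str, annotator_id: int = 0) -> str:
    """Convert one (source, target) pair into an M2 block with untyped edits."""
    # Assumption: tokens are separated by whitespace. A real system would use
    # a tokenizer or morphological analyzer suited to the language.
    src, tgt = source.split(), target.split()
    lines = ["S " + " ".join(src)]
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        correction = " ".join(tgt[j1:j2])  # empty string for a pure deletion
        # "UNK" is a placeholder; error typing is language-specific.
        lines.append(
            f"A {i1} {i2}|||UNK|||{correction}|||REQUIRED|||-NONE-|||{annotator_id}"
        )
    if len(lines) == 1:
        # No differences: emit a "noop" edit to mark the sentence as correct.
        lines.append(
            f"A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||{annotator_id}"
        )
    return "\n".join(lines)


m2_block = parallel_to_m2("This are a sentence .", "This is a sentence .")
print(m2_block)
```

A file produced this way is enough to try out an M2-based scorer, but the alignment is purely surface-level, so edit boundaries and types will be cruder than what a language-aware aligner and classifier would give.
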
saramoeini20 commented 1 year ago

It helped me. Thank you so much.

soyoung97 commented 1 year ago

Since it seems like it's solved, changing the status to closed!