grammarly/gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)
Apache License 2.0

There seems to be no pre-processor for the stage 2 and 3 data in m2 format #107

Closed: abhinavdayal closed this issue 3 years ago

abhinavdayal commented 3 years ago

It would be nice to know more about the fine-tuning stages in detail. For example, I would like to fine-tune this model on a specific domain. Is fine-tuning the same as training, i.e. do we call the train.py methods? Training requires data in the form of input and output sentences, from which a preprocessor produces delimited tokens and tags; these are then tokenized, turned into instances, and fed to the embedder. The codebase does not seem to have a way to take in M2 files and bring them into a trainable format.
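
For reference, the repository's README documents the two steps the question refers to, roughly as follows (flag names are taken from the README at the time of writing and may differ across versions):

```
# Convert parallel source/target sentence files into GECToR's tagged format
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE

# Fine-tuning reuses the same entry point as training
python train.py --train_set TRAIN_SET --dev_set DEV_SET --model_dir MODEL_DIR
```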

arvindpdmn commented 3 years ago

There's another issue that gives the command for stage 2. I wonder if this gives a clue: https://github.com/grammarly/gector/issues/42

For stage 3, see https://github.com/grammarly/gector/issues/11

skurzhanskyi commented 3 years ago

You can convert the M2 format to the parallel one, then convert that to ours. In general, M2 is not a good fit here, since an M2 file may contain edits from multiple annotators.
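
For anyone hitting the same wall, below is a minimal sketch of that M2-to-parallel step. It is not part of the gector codebase: the function name and file paths are illustrative, it keeps the edits of a single annotator (id "0" by default), and it assumes ERRANT-style M2 where a `noop` edit marks an unchanged sentence. The resulting source/target files can then be fed to the repository's parallel-to-tagged preprocessing.

```python
def m2_to_parallel(m2_path, src_path, tgt_path, annotator_id="0"):
    # M2 entries are separated by blank lines; each starts with an "S " line
    # followed by zero or more "A start end|||type|||correction|||...|||annotator" lines.
    with open(m2_path, encoding="utf-8") as f:
        entries = f.read().strip().split("\n\n")

    with open(src_path, "w", encoding="utf-8") as src_f, \
         open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for entry in entries:
            lines = entry.split("\n")
            source_tokens = lines[0].split()[1:]  # drop the leading "S"
            target_tokens = list(source_tokens)
            offset = 0  # length change introduced by earlier edits
            for line in lines[1:]:
                if not line.startswith("A "):
                    continue
                fields = line[2:].split("|||")
                span, etype, correction = fields[0], fields[1], fields[2]
                if fields[-1] != annotator_id:  # keep one annotator's edits
                    continue
                if etype == "noop":  # "no change" marker, span is -1 -1
                    continue
                start, end = map(int, span.split())
                replacement = correction.split() if correction and correction != "-NONE-" else []
                target_tokens[start + offset:end + offset] = replacement
                offset += len(replacement) - (end - start)
            src_f.write(" ".join(source_tokens) + "\n")
            tgt_f.write(" ".join(target_tokens) + "\n")


# Hypothetical file names for stage 2 data:
m2_to_parallel("stage2.m2", "stage2.src", "stage2.tgt")
```

Applying edits left to right with a running offset works because M2 edits for a single annotator are sorted by position and non-overlapping.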