Questions about processing synthetic data

grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)

Apache License 2.0


Questions about processing synthetic data #154

Status: Closed (closed by liangnn17 2 years ago)

liangnn17 commented 2 years ago

Hi,

I noticed that the tokenization used in the PIE data differs from that of the NUCLE and FCE data you used. Do I need to detokenize the PIE data and retokenize it with spaCy myself?

Looking forward to your advice!

mina1460 commented 2 years ago

No, you don't need to tokenize it yourself. You can use the preprocessing script they provide to get the data into a format compatible with GECToR.

skurzhanskyi commented 2 years ago

Hi @liangnn17. The tokenization for PIE may indeed differ slightly from the one used in the BEA data, but I don't think it would affect quality significantly.
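
To get a feel for how far two tokenization schemes diverge on your own data, one option is to compare where each scheme places its token boundaries. The following is a minimal sketch, assuming both versions are whitespace-separated tokenizations of the same underlying text. The function names `boundary_set` and `tokenization_agreement` are hypothetical helpers, not part of GECToR or PIE, and the two example sentences are made up for illustration.

```python
def boundary_set(tokenized: str) -> set:
    """Return the character offsets (spaces ignored) at which tokens end."""
    offsets = set()
    pos = 0
    for token in tokenized.split():
        pos += len(token)
        offsets.add(pos)
    return offsets


def tokenization_agreement(a: str, b: str) -> float:
    """Jaccard overlap of token boundaries for two tokenizations of one sentence.

    Returns 1.0 when both split the text identically. Raises ValueError if the
    inputs differ in anything other than whitespace.
    """
    if "".join(a.split()) != "".join(b.split()):
        raise ValueError("inputs are not tokenizations of the same text")
    bounds_a, bounds_b = boundary_set(a), boundary_set(b)
    if not bounds_a and not bounds_b:
        return 1.0  # two empty sentences agree trivially
    return len(bounds_a & bounds_b) / len(bounds_a | bounds_b)


# Illustrative sentences only (not taken from the PIE or BEA data).
style_one = "I do n't think it 's wrong ."
style_two = "I don't think it's wrong ."
print(tokenization_agreement(style_one, style_two))  # 0.75
```

Averaging this score over a sample of sentences gives a rough idea of whether the mismatch is occasional (for example, only contractions) or widespread enough to be worth retokenizing.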