Advice about training with additional synthetic dataset

grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)

Apache License 2.0


Advice about training with additional synthetic dataset #114

Closed rachelwrr closed 3 years ago

rachelwrr commented 3 years ago

Hi,

Thanks for the work!

I'm looking for some advice. If I want to add a synthetic dataset targeting a few specific grammar errors, in what order would you recommend training the model? Will changing the order of the 3 training stages affect the result?

Should I fine-tune on top of your pretrained model (after Stage 3), or restart the training process and include the new dataset in Stage 1?

I'm new to this area. Any advice will be appreciated :)

Thanks!

skurzhanskyi commented 3 years ago

Hi, I think this depends on how much your errors differ from those in the dataset. In general, I would suggest adding these errors to Stage 1 and then applying Stages 2 & 3, since your data is synthetic.
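One way to read this suggestion, as a minimal sketch rather than anything taken from the repository: merge the extra synthetic pairs into the Stage 1 data and shuffle before training, then run the later stages unchanged. The function name `mix_stage1_data` and the (source, target) tuple format are assumptions made for illustration; they are not part of GECToR.

```python
import random

def mix_stage1_data(base_pairs, extra_pairs, seed=0):
    """Combine the existing Stage 1 pairs with extra synthetic pairs and shuffle.

    Each pair is (erroneous_source, corrected_target). Hypothetical helper,
    not a GECToR function.
    """
    mixed = list(base_pairs) + list(extra_pairs)
    # A seeded generator keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy data standing in for the original Stage 1 set and the new targeted set.
base = [("She go home .", "She goes home ."), ("I has a cat .", "I have a cat .")]
extra = [("He runs quick .", "He runs quickly .")]
stage1 = mix_stage1_data(base, extra)
```

Shuffling the two sources together, instead of appending the new pairs at the end, avoids the model seeing all the targeted errors in one late block of Stage 1.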

rachelwrr commented 3 years ago


Thanks for the reply! For the dataset, I took 60,000 sentences from the PIE folder a5 (true), then converted adjectives to adverbs, aiming to improve correction of adjective/adverb confusion errors.
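The adjective-to-adverb corruption described above could look roughly like the sketch below. It is only an illustration: the tiny `ADJ_TO_ADV` lexicon and the `inject_adj_adv_error` helper are hypothetical, and a real run would likely need a POS tagger or a much larger word list to find adjectives reliably.

```python
import random

# Hypothetical toy lexicon; assumed for illustration only.
ADJ_TO_ADV = {"quick": "quickly", "careful": "carefully",
              "slow": "slowly", "easy": "easily"}

def inject_adj_adv_error(sentence, rng):
    """Return (corrupted, original), or None if no listed adjective occurs."""
    tokens = sentence.split()
    positions = [i for i, tok in enumerate(tokens) if tok.lower() in ADJ_TO_ADV]
    if not positions:
        return None
    i = rng.choice(positions)  # corrupt a single adjective per sentence
    adverb = ADJ_TO_ADV[tokens[i].lower()]
    if tokens[i][0].isupper():
        adverb = adverb.capitalize()  # keep sentence-initial capitalization
    corrupted = tokens.copy()
    corrupted[i] = adverb
    return " ".join(corrupted), sentence

rng = random.Random(13)
clean = ["She is a careful driver .", "The test was easy .", "He left ."]
# Sentences without a listed adjective are skipped.
pairs = [p for p in (inject_adj_adv_error(s, rng) for s in clean) if p is not None]
```

Each resulting pair holds the corrupted sentence as the source and the original correct sentence as the target, which is the direction a correction model needs.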