Update transformers, move to fast tokenizers, support large configurations of encoders.

grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)

Apache License 2.0


Update transformers, move to fast tokenizers, support large configurations of encoders. #120

Status: Closed (MaksTarnavskyi closed 3 years ago)

MaksTarnavskyi commented 3 years ago

The main changes:

Update transformers to version 4.2.2.
Use fast tokenizers instead of the previous custom implementation.
Support large encoder configurations: bert-large, roberta-large, xlnet-large.
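
To illustrate why fast tokenizers help a word-level tagger, here is a minimal sketch, not the code from this PR. It assumes the model needs the position of the first subword of every word, and that the tokenizer reports which word each subword came from. The helper name `first_subword_offsets` and the sample input are hypothetical.

```python
from typing import List, Optional


def first_subword_offsets(word_ids: List[Optional[int]]) -> List[int]:
    """Return, for each word, the index of its first subword token.

    `word_ids` has one entry per subword token: the index of the word the
    token came from, or None for special tokens such as [CLS] and [SEP].
    """
    offsets: List[int] = []
    previous = None
    for token_index, word_id in enumerate(word_ids):
        # A new word starts whenever the word index changes to a non-None value.
        if word_id is not None and word_id != previous:
            offsets.append(token_index)
        previous = word_id
    return offsets


# Hand-written stand-in for a tokenizer's output (an assumption, not real
# model output): special token, word 0 as one piece, word 1 as three pieces,
# word 2 as one piece, special token.
example_word_ids = [None, 0, 1, 1, 1, 2, None]
print(first_subword_offsets(example_word_ids))  # [1, 2, 5]
```

With a fast tokenizer from transformers, a list of this shape can be obtained by calling the tokenizer on pre-split words with `is_split_into_words=True` and then calling `word_ids()` on the returned encoding. Whether the PR computes its offsets this way is an assumption.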

skurzhanskyi commented 3 years ago

Hey @MaksTarnavskyi, huge thanks for such a giant piece of work! I'm starting the PR review.

skurzhanskyi commented 3 years ago

Overall, the code looks good to me. However, the new tokenization doesn't work with our previous pretrained models: I got an F_0.5 of about 28.11 on CoNLL-2014 (test) with our BERT model. At the same time, this tokenization is the more correct one (as mentioned in https://github.com/grammarly/gector/issues/50). We'll try to reproduce the pipeline with your codebase and release this PR along with the new models.
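
For readers unfamiliar with the metric, F_0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below only shows the formula. The function name `f_beta` and the counts are made up for illustration, and it does not reproduce the 28.11 figure above, which comes from scoring edits on the actual test set.

```python
def f_beta(true_pos: int, false_pos: int, false_neg: int, beta: float = 0.5) -> float:
    """F-beta score from raw counts; beta < 1 favours precision over recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts: 50 correct edits, 25 spurious edits, 50 missed edits.
score = f_beta(true_pos=50, false_pos=25, false_neg=50)
print(f"F_0.5 = {score:.3f}")  # F_0.5 = 0.625
```

GEC papers usually report this value multiplied by 100, so 0.625 here would be written as 62.5.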

skurzhanskyi commented 3 years ago

We'll push this to a temporary branch for now so that we can add our updates.