This is an implementation of the following paper:
@inproceedings{omelianchuk-etal-2020-gector,
title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
author = "Omelianchuk, Kostiantyn and
Atrasevych, Vitaliy and
Chernodub, Artem and
Skurzhanskyi, Oleksandr",
booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
month = jul,
year = "2020",
address = "Seattle, WA, USA → Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.bea-1.16",
doi = "10.18653/v1/2020.bea-1.16",
pages = "163--170"
}
Confirmed to work on Python 3.11.0.
pip install -r requirements.txt
# Download the verb dictionary in advance
mkdir data
cd data
wget https://github.com/grammarly/gector/raw/master/data/verb-form-vocab.txt
python predict.py \
--input <raw text file> \
--restore_dir gotutiyan/gector-roberta-base-5k \
--out <path to output file>
from transformers import AutoTokenizer
from gector import GECToR, predict, load_verb_dict
model_id = 'gotutiyan/gector-roberta-base-5k'
model = GECToR.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
encode, decode = load_verb_dict('data/verb-form-vocab.txt')
srcs = [
'This is a correct sentence.',
'This are a wrong sentences'
]
corrected = predict(
model, tokenizer, srcs,
encode, decode,
keep_confidence=0.0,
min_error_prob=0.0,
n_iteration=5,
batch_size=2,
)
print(corrected)
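The `keep_confidence` and `min_error_prob` arguments control how eager the model is to edit. Below is a toy, simplified sketch of the idea, not the actual inference code of this repository (in GECToR the minimum error probability is applied at the sentence level; it is shown per token here for brevity):

```python
def choose_tag(probs, labels, keep_confidence=0.0, min_error_prob=0.0):
    """Toy illustration of GECToR's two inference thresholds:
    - keep_confidence is a bias added to the $KEEP probability,
      pushing the model toward leaving tokens unchanged;
    - if the best non-$KEEP probability is below min_error_prob,
      the edit is suppressed and $KEEP is used instead.
    """
    scores = dict(zip(labels, probs))
    scores['$KEEP'] += keep_confidence
    best = max(scores, key=scores.get)
    if best != '$KEEP' and scores[best] < min_error_prob:
        return '$KEEP'
    return best

labels = ['$KEEP', '$DELETE', '$APPEND_to']
print(choose_tag([0.40, 0.15, 0.45], labels))                       # $APPEND_to
print(choose_tag([0.40, 0.15, 0.45], labels, keep_confidence=0.1))  # $KEEP
print(choose_tag([0.40, 0.15, 0.45], labels, min_error_prob=0.5))   # $KEEP
```

Setting both to 0.0, as above, disables the thresholds entirely.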
To predict with the official pre-trained models, add the --from_official flag and the related options starting with --official. The data/output_vocabulary directory can be found in the official repository: https://github.com/grammarly/gector/tree/master/data/output_vocabulary
# An example to use official BERT model.
# Download the official model.
wget https://grammarly-nlp-data-public.s3.amazonaws.com/gector/bert_0_gectorv2.th
# Predict with the official model.
python predict.py \
--input <raw text file> \
--restore bert_0_gectorv2.th \
--out out.txt \
--from_official \
--official.vocab_path data/output_vocabulary \
--official.transformer_model bert-base-cased \
--official.special_tokens_fix 0 \
--official.max_length 80
To use the official models via the API, call GECToR.from_official_pretrained() instead of GECToR.from_pretrained():
from transformers import AutoTokenizer
from gector import GECToR, predict, load_verb_dict
model = GECToR.from_official_pretrained(
'bert_0_gectorv2.th',
special_tokens_fix=0,
transformer_model='bert-base-cased',
vocab_path='data/output_vocabulary',
max_length=80
)
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
encode, decode = load_verb_dict('data/verb-form-vocab.txt')
I performed experiments using this implementation. The trained models can also be obtained from the Hugging Face Hub.
Model | Confidence | Threshold | BEA19-dev (P/R/F0.5) | CoNLL14 (P/R/F0.5) | BEA19-test (P/R/F0.5) |
---|---|---|---|---|---|
BERT [Omelianchuk+ 2020] | | | | 72.1/42.0/63.0 | 71.5/55.7/67.6 |
RoBERTa [Omelianchuk+ 2020] | | | | 73.9/41.5/64.0 | 77.2/55.1/71.5 |
XLNet [Omelianchuk+ 2020] | | | 66.0/33.8/55.5 | 77.5/40.1/65.3 | 79.2/53.9/72.4 |
DeBERTa [Tarnavskyi+ 2022](Table 3) | | | 64.2/31.8/53.8 | | |
gotutiyan/gector-bert-base-cased-5k | 0.4 | 0.6 | 64.5/30.0/52.4 | 73.0/33.6/59.1 | 76.8/48.7/68.9 |
gotutiyan/gector-roberta-base-5k | 0.5 | 0.0 | 65.8/31.8/54.2 | 74.6/35.7/61.3 | 78.5/51.0/70.8 |
gotutiyan/gector-xlnet-base-cased-5k | 0.5 | 0.0 | 67.2/30.7/54.3 | 77.2/34.4/61.8 | 78.8/49.9/70.7 |
gotutiyan/gector-deberta-base-5k | 0.4 | 0.3 | 64.1/34.5/54.7 | 73.7/38.8/62.5 | 76.0/54.2/70.4 |
Model | Confidence | Threshold | BEA19-dev (P/R/F0.5) | CoNLL14 (P/R/F0.5) | BEA19-test (P/R/F0.5) |
---|---|---|---|---|---|
RoBERTa [Tarnavskyi+ 2022] | | | 65.7/33.8/55.3 | | 80.7/53.3/73.2 |
XLNet [Tarnavskyi+ 2022] | | | 64.2/35.1/55.1 | | |
DeBERTa [Tarnavskyi+ 2022] | | | 66.3/32.7/55.0 | | |
DeBERTa (basetag) [Mesham+ 2023] | | | 68.1/38.1/58.8 | | 77.8/56.7/72.4 |
gotutiyan/gector-bert-large-cased-5k | 0.5 | 0.0 | 64.7/32.0/53.7 | 75.9/36.8/62.6 | 77.2/50.4/69.8 |
gotutiyan/gector-roberta-large-5k | 0.4 | 0.6 | 65.7/34.3/55.5 | 75.4/37.1/62.5 | 78.5/53.7/71.9 |
gotutiyan/gector-xlnet-large-cased-5k | 0.3 | 0.4 | 63.8/36.5/55.5 | 74.6/41.6/64.4 | 75.9/56.7/71.1 |
gotutiyan/gector-deberta-large-5k | 0.5 | 0.4 | 68.7/33.1/56.6 | 80.0/36.9/64.8 | 81.1/52.8/73.2 |
Model | CoNLL14 (P/R/F0.5) | BEA19-test (P/R/F0.5) | Note |
---|---|---|---|
BERT(base) + RoBERTa(base) + XLNet(base) [Omelianchuk+ 2020] | 78.2/41.5/66.5 | 78.9/58.2/73.6 | |
gotutiyan/gector-bert-base-cased-5k + gotutiyan/gector-roberta-base-5k + gotutiyan/gector-xlnet-base-cased-5k | 80.9/33.3/63.0 | 83.5/48.7/73.1 | The ensemble method is different from Omelianchuk+ 2020. |
RoBERTa(large, 10k) + XLNet(large, 5k) + DeBERTa(large, 10k) [Tarnavskyi+ 2022] | | 84.4/54.4/76.0 | |
gotutiyan/gector-roberta-large-5k + gotutiyan/gector-xlnet-large-cased-5k + gotutiyan/gector-deberta-large-5k | 81.7/37.0/65.8 | 84.0/53.4/75.4 | |
Use the official preprocessing code, e.g.:
mkdir utils
cd utils
wget https://github.com/grammarly/gector/raw/master/utils/preprocess_data.py
wget https://raw.githubusercontent.com/grammarly/gector/master/utils/helpers.py
cd ..
python utils/preprocess_data.py \
-s <raw source file path> \
-t <raw target file path> \
-o <output path>
train.py uses Accelerate. Please configure your environment with accelerate config in advance.
accelerate launch train.py \
--train_file <preprocess output of train> \
--valid_file <preprocess output of validation> \
--save_dir outputs/sample
The best and last checkpoints are saved. The format is:
outputs/sample
├── best
│   ├── added_tokens.json
│   ├── config.json
│   ├── merges.txt
│   ├── pytorch_model.bin
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.json
│   └── vocab.json
├── last
│   └── ... (The same as best/)
└── log.json
The usage is the same as in the Usage section above. You can specify the best/ or last/ directory as --restore_dir.
CLI
python predict.py \
--input <raw text file> \
--restore_dir outputs/sample/best \
--out <path to output file>
Or, to use the API:
from transformers import AutoTokenizer
from gector import GECToR
path = 'outputs/sample/best'
model = GECToR.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)
You can use the --visualize option to output a visualization of the predictions, which is helpful for qualitative analysis. For example:
echo 'A ten years old boy go school' > demo.txt
python predict.py \
--restore_dir gotutiyan/gector-roberta-base-5k \
--input demo.txt \
--visualize visualize.txt
visualize.txt will show:
=== Line 0 ===
== Iteration 0 ==
|$START |A |ten |years |old |boy |go |school |
|$KEEP |$KEEP |$APPEND_- |$TRANSFORM_AGREEMENT_SINGULAR |$KEEP |$KEEP |$TRANSFORM_VERB_VB_VBZ |$KEEP |
== Iteration 1 ==
|$START |A |ten |- |year |old |boy |goes |school |
|$KEEP |$KEEP |$KEEP |$KEEP |$KEEP |$KEEP |$KEEP |$APPEND_to |$KEEP |
== Iteration 2 ==
|$START |A |ten |- |year |old |boy |goes |to |school |
|$KEEP |$KEEP |$KEEP |$KEEP |$APPEND_- |$KEEP |$KEEP |$KEEP |$KEEP |$KEEP |
A ten - year - old boy goes to school
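The iterations above illustrate GECToR's tag-then-apply loop. As a toy illustration (not this repository's actual decoding code), applying one iteration of tags could look like:

```python
def apply_tags(tokens, tags):
    """Apply one iteration of GECToR-style tags to a token list.

    Toy sketch: handles only $KEEP, $DELETE, $APPEND_x and $REPLACE_x;
    the real tag set also includes $TRANSFORM_* operations.
    """
    out = []
    for token, tag in zip(tokens, tags):
        if tag == '$DELETE':
            continue  # drop the token
        if tag.startswith('$REPLACE_'):
            out.append(tag[len('$REPLACE_'):])  # substitute the token
        else:
            out.append(token)
        if tag.startswith('$APPEND_'):
            out.append(tag[len('$APPEND_'):])  # insert after the token
    return out

# Iteration 1 from the visualization above: append "to" after "goes".
tokens = ['$START', 'A', 'ten', '-', 'year', 'old', 'boy', 'goes', 'school']
tags = ['$KEEP'] * 7 + ['$APPEND_to', '$KEEP']
print(' '.join(apply_tags(tokens, tags)[1:]))  # A ten - year old boy goes to school
```

The corrected output is fed back in as the next iteration's input until no edits remain or n_iteration is reached.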
To tweak the two inference parameters, use predict_tweak.py. The following example sweeps both parameters over {0, 0.1, 0.2, ..., 0.9}. kc is the keep confidence and mep is the minimum error probability threshold.
python predict_tweak.py \
--input <raw text file> \
--restore_dir outputs/sample/best \
--kc_min 0 \
--kc_max 1 \
--mep_min 0 \
--mep_max 1 \
--step 0.1
This script creates <--restore_dir>/outputs/tweak_outputs/
and saves each output in it.
models/sample/best/outputs/tweak_outputs/
├── kc0.0_mep0.0.txt
├── kc0.0_mep0.1.txt
├── kc0.0_mep0.2.txt
...
After that, you can determine the best parameters by doing the following:
RESTORE_DIR=${1}
for kc in `seq 0 0.1 0.9` ; do
for mep in `seq 0 0.1 0.9` ; do
# Refer to $RESTORE_DIR/outputs/tweak_outputs/kc${kc}_mep${mep}.txt in the evaluation scripts
done
done
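Once each kc{kc}_mep{mep}.txt output has been scored, choosing the best setting is an argmax over the grid. A minimal sketch, assuming the F0.5 scores have already been collected into a dict (the values below are made up):

```python
# Hypothetical F0.5 scores per (keep confidence, min error prob) pair,
# e.g. parsed from the evaluation script's output for each file.
scores = {
    (0.0, 0.0): 54.2,
    (0.4, 0.6): 55.1,
    (0.5, 0.0): 55.5,
}

# Pick the (kc, mep) pair with the highest F0.5.
best_kc, best_mep = max(scores, key=scores.get)
print(f'best: kc={best_kc}, mep={best_mep}, F0.5={scores[(best_kc, best_mep)]}')
```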
wget https://github.com/MaksTarnavskyi/gector-large/raw/master/ensemble.py
python ensemble.py \
--source_file <source> \
--target_files <hyp1> <hyp2> ... \
--output_file <out>