nusnlp / mlconvgec2018

Code and model files for the paper: "A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction" (AAAI-18).
GNU General Public License v3.0

The size of the training dataset? #18

Closed wangwang110 closed 5 years ago

wangwang110 commented 5 years ago

In the project, 2,210,277 sentence pairs are used, including Lang-8 and NUCLE, but the paper says 1.3M sentence pairs were used for training. So I want to know the exact sizes of the Lang-8 and NUCLE data you used to train the models.

shamilcm commented 5 years ago

We only use the annotated sentence pairs for training (see preprocess.sh inside the training/ directory): https://github.com/nusnlp/mlconvgec2018/blob/415bf08afe6eb40be4a8d7fe4b05ee37b085aebd/training/preprocess.sh#L34-L36

Check the line counts of train.src and train.trg after running preprocess.sh on the extracted 2,210,277 sentence pairs.
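
A quick way to check the resulting training size is to count the lines of the filtered files (a minimal check, assuming preprocess.sh has completed and written its output under processed/ as in the snippet referenced above):

    # Count the annotated-only training pairs kept after filtering.
    # Both files should report the same line count (~1.3M per the paper).
    wc -l processed/train.src processed/train.trg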

wangwang110 commented 5 years ago

We only use the annotated sentence pairs for training (see preprocess.sh inside the training/ directory): mlconvgec2018/training/preprocess.sh

Lines 34 to 36 in 415bf08:

python $SCRIPTS_DIR/get_diff.py processed/train.all src trg > processed/train.annotated.src-trg
cut -f1 processed/train.annotated.src-trg > processed/train.src
cut -f2 processed/train.annotated.src-trg > processed/train.trg

Check the line counts of train.src and train.trg after running preprocess.sh on the extracted 2,210,277 sentence pairs.
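
For context, the get_diff.py step above keeps only the "annotated" pairs, i.e. pairs where the target actually differs from the source. A rough shell equivalent is sketched below (an illustration only: the train.all.src/train.all.trg file layout and the exact behavior of get_diff.py are assumptions, not confirmed from the script):

    # Hypothetical equivalent of the annotated-pair filter: pair up the
    # aligned source/target files and keep only the lines that differ.
    paste processed/train.all.src processed/train.all.trg \
        | awk -F'\t' '$1 != $2' > processed/train.annotated.src-trg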

OK, thank you!