**Closed** — wangwang110 closed this issue 5 years ago
We only use the annotated sentence pairs for training (see `preprocess.sh` inside the `training/` directory): https://github.com/nusnlp/mlconvgec2018/blob/415bf08afe6eb40be4a8d7fe4b05ee37b085aebd/training/preprocess.sh#L34-L36

```sh
python $SCRIPTS_DIR/get_diff.py processed/train.all src trg > processed/train.annotated.src-trg
cut -f1 processed/train.annotated.src-trg > processed/train.src
cut -f2 processed/train.annotated.src-trg > processed/train.trg
```

See the sizes of `train.src` and `train.trg` after running `preprocess.sh` on the extracted 2,210,277 sentence pairs.
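For intuition, here is a minimal sketch of the filtering idea behind the `get_diff.py` step above — I have not checked the script's actual implementation, but the thread indicates it keeps only the "annotated" pairs, i.e. those where the source sentence differs from the target. The `keep_annotated` function name and the sample sentences are illustrative, not from the repository:

```python
# Hypothetical sketch of the annotated-pair filtering step:
# keep only (src, trg) pairs where a correction was actually made,
# i.e. the source and target sentences differ.
def keep_annotated(pairs):
    """Return only the sentence pairs whose source differs from the target."""
    return [(src, trg) for (src, trg) in pairs if src != trg]

pairs = [
    ("He go to school .", "He goes to school ."),  # corrected -> kept
    ("I like apples .", "I like apples ."),        # unchanged -> dropped
]
print(len(keep_annotated(pairs)))  # prints 1
```

This would explain why `train.src`/`train.trg` end up smaller than the full extracted 2,210,277 pairs: unchanged (uncorrected) pairs are dropped before training.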
OK, thank you!
In the project, 2,210,277 sentence pairs (Lang-8 plus NUCLE) were used, but the paper says 1.3M sentence pairs were used for training. So I would like to know the exact sizes of the Lang-8 and NUCLE portions you used to train the models.