The parallel training data (NUCLE + Lang-8) is cleaned in prepare_data.sh (https://github.com/nusnlp/mlconvgec2018/blob/master/data/prepare_data.sh#L94) so that only non-empty sentence pairs are retained.
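For illustration, here is a minimal Python sketch of that kind of non-empty filtering (the file names are hypothetical placeholders; the repository does this step inside prepare_data.sh, not with this code):

```python
# Minimal sketch: drop sentence pairs where either side is empty,
# keeping the source and target files parallel.
# "train.src"/"train.trg" are hypothetical file names.
with open("train.src") as fsrc, open("train.trg") as ftrg, \
     open("train.clean.src", "w") as osrc, open("train.clean.trg", "w") as otrg:
    for src, trg in zip(fsrc, ftrg):
        if src.strip() and trg.strip():
            osrc.write(src)
            otrg.write(trg)
```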
In many <incorrect, correct> pairs of the Lang-8 data, there are additional comments. If we feed this to our models as is, I don't think it will be very useful. I guess the current data-cleaning scripts do not remove the additional comments provided along with the correct sentences. Is there any script that handles this problem?
No, the current pre-processing pipeline does not include any specific rules for removing additional comments. However, the clean-corpus-n.perl script (from the Moses SMT toolkit), which is used within the preprocess.sh script, removes source-target sentence pairs whose lengths differ substantially.
Thanks, I see. Removing source-target pairs where len(target) > 1.5 * len(source) rejects around 30% of the data. :(
I think the ratio used is 9, not 1.5. The script also removes sentences that are longer than 80 tokens or shorter than 1 token.
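For reference, a rough Python sketch of those constraints as described in this thread (the actual filtering is done by the Moses clean-corpus-n.perl script; the file names here are hypothetical):

```python
# Sketch of the length constraints discussed above: each side must have
# between 1 and 80 tokens, and the length ratio between the two sides
# must not exceed 9 in either direction.
MIN_LEN, MAX_LEN, MAX_RATIO = 1, 80, 9.0

def keep(src_line: str, trg_line: str) -> bool:
    ns, nt = len(src_line.split()), len(trg_line.split())
    if not (MIN_LEN <= ns <= MAX_LEN and MIN_LEN <= nt <= MAX_LEN):
        return False
    # Division is safe: both lengths are at least MIN_LEN (= 1) here.
    return max(ns, nt) / min(ns, nt) <= MAX_RATIO

with open("train.src") as fsrc, open("train.trg") as ftrg, \
     open("train.clean.src", "w") as osrc, open("train.clean.trg", "w") as otrg:
    for src, trg in zip(fsrc, ftrg):
        if keep(src, trg):
            osrc.write(src)
            otrg.write(trg)
```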
I saw that we have to remove empty target sentences from the NUCLE development data. Do we have to do the same for the NUCLE training data? Thank you very much.