The parallel training data (NUCLE + Lang-8) is cleaned in prepare_data.sh (https://github.com/nusnlp/mlconvgec2018/blob/master/data/prepare_data.sh#L94) so that only non-empty sentence pairs are retained.
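For illustration, here is a minimal Python sketch of that kind of non-empty filtering (the file names are hypothetical placeholders; the repository does this step inside prepare_data.sh, not with this code):

```python
# Minimal sketch: drop sentence pairs where either side is empty,
# keeping the source and target files parallel.
# "train.src"/"train.trg" are hypothetical file names.
with open("train.src") as fsrc, open("train.trg") as ftrg, \
     open("train.clean.src", "w") as osrc, open("train.clean.trg", "w") as otrg:
    for src, trg in zip(fsrc, ftrg):
        if src.strip() and trg.strip():
            osrc.write(src)
            otrg.write(trg)
```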
In many <incorrect, correct> pairs of the Lang-8 data, there are additional comments. If we feed this to our models as is, I don't think it will be very useful. I guess the current data-cleaning scripts do not remove the additional comments provided along with the correct sentences. Is there any script that handles this problem?
No, the current pre-processing pipeline does not include any specific rules for removing additional comments. However, the clean-corpus-n.perl script (from the Moses SMT toolkit), which is used within the preprocess.sh script, removes source-target sentence pairs whose lengths differ substantially.
Thanks, I see. Removing source-target pairs where len(target) > 1.5 * len(source) rejects around 30% of the data. :(
I think the ratio used is 9, not 1.5. The script also removes sentences that are longer than 80 tokens or shorter than 1 token.
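For reference, a rough Python sketch of those constraints as described in this thread (the actual filtering is done by the Moses clean-corpus-n.perl script; the file names here are hypothetical):

```python
# Sketch of the length constraints discussed above: each side must have
# between 1 and 80 tokens, and the length ratio between the two sides
# must not exceed 9 in either direction.
MIN_LEN, MAX_LEN, MAX_RATIO = 1, 80, 9.0

def keep(src_line: str, trg_line: str) -> bool:
    ns, nt = len(src_line.split()), len(trg_line.split())
    if not (MIN_LEN <= ns <= MAX_LEN and MIN_LEN <= nt <= MAX_LEN):
        return False
    # Division is safe: both lengths are at least MIN_LEN (= 1) here.
    return max(ns, nt) / min(ns, nt) <= MAX_RATIO

with open("train.src") as fsrc, open("train.trg") as ftrg, \
     open("train.clean.src", "w") as osrc, open("train.clean.trg", "w") as otrg:
    for src, trg in zip(fsrc, ftrg):
        if keep(src, trg):
            osrc.write(src)
            otrg.write(trg)
```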
I saw that we have to remove empty target sentences from the NUCLE development data. Do we have to do the same for the NUCLE training data? Thank you very much.