google-research-datasets / clang8

cLang-8 is a dataset for grammatical error correction.

The number of en types does not match the one written in the readme? #1

Closed by Nipi64310 3 years ago

Nipi64310 commented 3 years ago

Hello, thanks for sharing the data and scripts. After downloading the data and executing run.sh, I found only 729,014 pairs for the en type, whereas the README states that there are 2,372,119 pairs. Did I do something wrong?

ekQ commented 3 years ago

That seems odd, given that the numbers for the other languages are correct. Perhaps the En target file was only partially downloaded. Could you try cloning the repository again and counting the lines in the En target file to see whether you get the following:

$ wc -l targets/clang8_en.detokenized.tsv
2372119 targets/clang8_en.detokenized.tsv

Nipi64310 commented 3 years ago

Thanks for the reply. The number of lines is indeed incorrect: I had previously downloaded targets/clang8_en.detokenized.tsv with the wget command, and the file was probably incomplete due to network problems.
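
For anyone who hits the same mismatch, the line-count check suggested above can also be scripted. The following is only a minimal sketch, not part of the repository: the helper names `count_lines` and `is_complete` are hypothetical, and the expected count of 2,372,119 is the figure quoted in this thread.

```python
from pathlib import Path

# Expected line count for the English target file, as quoted in this thread.
EXPECTED_EN_LINES = 2_372_119


def count_lines(path):
    """Count newline characters in a file, the same quantity `wc -l` reports."""
    count = 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so a large file is never held in memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            count += chunk.count(b"\n")
    return count


def is_complete(path, expected=EXPECTED_EN_LINES):
    """Return True if the file exists and has exactly `expected` lines."""
    p = Path(path)
    return p.is_file() and count_lines(p) == expected


if __name__ == "__main__":
    # Path as used in the `wc -l` command above, relative to the repository root.
    target = "targets/clang8_en.detokenized.tsv"
    if Path(target).is_file():
        status = "complete" if is_complete(target) else "INCOMPLETE"
        print(target, count_lines(target), status)
    else:
        print(target, "not found")
```

Running this from the repository root before run.sh would flag a truncated download early. If the count is lower than expected, downloading the file again (or cloning the repository again, as suggested above) is the likely fix.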