Closed Nipi64310 closed 3 years ago
That seems weird given that the numbers for the other languages are correct, but perhaps the En target file was only partially downloaded. Could you try cloning the repository again and counting the lines in the En target file to see if you get the following:
$ wc -l targets/clang8_en.detokenized.tsv
2372119 targets/clang8_en.detokenized.tsv
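If you want to script that check, here is a minimal sketch. The function name `check_line_count` is hypothetical and not part of the repository; it only wraps `wc -l` and compares the result with a count you pass in.

```shell
# Hypothetical helper (not part of the repository): compare a file's
# line count against an expected value and report any mismatch.
check_line_count() {
  file="$1"
  expected="$2"

  # Bail out early if the file is missing or unreadable.
  if [ ! -r "$file" ]; then
    echo "ERROR: cannot read $file" >&2
    return 2
  fi

  # Read from stdin so wc prints only the number, without the file name;
  # tr strips the padding that some wc implementations add.
  actual=$(wc -l < "$file" | tr -d '[:space:]')

  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $file has $actual lines"
  else
    echo "MISMATCH: $file has $actual lines, expected $expected" >&2
    return 1
  fi
}
```

With the expected count from the `wc -l` output above, the call would be `check_line_count targets/clang8_en.detokenized.tsv 2372119`. A non-zero exit status would suggest the file is truncated and should be fetched again.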
Thanks for the reply. The number of lines is indeed incorrect: I had previously downloaded targets/clang8_en.detokenized.tsv with the wget command, and the file may be incomplete because of network problems.
Hello, thanks for sharing the data and scripts. After downloading the data and executing run.sh, I found only 729,014 pairs for En, while the README says there should be 2,372,119 pairs. Did I do something wrong?