google-research-datasets / clang8

cLang-8 is a dataset for grammatical error correction.

The size of the datasets #9

Open DarlingJOJO opened 2 years ago

DarlingJOJO commented 2 years ago

I'd like to know why cLang-8 is larger than the original Lang-8. cLang-8 contains 2372119 English sentence pairs, while Lang-8 contains only 1037561 English sentence pairs.

ashokrajab commented 2 years ago

I was wondering the same. If the authors of clang8 could clarify this, it would be really helpful.

cc @ekQ

ekQ commented 2 years ago

We use the raw Lang-8 dataset with 237,843 English entries (each consisting of multiple sentences). The dataset with 1,037,561 English sentence pairs that you're referring to probably corresponds to the cleaned English v1.0 corpus, which has only 100,051 entries.
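
As a rough sanity check on the figures quoted in this thread, the sketch below divides each sentence-pair count by the matching entry count. It assumes that the 2,372,119 cLang-8 pairs come from all 237,843 raw entries, which the thread does not state explicitly; the function name `pairs_per_entry` is made up for illustration.

```python
def pairs_per_entry(sentence_pairs: int, entries: int) -> float:
    """Average number of sentence pairs contributed by one Lang-8 entry."""
    return sentence_pairs / entries


# Counts quoted in this thread.
# Assumption: every raw entry contributes to the cLang-8 pair count.
clang8_ratio = pairs_per_entry(2_372_119, 237_843)   # cLang-8, built from raw Lang-8
cleaned_ratio = pairs_per_entry(1_037_561, 100_051)  # cleaned English v1.0 corpus

print(f"cLang-8:       {clang8_ratio:.2f} sentence pairs per entry")
print(f"cleaned v1.0:  {cleaned_ratio:.2f} sentence pairs per entry")
print(f"entry ratio:   {237_843 / 100_051:.2f}x more entries in the raw dataset")
```

Both corpora come out at roughly ten sentence pairs per entry, so the size difference is consistent with the raw dataset simply having more than twice as many entries.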