A dataset of atomic Wikipedia edits, containing insertions and deletions of a contiguous chunk of text within a sentence. The dataset contains ~43 million edits across 8 languages.
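For anyone who wants to inspect the data quickly, here is a minimal sketch using the Hugging Face `datasets` library. The dataset identifier, config name, and field names below are assumptions based on the description above, so please check the dataset card for the exact names.

```python
from datasets import load_dataset

# "wiki_atomic_edits" and the config name "english_insertions" are assumed;
# other language/edit-type combinations should follow a similar naming pattern.
edits = load_dataset("wiki_atomic_edits", "english_insertions", split="train")

# Each record is assumed to pair a base sentence with the inserted phrase
# and the resulting edited sentence.
example = edits[0]
print(example["base_sentence"])
print(example["phrase"])
print(example["edited_sentence"])
```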
Hi Manaal/Ellie,
Thanks for this dataset! I was wondering if you could shed some more light on the pre-processing tools used to generate it. You mention Gillick (2009), but that system appears to be English-only. In particular, for other languages such as Chinese, which word segmenter and sentence segmenter were used?
Thanks!
Ajay