google-research-datasets / wiki-atomic-edits

A dataset of atomic Wikipedia edits, each consisting of the insertion or deletion of a contiguous chunk of text in a sentence. The dataset contains ~43 million edits across 8 languages.

Regarding pre-processing tools used #4

Closed ajaynagesh closed 5 years ago

ajaynagesh commented 5 years ago

Hi Manaal/Ellie,

Thanks for this dataset! I was wondering if you could shed some more light on the pre-processing tools used to generate it (you mention Gillick, 2009, but that system appears to be English-only), especially for other languages such as Chinese. What word segmenter and sentence segmenter were used here?

Thanks! Ajay

mfaruqui commented 5 years ago

Hi Ajay,

All the tools used here were internal Google tools, which unfortunately cannot be released.

Manaal