google-research-datasets / wiki-atomic-edits

A dataset of atomic Wikipedia edits, each an insertion or deletion of a contiguous chunk of text in a sentence. The dataset contains ~43 million edits across 8 languages.

Extracting differences between revisions from Wikipedia dumps #6

slemonide closed this issue 4 years ago

slemonide commented 4 years ago

Dear Manaal and Ellie,

Could you please explain how the differences between revisions are computed? I have been trying to do this myself from the Wikipedia dumps (for example, http://dumps.wikimedia.your.org/enwiki/20191101/enwiki-20191101-pages-meta-history1.xml-p10p1042.bz2), and I noticed that some revisions contain the whole text of an article in tags, while others have only part of it.

It would be great if you could share the code used to do that work.

Thank you very much! Danil

mfaruqui commented 4 years ago

Hi,

We compared snapshots in Google's internally processed Wikipedia format, so we do not have code that works on the public dump format, and we did not run into the problem you describe.

Manaal
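
For readers working from the public dumps, here is a minimal sketch of one possible approach. It is not the code used to build the dataset, which was never published. It assumes the dump follows the usual MediaWiki export layout (`page` > `title`, `revision` > `text`), and it treats an "atomic edit" as a single contiguous run of inserted or deleted whitespace-separated tokens. The function names `iter_revisions` and `atomic_edit` are hypothetical, and sentence splitting and wiki-markup stripping are left out.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from typing import BinaryIO, Iterator, Optional, Tuple


def iter_revisions(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (page_title, revision_text) pairs from a dump-like XML stream.

    Assumption: elements are named title/revision/text, possibly namespaced.
    """
    title = ""
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            yield title, elem.text or ""
        elif name == "revision":
            elem.clear()  # free the revision once its text has been yielded


def atomic_edit(old: str, new: str) -> Optional[Tuple[str, str]]:
    """Return ("insert" | "delete", phrase) if old and new differ by exactly
    one contiguous run of tokens; otherwise return None."""
    a, b = old.split(), new.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    ops = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if len(ops) != 1:
        return None  # identical texts, or more than one changed region
    tag, i1, i2, j1, j2 = ops[0]
    if tag == "insert":
        return ("insert", " ".join(b[j1:j2]))
    if tag == "delete":
        return ("delete", " ".join(a[i1:i2]))
    return None  # a "replace" is neither a pure insertion nor a pure deletion
```

To run this on a real file, the stream could be opened with `bz2.open(path, "rb")` and consecutive revisions of the same page compared sentence by sentence with `atomic_edit`. Whether that reproduces the published dataset is unknown, since the authors worked from a different input format.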