A dataset of atomic Wikipedia edits, containing insertions and deletions of a contiguous chunk of text in a sentence. The dataset contains ~43 million edits across 8 languages.
Extracting differences between revisions from wikipedia dumps #6
Since we compared snapshots in Google's internal processed Wikipedia format, we do not have code that works on the public dump format, and we did not encounter your problem.
Dear Manaal and Ellie,
Can you please explain how the differences between revisions are computed? I have been trying to do that myself from Wikipedia dumps (for example, for http://dumps.wikimedia.your.org/enwiki/20191101/enwiki-20191101-pages-meta-history1.xml-p10p1042.bz2), and I noticed that some revisions contain the whole text of an article inside their tags, while others only have part of it.
It would be great if you could share the piece of code used to do that work.
Thank you very much!

Danil
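The authors' diffing code is not public, but the kind of atomic-edit extraction asked about here can be sketched with Python's standard difflib. This is a minimal illustration, not the dataset's actual pipeline: the function name `atomic_edits` and the sample sentences are made up, and real use would first need the revision texts pulled out of the dump XML.

```python
import difflib

def atomic_edits(old: str, new: str):
    """Return contiguous inserted/deleted word spans between two revision texts.

    A sketch only: the published dataset keeps edits that purely insert or
    purely delete one contiguous chunk, so 'replace' and 'equal' spans are
    skipped here.
    """
    old_words = old.split()
    new_words = new.split()
    edits = []
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "delete":
            edits.append(("delete", " ".join(old_words[i1:i2])))
        elif op == "insert":
            edits.append(("insert", " ".join(new_words[j1:j2])))
    return edits

print(atomic_edits("the cat sat on the mat",
                   "the cat quietly sat on the mat"))
# → [('insert', 'quietly')]
```

Revisions in the pages-meta-history dumps each carry the full article text at that point in time, so a diff between consecutive revisions of the same page is what yields the edits.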