I just successfully converted the English dump from 2011-05-26 using
JWPL_DATAMACHINE_0.6.0.jar with options english Categories Disambiguation_pages.
Two things to consider:
(i) always check the checksum after downloading a dump to avoid working with
corrupt files
(ii) the line "Discussions are available" in your output indicates that you
are using an unnecessarily large dump (with unknown side effects). Make sure you
use the file "pages-articles.xml.bz2" rather than one of the larger dumps,
unless you actually need the extra content they contain.
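
As a rough illustration of point (i), the checksum comparison could be scripted along these lines. This is only a sketch: it assumes the dump directory offers an md5sum-style listing (one "<hex digest>  <file name>" pair per line, for example a file such as enwiki-20110526-md5sums.txt), and the function names are made up for this example.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so a multi-gigabyte dump never sits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(checksum_text, filename):
    """Look up filename in md5sum-style text; return None if it is not listed.

    Assumed line format: '<hex digest>  <file name>'.
    """
    for line in checksum_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("*") == filename:
            return parts[0].lower()
    return None


def is_intact(path, checksum_text, filename):
    """True only if the local file's MD5 equals the published one."""
    expected = expected_md5(checksum_text, filename)
    return expected is not None and md5_of_file(path) == expected
```

Calling is_intact("enwiki-20110526-pages-articles.xml.bz2", listing, "enwiki-20110526-pages-articles.xml.bz2"), with listing holding the text of the checksum file, would return False for a truncated or corrupted download.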
-Torsten
Original comment by torsten....@gmail.com
on 28 Jun 2011 at 12:17
Thanks, Torsten, for the quick reply.
I just want to verify the steps with you. Here is what I did:
I downloaded the files below from http://dumps.wikimedia.org/enwiki/20110526/ as
mentioned in http://code.google.com/p/jwpl/wiki/DataMachine:
enwiki-20110526-pages-articles.xml.bz2
enwiki-20110526-categorylinks.sql.gz
enwiki-20110526-pagelinks.sql.gz
The files downloaded without any network interruption, but their sizes did
not exactly match the sizes listed on the download page.
I verified the checksums with the md5sum command available in Ubuntu 10.10, and
they do not match. I downloaded the files twice (deleting the first copy once
the second had finished) and got the same mismatch both times, so I wonder how
it can go wrong both times...
Thanks
Original comment by ambha.ca...@gmail.com
on 28 Jun 2011 at 6:36
How are you downloading the files?
Try using wget. It's less likely to produce corrupt files.
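
One reason a command-line downloader helps is that an interrupted transfer can be continued instead of silently leaving a short file (wget does this with its -c option). The same idea is sketched below in Python, in case wget is not at hand. It assumes the server honours HTTP Range requests, and the function names are hypothetical.

```python
import os
import shutil
import urllib.request


def resume_headers(path):
    """Return a Range header asking only for the bytes not yet on disk."""
    have = os.path.getsize(path) if os.path.exists(path) else 0
    return {"Range": f"bytes={have}-"} if have else {}


def download(url, path):
    """Fetch url into path, appending to a partial file if one exists.

    If the local file is already complete, the server will normally answer
    416 and urlopen raises urllib.error.HTTPError.
    """
    request = urllib.request.Request(url, headers=resume_headers(path))
    with urllib.request.urlopen(request) as response:
        # 206 means the Range request was honoured; any other success
        # status carries the full body, so the file is started over.
        mode = "ab" if response.status == 206 else "wb"
        with open(path, mode) as out:
            shutil.copyfileobj(response, out)
```

Whichever tool is used, the checksum should still be compared afterwards, since resuming only guards against truncation.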
-Oliver
Original comment by oliver.ferschke
on 28 Jun 2011 at 10:45
Original issue reported on code.google.com by
ambha.ca...@gmail.com
on 27 Jun 2011 at 7:47