Closed: nick-magnini closed this issue 8 years ago.
First and foremost, make sure that the SHA1 sum of the bz2 file you downloaded is exactly the same as the one on the corresponding page for your wikidump date: http://dumps.wikimedia.org/enwiki/
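If it helps to check it from the JVM side as well, here is a minimal, hypothetical Scala sketch (plain JDK, streaming, so the multi-gigabyte file never has to fit in memory; the object name is made up) that prints the SHA-1 for comparison against the published sha1sums file:

```scala
import java.io.{BufferedInputStream, FileInputStream}
import java.security.MessageDigest

// Hypothetical helper: streams the file through SHA-1 and prints the hex
// digest, for comparison with the entry in the enwiki-*-sha1sums.txt file.
object Sha1Sum {
  def main(args: Array[String]): Unit = {
    val md = MessageDigest.getInstance("SHA-1")
    val in = new BufferedInputStream(new FileInputStream(args(0)))
    val buf = new Array[Byte](1 << 16)
    try {
      var n = in.read(buf)
      while (n != -1) { md.update(buf, 0, n); n = in.read(buf) }
    } finally in.close()
    println(md.digest().map("%02x".format(_)).mkString)
  }
}
```

Running it as `scala Sha1Sum enwiki-latest-pages-articles-multistream.xml.bz2` should print the same hex string that shasum does.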
My wikidump is the latest one: enwiki-latest-pages-articles-multistream.xml.bz2
I also tried to ignore the error using scala.util._, to swallow the IO error and continue: Try { parser.parse() }
But even so, the maximum number of lines I get as output is 2029599.
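(For reference, assuming `parser` is the XMLDumpParser instance from the stack trace at the bottom of this thread, the Try wrapper looks like the sketch below. It can only observe the failure after the underlying bzip2 stream has aborted; it cannot resume past the corrupt block, which is why the output still stops at the same page.)

```scala
import scala.util.{Failure, Success, Try}

// Wrapping the whole parse in Try ends the run gracefully once the stream
// throws, but it cannot skip the corrupt block and keep reading.
Try(parser.parse()) match {
  case Success(_)  => println("parse finished")
  case Failure(ex) => println(s"parse aborted: ${ex.getMessage}")
}
```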
Can you run shasum enwiki-latest-pages-articles-multistream.xml.bz2 and post the output here?
34cd7e3e5bb9b869d21f198c12fd0ce50288a51b
It agrees with the sum in https://dumps.wikimedia.org/enwiki/20151201/enwiki-20151201-sha1sums.txt
So where does the problem come from? How can I skip the line that stops the process?
I'm downloading that wiki at the moment. It will take a while since the wiki servers seem to be throttling the speed.
Just to rule things out, can you actually decompress the file successfully from the terminal?
I'm also decompressing it. I'm decompressing it to the terminal with: bzip2 -dc enwiki-latest-pages-articles-multistream.xml.bz2
I think the problem is the enwiki-latest-pages-articles-multistream.xml.bz2 file itself. Decompressing it gave me this error after about 10 GB:

    bzip2: Data integrity error when decompressing.
        Input file = enwiki-latest-pages-articles-multistream.xml.bz2, output file = enwiki-latest-pages-articles-multistream.xml

    It is possible that the compressed file(s) have become corrupted.
    You can use the -tvv option to test integrity of such files.
    You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.
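To narrow down where it dies from the JVM side as well, here is a rough Scala sketch (the object name is hypothetical; it assumes commons-compress on the classpath, the same library that appears in the stack trace at the bottom of this thread) that stream-decompresses the dump and reports how far it gets before the block overrun:

```scala
import java.io.{BufferedInputStream, FileInputStream, IOException}
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream

// Hypothetical probe: stream-decompress the dump and report how many bytes
// come out before the IOException ("block overrun") hits.
object Bzip2Probe {
  def main(args: Array[String]): Unit = {
    val in = new BZip2CompressorInputStream(
      new BufferedInputStream(new FileInputStream(args(0))),
      true) // true = keep reading across the concatenated multistream blocks
    val buf = new Array[Byte](1 << 16)
    var total = 0L
    try {
      var n = in.read(buf)
      while (n != -1) { total += n; n = in.read(buf) }
      println(s"OK: decompressed $total bytes")
    } catch {
      case e: IOException =>
        println(s"Failed after $total decompressed bytes: ${e.getMessage}")
    } finally in.close()
  }
}
```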
:( Well worth notifying the wiki dump team.
What you could do at the moment is try one of the dated dumps instead of the latest one.
I'm downloading "https://dumps.wikimedia.org/enwiki/20151201/enwiki-20151201-pages-articles-multistream.xml.bz2" to see if the same problem happened. I'll let you know.
Meanwhile, if you change the class to a non-streaming reader, let me know.
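(One reading of "non-streaming reader": decompress first with bzip2 -d or bzip2recover, then point the parser at the plain XML rather than the compressed stream. A rough Scala sketch under that assumption, with a no-op SAX handler standing in for the project's real page handler:)

```scala
import java.io.{BufferedInputStream, FileInputStream}
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.helpers.DefaultHandler

// Hypothetical sketch: parse the already-decompressed XML directly,
// bypassing bzip2 entirely. DefaultHandler does nothing useful here;
// swap in the real page handler.
object ParsePlainXml {
  def main(args: Array[String]): Unit = {
    val in = new BufferedInputStream(
      new FileInputStream("enwiki-latest-pages-articles-multistream.xml"))
    val parser = SAXParserFactory.newInstance().newSAXParser()
    try parser.parse(in, new DefaultHandler)
    finally in.close()
  }
}
```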
The same error at the same place happened with https://dumps.wikimedia.org/enwiki/20151201/enwiki-20151201-pages-articles-multistream.xml.bz2
But can you decompress it correctly?
From the Wikipedia dump mailing list, it seems that the dumping process changed slightly. I wonder if this is a bug. I'm downloading that other one to check whether it is also corrupted.
@nick-magnini any feedback on trying to decompress the second one?
Decompressing it with bzip2 failed at some point. I'm trying bzip2recover to see how that works.
Sent a message to: https://lists.wikimedia.org/pipermail/xmldatadumps-l/2015-December/thread.html
Given the current messages on that list, it seems that some of the code generating the XML dumps changed. My message is still waiting to be accepted.
Multistream? Can you please let me know which one worked for you? Which one could you run with no error, so that I can try to download and run over that one? Thanks.
@nick-magnini You can always try downloading the model available via torrent, which was generated from a dump from earlier this year.
You could also try running it with this September dump: https://dumps.wikimedia.org/enwiki/20150901/enwiki-20150901-pages-articles-multistream.xml.bz2
It would also be great if you could release the processed version of Wikipedia.
@nick-magnini any luck ?
For me it worked fine with https://dumps.wikimedia.org/enwiki/20150901/enwiki-20150901-pages-articles-multistream.xml.bz2
Closing this then.
Hi,
After page 2029599, I get this error:

    ....... 2029599
    Exception in thread "main" java.io.IOException: block overrun
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.getAndMoveToFrontDecode(BZip2CompressorInputStream.java:700)
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:326)
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:884)
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:933)
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:228)
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:179)
        at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
        at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
        at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
        at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Can you figure out the possible cause of the error?