idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Error in parsing wikipedia #11

Closed nick-magnini closed 8 years ago

nick-magnini commented 8 years ago

Hi,

After page 2029599, I get this error:

    ....... 2029599
    Exception in thread "main" java.io.IOException: block overrun
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.getAndMoveToFrontDecode(BZip2CompressorInputStream.java:700)
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:326)
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:884)
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:933)
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:228)
        at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:179)
        at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
        at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
        at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
        at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)

Can you figure out what the possible cause of this error is?
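The stack trace shows the failure inside Commons Compress's bzip2 reader rather than in the XML handling, so one way to narrow it down is to read the dump straight through the same decompressor, without any parsing. A minimal sketch, assuming commons-compress is on the classpath; CheckBz2 is just an illustrative name, not part of wiki2vec:

    import java.io.{BufferedInputStream, FileInputStream}
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream

    // Reads the whole dump through the same decompressor the parser uses.
    // A corrupt archive throws the same "block overrun" IOException here,
    // which rules the XML parsing in or out as the culprit.
    object CheckBz2 {
      def main(args: Array[String]): Unit = {
        val in = new BZip2CompressorInputStream(
          new BufferedInputStream(new FileInputStream(args(0))),
          true) // multistream dumps are concatenated bz2 blocks
        val buffer = new Array[Byte](1 << 20)
        var total = 0L
        var read = in.read(buffer)
        while (read != -1) {
          total += read
          read = in.read(buffer)
        }
        in.close()
        println(s"decompressed $total bytes without error")
      }
    }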

keynmol commented 8 years ago

First and foremost, make sure that the SHA1 sum of the bz2 file you downloaded is exactly the same as the one listed on the corresponding page for your wiki dump date: http://dumps.wikimedia.org/enwiki/
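If shasum is not available, the same digest can be computed with the JDK alone. A minimal sketch; Sha1Check is a hypothetical helper, not part of wiki2vec:

    import java.io.FileInputStream
    import java.security.MessageDigest

    // Streams the dump through SHA-1 and prints the hex digest, which should
    // match the entry for the same file in enwiki-<date>-sha1sums.txt.
    object Sha1Check {
      def main(args: Array[String]): Unit = {
        val digest = MessageDigest.getInstance("SHA-1")
        val in = new FileInputStream(args(0))
        val buffer = new Array[Byte](1 << 20)
        var read = in.read(buffer)
        while (read != -1) {
          digest.update(buffer, 0, read)
          read = in.read(buffer)
        }
        in.close()
        println(digest.digest().map(b => f"${b & 0xff}%02x").mkString)
      }
    }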

nick-magnini commented 8 years ago

My wiki dump is the latest one: enwiki-latest-pages-articles-multistream.xml.bz2

I also tried to ignore the error using scala.util._ to swallow the IO error and continue: Try { parser.parse() }

but the maximum number of lines I get as output is still 2029599.
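Wrapping the whole parse in Try only catches the exception after the compressed stream has already failed, so nothing past the corrupt block can be read. A rough sketch of why, using plain SAX over the same kind of stream rather than the project's XMLDumpParser:

    import java.io.{BufferedInputStream, FileInputStream}
    import javax.xml.parsers.SAXParserFactory
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
    import org.xml.sax.helpers.DefaultHandler
    import scala.util.{Failure, Success, Try}

    // The SAX parser pulls from a single BZip2CompressorInputStream, so the
    // "block overrun" IOException aborts the whole parse; Try only prevents
    // the crash, it cannot resume the stream past the damaged block.
    object TryParse {
      def main(args: Array[String]): Unit = {
        val stream = new BZip2CompressorInputStream(
          new BufferedInputStream(new FileInputStream(args(0))), true)
        val saxParser = SAXParserFactory.newInstance().newSAXParser()
        Try(saxParser.parse(stream, new DefaultHandler)) match {
          case Success(_)  => println("parsed the whole dump")
          case Failure(ex) => println(s"parse aborted, cannot continue: ${ex.getMessage}")
        }
      }
    }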

dav009 commented 8 years ago

can you run shasum enwiki-latest-pages-articles-multistream.xml.bz2 and post the output here?

nick-magnini commented 8 years ago

34cd7e3e5bb9b869d21f198c12fd0ce50288a51b

dav009 commented 8 years ago

It matches the sum in: https://dumps.wikimedia.org/enwiki/20151201/enwiki-20151201-sha1sums.txt

nick-magnini commented 8 years ago

So where does the problem come from? How can I ignore and skip the line that stops the process?

dav009 commented 8 years ago

I'm downloading that wiki dump at the moment. It will take a while, since the Wikimedia servers seem to be throttling the download speed.

Just to rule out things, can you actually successfully decompress the file from the terminal?

nick-magnini commented 8 years ago

I'm also decompressing it. I can stream it to the terminal with: bzip2 -dc enwiki-latest-pages-articles-multistream.xml.bz2

nick-magnini commented 8 years ago

I think the problem is the enwiki-latest-pages-articles-multistream.xml.bz2 file itself. Decompressing it gave me this error after about 10 GB:

    bzip2: Data integrity error when decompressing.
        Input file = enwiki-latest-pages-articles-multistream.xml.bz2, output file = enwiki-latest-pages-articles-multistream.xml
    It is possible that the compressed file(s) have become corrupted.
    You can use the -tvv option to test integrity of such files.
    You can use the `bzip2recover' program to attempt to recover
    data from undamaged sections of corrupted files.

dav009 commented 8 years ago

:( It's well worth notifying the Wikimedia dump team.

What you could do at the moment is:

nick-magnini commented 8 years ago

I'm downloading "https://dumps.wikimedia.org/enwiki/20151201/enwiki-20151201-pages-articles-multistream.xml.bz2" to see if the same problem happened. I'll let you know.

Meanwhile, if you change the class to a non-streaming reader, let me know.

nick-magnini commented 8 years ago

The same error happened at the same place with: https://dumps.wikimedia.org/enwiki/20151201/enwiki-20151201-pages-articles-multistream.xml.bz2

dav009 commented 8 years ago

but you can decompress it correctly?

dav009 commented 8 years ago

From the Wikipedia dump mailing list, it seems that the dumping process changed slightly. I wonder if this is a bug. I'm downloading that other one to check whether it is also corrupted.

dav009 commented 8 years ago

@nick-magnini any feedback on trying to decompress the second one?

nick-magnini commented 8 years ago

Decompressing it with bzip2 failed at some point. I'm trying to use bzip2recover to see how that works.

dav009 commented 8 years ago

Sent a message to: https://lists.wikimedia.org/pipermail/xmldatadumps-l/2015-December/thread.html

Given the current messages on that list, it seems that some of the code generating the XML dumps has changed. My message is still awaiting approval.

nick-magnini commented 8 years ago

Can you please let me know which one worked for you? Which one could you run with no errors, so I can download and run on that one? Thanks.

dav009 commented 8 years ago

@nick-magnini You can always try downloading the model available via torrent, which was generated from a dump from earlier this year.

You could probably try running it with this September dump: https://dumps.wikimedia.org/enwiki/20150901/enwiki-20150901-pages-articles-multistream.xml.bz2

nick-magnini commented 8 years ago

It would also be great if you could release the processed version of Wikipedia.

dav009 commented 8 years ago

@nick-magnini any luck?

nick-magnini commented 8 years ago

It went fine for me with https://dumps.wikimedia.org/enwiki/20150901/enwiki-20150901-pages-articles-multistream.xml.bz2

keynmol commented 8 years ago

Closing this then.