infolab-csail / wikipedia-mirror

Makefiles that will download and set up a local Wikipedia instance.

ArrayIndexOutOfBoundsException in make sql-dump-parts #18

Open alvaromorales opened 9 years ago

alvaromorales commented 9 years ago

I'm installing the 2015-06-02 dumps. I got an error in the make sql-dump-parts step. Parts 1-26 completed successfully, but the 27th file did not. I'm opening an issue, as instructed by the error output below.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
The errored article is     <title>File:Trinity   <page> ( /scratch/wikipedia-mirror/drafts/errored_articles ). Fixing... (time: Wed Jun 24 23:24:31 EDT 2015)
Will remove article '    <title>File:Trinity   <page>' from file /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml (size: 43533083690)
        Method: (blank is ok)
        search term: <title>    <title>File:Trinity   <page></title>
        file: /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml
        title offset:
Found '' Grep-ing (cat  /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml | grep -b -m 1 -F "<title>    <title>File:Trinity   <page></title>" | cut -d: -f1)
XML parse script failed. This is serous. report this at
        http://github.com/fakedrake/wikipedia-mirror/issues
fakedrake commented 9 years ago

Looks like the xml file is malformed or misread. Could you post the output of

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.xml

and also the contents of /scratch/wikipedia-mirror/drafts/errored_articles

EDIT: also the outputs of

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

and

tail -40 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

Might be useful.

alvaromorales commented 9 years ago

Thanks for following up. I've included the output you requested. The output is pretty noisy; let me know if you want me to be more specific with grep.

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015

https://paste.ee/r/uQtk9

the contents of /scratch/wikipedia-mirror/drafts/errored_articles

Ronald J. Rabago
Grażyna (poem)
Wikipedia:WikiProject Spam/LinkReports/firmenpresse.de
Wikipedia:WikiProject Spam/LinkReports/9sky.com
Wikipedia:Reference desk/Archives/Humanities/2011 December 30
    <title>File:Trinity   <page>
    <title>File:Trinity   <page>
    <title>File:Trinity   <page>
Wikipedia:WikiProject Spam/LinkReports/9sky.com
Wikipedia:Reference desk/Archives/Humanities/2011 December 30

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

https://paste.ee/r/R1TZH

tail -40 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

https://paste.ee/r/XbVso

fakedrake commented 9 years ago

Just so this is documented and you are not completely in the dark:

Due to a bug(?) in mwdumper, when feeding it the XML and expecting SQL output, Xerces (the XML parser) sometimes throws the exception that you saw, namely:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

I found that removing the offending <page>...</page> entry fixes the problem at the cost of missing a page.

The way I went about implementing this is: the downloaded XML gets parsed into another XML file by mwdumper. If this process fails, we look backwards in the output XML for a title tag, remove the relevant page, and try again until mwdumper is happy. mwdumper then parses the "correct" XML into SQL. All articles removed this way are logged in drafts/errored_articles.
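
In shell terms the loop is roughly the following (a sketch only, not the actual Makefile rules; the part file name and the remove_page helper are stand-ins):

    # Sketch of the retry loop; remove_page and the file names are placeholders.
    part=wikipedia-parts/enwiki-part.xml    # downloaded dump part
    fix="$part.fix.xml"                     # mwdumper's re-emitted XML
    until java -jar mwdumper.jar --format=xml "$part" > "$fix"; do
        # mwdumper died mid-article, so the last <title> it wrote out is the culprit
        title=$(grep -o '<title>[^<]*</title>' "$fix" | tail -1)
        echo "$title" >> drafts/errored_articles
        remove_page "$title" "$part"        # drop that <page>...</page> block, then retry
    done
    # Once mwdumper accepts the XML, convert it to SQL.
    java -jar mwdumper.jar --format=sql:1.5 "$part" > "$part.sql"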

The problem is almost definitely with my code; I will take a look at it shortly.

dldharma commented 8 years ago

We have encountered the same error. Please do advise if there is a fix in place. Regards, D

fakedrake commented 8 years ago

@dldharma just so I don't have to download everything from scratch, do you have it on an infolab machine?

dldharma commented 8 years ago

Unfortunately, not on infolab machine. Have it on our cloud server on AWS.

michaelsilver commented 8 years ago

@dldharma this project is now mostly defunct -- we had a lot of problems setting up a full mirror of Wikipedia. Check out WikipediaBase (https://github.com/infolab-csail/WikipediaBase), a virtual database that combines local data obtained from the Wikipedia dumps with data fetched live from the Wikipedia API.

dldharma commented 8 years ago

Thanks a lot for your prompt reply, appreciated. WikipediaBase looks great and I'm trying it right now. START is an amazing initiative - I tried it and it works great!

Regards, Dileep

dldharma commented 8 years ago

@michaelsilver the import process completed successfully. It populated the Articles, Classes and Article-Classes mappings. Thanks once again for sharing WikipediaBase.

On reviewing the data, I found that article categories are present only in Article.markup. To my surprise, the category tables (parent and child category relations) and the category-to-article relation are completely missing. I did spike the code and will need to enhance the mechanism to support this. Any ideas or alternatives for the category challenge?
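
As a stopgap, the raw category links can at least be pulled out of the stored markup. A rough sketch, assuming Article.markup for one page has been exported to a text file (the file name is just a placeholder):

    # List the [[Category:...]] links embedded in an article's wikitext.
    # article_markup.txt stands in for an export of Article.markup for one page.
    grep -o '\[\[Category:[^]|]*' article_markup.txt | sed 's/^\[\[Category://'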

Regards, Dileep

michaelsilver commented 8 years ago

@dldharma why don't you make an issue in WikipediaBase and we can discuss further there. When you create the issue, please provide a printout of the tables you have populated (\d in postgres) and describe what you mean by "article categories". By category, do you mean which type of infobox the article has?

dldharma commented 8 years ago

@michaelsilver agreed. Created issue 277. Thanks once again for your prompt replies, appreciated!