alvaromorales opened this issue 9 years ago
I'm installing the 2015-06-02 dumps. I got an error in the
make sql-dump-parts
step. Parts 1-26 completed successfully, but the 27th file did not. I'm opening an issue, as instructed.
Looks like the XML file is malformed or misread. Could you post the output of
grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.xml
and also the contents of /scratch/wikipedia-mirror/drafts/errored_articles
EDIT: also the outputs of
grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml
and
tail -40 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml
Might be useful.
Thanks for following up. I've included the output you requested. The output is pretty noisy; let me know if you want me to be more specific with grep.
grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015
the contents of /scratch/wikipedia-mirror/drafts/errored_articles
Ronald J. Rabago
Grażyna (poem)
Wikipedia:WikiProject Spam/LinkReports/firmenpresse.de
Wikipedia:WikiProject Spam/LinkReports/9sky.com
Wikipedia:Reference desk/Archives/Humanities/2011 December 30
<title>File:Trinity <page>
<title>File:Trinity <page>
<title>File:Trinity <page>
Wikipedia:WikiProject Spam/LinkReports/9sky.com
Wikipedia:Reference desk/Archives/Humanities/2011 December 30
grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml
tail -40 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml
Just so this is documented and you are not completely in the dark:
Due to a (possible) bug in mwdumper, when we feed it the XML expecting SQL output, Xerces (the XML parser) sometimes throws the exception you saw, namely:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
I found that removing the offending <page>...</page> entry fixes the problem, at the cost of losing that page.
The way I implemented this: the downloaded XML gets parsed into another XML file by mwdumper. If this process fails, we look backwards in the output XML for a title tag, remove the relevant page from the input, and try again until mwdumper is happy. Then mwdumper parses the "corrected" XML into SQL. All articles removed this way are logged to drafts/errored_articles.
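A minimal sketch of that retry loop in Python, assuming an mwdumper.jar invocation like the one below and in-place rewriting of the part file (illustrative only, not the actual wikipedia-mirror code):

```python
import re
import subprocess

def strip_page(xml_text, title):
    # Drop the first <page>...</page> block whose <title> matches `title`.
    # The tempered (?:(?!</page>).)*? keeps the match inside a single page.
    pattern = re.compile(
        r"<page>(?:(?!</page>).)*?<title>" + re.escape(title)
        + r"</title>.*?</page>",
        re.DOTALL)
    return pattern.sub("", xml_text, count=1)

def convert_with_retries(src_xml, out_xml,
                         errored_log="drafts/errored_articles"):
    # Re-run mwdumper until it accepts the input, dropping one offending
    # page per failure (hence the missing-page cost mentioned above).
    while True:
        with open(out_xml, "w") as out:
            rc = subprocess.call(
                ["java", "-jar", "mwdumper.jar", "--format=xml", src_xml],
                stdout=out)
        if rc == 0:
            return  # mwdumper is happy
        # Look backwards in the partial output for the last title the
        # parser got through; that is the page it choked on.
        with open(out_xml) as f:
            titles = re.findall(r"<title>(.*?)</title>", f.read())
        bad_title = titles[-1]
        with open(errored_log, "a") as log:
            log.write(bad_title + "\n")
        # A real script would stream rather than slurp a multi-GB file.
        with open(src_xml) as f:
            cleaned = strip_page(f.read(), bad_title)
        with open(src_xml, "w") as f:
            f.write(cleaned)
```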
The problem is almost definitely in my code; I will take a look at it shortly.
We have encountered the same error. Please do advise if there is a fix in place. Regards, D
@dldharma just so I don't have to download everything from scratch, do you have it on an infolab machine?
Unfortunately, it is not on an infolab machine. We have it on our cloud server on AWS.
@dldharma this project is now mostly defunct -- we had a lot of problems setting up a full mirror of Wikipedia. Check out WikipediaBase (https://github.com/infolab-csail/WikipediaBase), a virtual database that combines local data obtained from the Wikipedia dumps (https://dumps.wikimedia.org/enwiki/20160920/) with data fetched live from the Wikipedia API (https://www.mediawiki.org/wiki/API:Main_page).
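If it helps, here is a minimal sketch of the live-fetch half against the standard MediaWiki API (illustrative only, not WikipediaBase's actual code):

```python
import requests

def fetch_wikitext(title):
    # Query the standard MediaWiki API for the current wikitext of a page.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "titles": title,
            "format": "json",
            "formatversion": "2",
        },
        headers={"User-Agent": "wikipedia-mirror-example/0.1"},
    )
    resp.raise_for_status()
    page = resp.json()["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

# e.g. fetch_wikitext("Trinity (nuclear test)")
```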
Thanks a lot for your prompt reply; much appreciated. WikipediaBase looks great and I'm trying it right now. START is an amazing initiative: I tried it, and it works great!
Regards, Dileep
@michaelsilver the import process completed successfully. It populated the Articles, Classes, and Article-Classes mappings. Thanks once again for sharing WikipediaBase.
On reviewing the data, I found that article categories are present only in Article.markup. To my surprise, the category tables (parent/child category relations) and the category-to-article relation are completely missing. I spiked the code and will need to enhance the mechanism to support them; a sketch of what I mean follows below. Any ideas or alternatives for the category problem?
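A minimal sketch of that extraction, assuming Article.markup holds raw wikitext (hypothetical helper, not existing WikipediaBase code):

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]] links.
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")

def extract_categories(markup):
    # Return the category names linked from a page's wikitext.
    return [name.strip() for name in CATEGORY_RE.findall(markup)]

print(extract_categories(
    "...[[Category:Poems|Grazyna]] ... [[Category:Polish literature]]"))
# ['Poems', 'Polish literature']
```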
Regards, Dileep
@dldharma why don't you open an issue in WikipediaBase so we can discuss further there. When you create the issue, please provide a printout of the tables you have populated (\d in postgres) and describe what you mean by "article categories". By category, do you mean which type of infobox the article has?