infolab-csail / wikipedia-mirror

Makefiles that will download and set up a local Wikipedia instance.

Undetected invalid utf8 character at dump no20 #3

Closed fakedrake closed 10 years ago

fakedrake commented 10 years ago
...

376,000 pages (14,460.426/sec), 376,000 revs (14,460.426/sec)
377,000 pages (14,458.848/sec), 377,000 revs (14,458.848/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
make: *** [/scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.sql] Error 1

The failing command is

java -jar /scratch/cperivol/wikipedia-mirror/tools/mwdumper.jar   --format=sql:1.5 /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml > /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.sql

Also, don't worry too much about running it, because the total time is just:

26.65s user 1.73s system 78% cpu 35.949 total
fakedrake commented 10 years ago

First of all I will try exporting it to XML to get an idea of where the error occurs. The final line was in the Cranopsis bocourti article. The next article is Amietophrynus brauni.

fakedrake commented 10 years ago

The xerces Java GitHub repo (clone it; impl is not visible online for some reason) does not yield much information.

fakedrake commented 10 years ago

Commands that are not advisable if you are running on a slow machine, or if the files you are dealing with are much larger than mine, are prefixed with SLOW. For each one I will sooner or later provide smarter ways of getting the same or similar results.

Snooping around

To find out where the process fails here is what I do:

$ java -jar /scratch/cperivol/wikipedia-mirror/tools/mwdumper.jar   --format=xml /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml > /tmp/just-a-copy.xml

The same error as above should be yielded (it is a problem with reading). Then I can find the last two articles that mwdumper spat out with

$ tac /tmp/just-a-copy.xml | grep "<title>" -m 2 | tac
<title>The roaring 20s</title>
<title>Cranopsis bocourti</title> # <- This is the last one

At this point I will do

  $ export ORIGINAL_XML=/scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml

so I don't have to type the name each time and so the commands match regardless of the path (better late than never). The tac pipeline above is a fast operation because it reads the file from the end backwards.

Then if I am brave and strong I can find the article in the original xml file with

SLOW

$ grep "<title>Cranopsis bocourti</title>" -A 200 -B 100 $ORIGINAL_XML | less

The | less is for browsing, and -A 200 -B 100 shows 200 lines after and 100 lines before the match. Everything seems to be normal, which is extremely weird. Later I will demonstrate a more efficient way to inspect the area.

I want to take a look at the structure of the XML document; specifically, I want to see what the parents of <title> are. I am quite lucky that the generated XML is indented, so let's use that to find all the parent tags of the cut-in-the-middle article. First let's see how deep we are right now:

$ tac /tmp/just-a-copy.xml | grep "^ *<[^/]" -m 1 | tac
<text xml:space="preserve">&lt;!-- This article was auto-generated by [[User:Polbot]]. --&gt;

I count 6 spaces so:

CAUTION This is pretty expensive. I could be smarter by looking at the head of the file with head $ORIGINAL_XML to determine level 0 and then running the command only for levels 2, 4 and 6. That would be much faster in CPU time, and I would prefer it if I had the entire dump in a single file, but right now I am not lacking in resources. So here is me being lazy.
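For reference, the smarter variant would look something like this (just a sketch, assuming the indentation levels really are 0, 2, 4 and 6):

$ head -1 $ORIGINAL_XML    # level 0 is just the <mediawiki> root
$ for i in 2 4 6; do echo "Level $i:"; tac /tmp/just-a-copy.xml | grep "^ \{$i\}<[^/]" -m 1 -n | tac; done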

SLOW

$ for i in {0..6}; do echo "Level $i:"; tac /tmp/just-a-copy.xml | grep "^ \{$i\}<[^/]" -m 1 -n | tac; done
Level 0:
17564960:<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
Level 1:
Level 2:
38:  <page>
Level 3:
Level 4:
35:    <revision>
Level 5:
Level 6:
26:      <text xml:space="preserve">&lt;!-- This article was auto-generated by [[User:Polbot]]. --&gt;

Looks like it's just pages thrown into a grand domain called mediawiki. We could have seen that from the Java source too, but as expensive as this is, it is still much faster than digging through the source of a project I know nothing about.

Hope

So anyway, the line numbers denote how far the line is from the end of the file. I will try to see whether this is a local problem, by parsing just this page on its own, or whether the parser chokes on it because of its environment. I see that I can actually afford to just throw away the page's siblings, as the opening <page> at level 2 is only 38 lines away and I expect </page> to be similarly close.

Now the obvious way of doing something like that is awk, but let's see if we can avoid such an expensive operation. First of all, the original and the generated XML have some not very substantial but scary differences:

$ cmp /tmp/just-a-copy.xml $ORIGINAL_XML
/tmp/just-a-copy.xml /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml differ: byte 2, line 1

That was early, and there is no real way around it:

$ head $ORIGINAL_XML
<mediawiki ...>
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.23wmf4</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2">Media</namespace>
      <namespace key="-1">Special</namespace>
      <namespace key="0" />
$ head /tmp/just-a-copy.xml
<?xml version="1.0" encoding="utf-8" ?>
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.23wmf4</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2">Media</namespace>
Looks similar, but the generated file misses a couple of XML attributes and has different version numbers. Let's see if things are anywhere near where we expect them based on the generated file. Well, we know the generated file is 17564960+1 lines long. If you didn't run it, you can find that out with

$ wc -l /tmp/just-a-copy.xml
17564961 /tmp/just-a-copy.xml

Which shouldn't take more than 10-20s in the worst case (for me it's almost instant). So let's see what the original file has at the line where just-a-copy.xml stops:

$ sed "17564960q;d" $ORIGINAL_XML
[[Willie Jones (American football)|Willie Jones]],

Football: nothing to do with frogs (at least we are not within 20-30 lines), so instead of spending too much effort fine-tuning line numbers I will employ a more general method.

Taking the issue seriously

First of all, looking at lines is slow; looking at bytes gets the kernel to do the work, which is always a good idea. So I will first find at which byte the article I am interested in starts.

$ grep -b "<title>Cranopsis bocourti</title>" -m 1 $ORIGINAL_XML
1197420547:    <title>Cranopsis bocourti</title>

This may take a little while, but unfortunately you are stuck with it. So let's get the bottom half of the page:

$ dd if=$ORIGINAL_XML skip=1197420547 ibs=1 | sed -n '/<\/page>/{p;q};p' > /tmp/original_tail.xml

Note that you could play with the ibs and skip values, but even this was instantaneous for me.
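If your dd is a reasonably recent GNU one, iflag=skip_bytes lets you keep the byte-accurate offset while reading in comfortably large blocks (a sketch; I have not timed it here):

$ dd if=$ORIGINAL_XML bs=1M skip=1197420547 iflag=skip_bytes | sed -n '/<\/page>/{p;q};p' > /tmp/original_tail.xml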

Now I want the head of the page tag that contains byte 1197420547. Unfortunately dd will not read in reverse; it will, however, let me take a specific portion of the file. 1K is easily manageable by tac and sed even on embedded devices, so I will try that.

$ dd if=$ORIGINAL_XML count=1000 skip=$((1197420547-1000)) ibs=1 | tac | sed '/<page>/q' | tac > /tmp/original_head.xml

It turns out we only needed about a dozen bytes, as the opening tag was right above; if it weren't, it would be fine to just jump back by whatever amount we find manageable.
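If the opening tag had been further away, the same trick with a more generous window would still be cheap (a sketch with a 64K window; tac and sed do not care):

$ dd if=$ORIGINAL_XML count=65536 skip=$((1197420547-65536)) ibs=1 | tac | sed '/<page>/q' | tac > /tmp/original_head.xml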

So anyway, concatenating the head and the tail gives us the page itself (cat /tmp/original_head.xml /tmp/original_tail.xml > /tmp/original_page.xml), so to wrap just the page in the mediawiki tags:

$ head -1 $ORIGINAL_XML | cat - /tmp/original_page.xml > /tmp/original_lite.xml; echo "</mediawiki>" >> /tmp/original_lite.xml

And there you have it.
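If xmllint happens to be installed, a quick well-formedness check of the result doesn't hurt (this is only a sanity check on top of the above, not something the parser needs):

$ xmllint --noout /tmp/original_lite.xml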

fakedrake commented 10 years ago

So here is the result from just the page:

$ java -jar /scratch/cperivol/wikipedia-mirror/tools/mwdumper.jar   --format=xml /tmp/original_lite.xml 
Exception in thread "main" java.lang.NullPointerException
        at org.mediawiki.importer.XmlDumpReader.readTitle(XmlDumpReader.java:317)
        at org.mediawiki.importer.XmlDumpReader.endElement(XmlDumpReader.java:214)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

Nice! Wait, what? A different error. Whoops. Before we continue, let's take a look at the head of the original XML:

$ head -200 $ORIGINAL_XML | less

There is a <siteinfo> tag in there, which our lite file lacks. Great. Let's try adding that to our file:

$ (head -1 /tmp/original_lite.xml; sed -n "/<siteinfo>/,/<\/siteinfo>/p;/<\/siteinfo>/q" $ORIGINAL_XML ; tail -n+2 /tmp/original_lite.xml) > /tmp/original_lite.new.xml && mv /tmp/original_lite.new.xml /tmp/original_lite.xml

This just uses sed to grab the siteinfo and put it right after the first line of the file. If you run it twice by mistake and have no proper editor at hand (like me), the compact recipe in the next comment rebuilds the file from scratch.

fakedrake commented 10 years ago
$ dd if=$ORIGINAL_XML skip=1197420547 ibs=1 | sed -n '/<\/page>/{p;q};p' > /tmp/original_tail.xml

$ dd if=$ORIGINAL_XML count=1000 skip=$((1197420547-1000)) ibs=1 | tac | sed '/<page>/q' | tac > /tmp/original_head.xml

Here is a compact version of creating the original_lite.xml file. Please consider that I am not mad; it's just that, experimenting with stuff, I broke original_lite.xml a lot, so this came in handy:

(head -1 $ORIGINAL_XML; sed -n "/<siteinfo>/,/<\/siteinfo>/p;/<\/siteinfo>/q" $ORIGINAL_XML ; cat /tmp/original_head.xml ; cat /tmp/original_tail.xml; tail -1 $ORIGINAL_XML ) > /tmp/original_lite.xml
fakedrake commented 10 years ago

To my horror it actually works. For quick reference, the command is

 $ java -jar /scratch/cperivol/wikipedia-mirror/tools/mwdumper.jar   --format=xml /tmp/original_lite.xml

and no error is yielded. I will make a script to easily and incrementally add pages until something breaks.

fakedrake commented 10 years ago

I composed all of the above into a script (it looks small because most of the work we did was investigative). Edit: cool new version that takes the article titles as arguments. It will take a while if they do not exist.

#!/bin/bash
#
# Don't really parse it, just do some nasty tricks to make a subset of
# the xml that makes sense

ORIGINAL_XML=/scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml

# Write the page whose title matches the search term to stdout
function xml_page {
    term="$@"
    title_offset=$(grep -b -F "$term" -m 1 $ORIGINAL_XML | grep -o "[0-9]*" | head -1)
    if [ -z "$title_offset" ]; then
        echo "Found '$title_offset' Grep-ing (grep -b -F \"$term\" -m 1 $ORIGINAL_XML | grep -o '[0-9]*')"
        grep -b -F "$term" -m 1 $ORIGINAL_XML | grep -o "[0-9]*"
        exit 0
    fi
    count=1000
    if [[ $title_offset -lt 1000 ]]; then
        count=$title_offset
    fi
    # Head of the page: the bytes just before the title, cut at the opening <page>
    dd if=$ORIGINAL_XML count=$count skip=$(($title_offset-$count)) ibs=1 | tac | sed '/<page>/q' | tac
    # Tail of the page: everything from the title down to the closing </page>
    dd if=$ORIGINAL_XML skip=$title_offset ibs=1 | sed -n '/<\/page>/{p;q};p'
}

# Put stdin between the mediawiki tags (plus the siteinfo) and into stdout
function mediawiki_xml {
    (head -1 $ORIGINAL_XML; sed -n "/<siteinfo>/,/<\/siteinfo>/p;/<\/siteinfo>/q" $ORIGINAL_XML ; cat - ; tail -1 $ORIGINAL_XML )
}

(for i; do xml_page "$i"; done) | mediawiki_xml
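For example, to build a lite XML with just a couple of pages (the titles and the output path are only an illustration):

$ bash data/xml-parse.sh "Cranopsis bocourti" "Amietophrynus brauni" > /tmp/original_lite.xml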
fakedrake commented 10 years ago

I dump a lite XML with the last 4 processed articles to stdout with:

 $ tac /tmp/just-a-copy.xml | grep -F "<title>" | head -4 | tac | sed 's/ *<title>\(.*\)<\/title>/\1/' | xargs -d '\n' bash data/xml-parse.sh

20 articles did not cause any problem... Neither did 500, but the entire thing is 377032 pages, so better luck next time. I don't think, however, that we choked on the sheer bulk of information:

$ tac drafts/wikipedia-parts/enwiki-20131202-pages-articles19.xml-p009225002p011124997.fix.xml | grep -F "<title>" | wc -l 
476679
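For the record, the 500-article test above is the same pipeline with a bigger head (a sketch; /tmp/last500.xml is just an example output path, and if the argument list ever gets too long for a single xargs invocation the mediawiki wrapping breaks, so keep an eye on that):

$ tac /tmp/just-a-copy.xml | grep -F "<title>" | head -500 | tac | sed 's/ *<title>\(.*\)<\/title>/\1/' | xargs -d '\n' bash data/xml-parse.sh > /tmp/last500.xml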
fakedrake commented 10 years ago

So now I have a script that actually removes entire pages, but it uses the standard pipes, so it can be quite slow even with dd (4.8MB/s makes for about 5-6 minutes in a very simple case). I also do not expect to rely on infinite storage. That's why I wrote a small C program that covers regions of a file with spaces; it works in place, so it should be very much faster.
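The real thing is the C program, but the same in-place idea can be sketched in shell for the record (cover_region is a made-up helper name, not the actual page_remover; offsets and lengths are in bytes):

# Overwrite LEN bytes of FILE, starting at byte OFFSET, with spaces, in place.
# conv=notrunc tells dd to leave the rest of the file untouched.
cover_region () {
    local file=$1 offset=$2 len=$3
    head -c "$len" /dev/zero | tr '\0' ' ' \
        | dd of="$file" bs=1 seek="$offset" conv=notrunc
}

# e.g. cover_region $ORIGINAL_XML 1197420538 1985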

fakedrake commented 10 years ago

I was actually a bit surprised to find out that it made NO difference whether dd copied stuff internally or everything was pushed through stdout:

( neg_xml_page "Cranopsis bocourti" > /tmp/huge.xml; )  79.10s user 368.78s system 98% cpu 7:36.15 total

And

( neg_xml_page "Cranopsis bocourti" /tmp/huge.xml; )  78.51s user 369.86s system 97% cpu 7:37.92 total

Wow! And not only that, the second is much harder to implement when I am trying to chain more than one dd call.

fakedrake commented 10 years ago

I find it extremely weird that this actually worked. I should feel happy and liberated from the problem I couldn't understand, but now I feel alone, confused and in need of a Mars bar. I also wonder if I should actually try my C program to see if it works as well...

Just to document the commands used:

. data/xml-parse.sh && time (neg_xml_page "Cranopsis bocourti" > /tmp/huge.xml)
java -jar tools/mwdumper.jar --format=xml /tmp/huge.xml > /dev/null

(Obviously change xml to sql:1.5 and /dev/null to a file.)
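Spelled out, that would be along these lines (/tmp/huge.sql is just an example output path):

java -jar tools/mwdumper.jar --format=sql:1.5 /tmp/huge.xml > /tmp/huge.sql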

fakedrake commented 10 years ago

To get a clean copy of part 20 and retest the C code on the large set, you can

bzcat -dv /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.bz2 > /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.raw.xml
fakedrake commented 10 years ago

Remember the original "inevitable" 7-8 minutes to clean out a page? Turns out covering it takes less than a second:

$ time data/xml-parse.sh $ORIGINAL_XML Cranopsis\ bocourti inplace
Method: inplace
        search term: Cranopsis bocourti
        title offset: 1197420551
1000+0 records in
1+1 records out
1000 bytes (1.0 kB) copied, 0.000432871 s, 2.3 MB/s
        to page start: 13
        to page end: 1972
        page start: 1197420538
        page end: 1197422523,
        bytes to copy: 2181880699
Using in place covering with /scratch/cperivol/wikipedia-mirror/data/page_remover..
Running: /scratch/cperivol/wikipedia-mirror/data/page_remover /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml 1197420538 1985
Opening /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml at 1197420538 (len: 1985)
data/xml-parse.sh $ORIGINAL_XML Cranopsis\ bocourti inplace  0.32s user 0.14s system 90% cpu 0.508 total
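A quick way to double-check at byte level that the region is now all whitespace (offset and length taken from the log above; it should print 0 if the covering worked):

$ dd if=$ORIGINAL_XML skip=1197420538 count=1985 ibs=1 2>/dev/null | tr -d ' \n' | wc -c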

That however did not fix the error...

fakedrake commented 10 years ago

So let's see if it did what we expected it to do

$ grep -n -F "Cranopsis bocourti" -m 1 $ORIGINAL_XML
19529323:    Cranopsis bocourti

and then

$ sed -n "19529123,19529423p" $ORIGINAL_XML | less

Shows that the blanks were actually inserted correctly...

[...]
<page>
  <title>The roaring 20s</title>
  <ns>0</ns>
  <id>12358588</id>
  <redirect title="Roaring Twenties" />
  <revision>
    <id>146066361</id>
    <timestamp>2007-07-21T04:40:06Z</timestamp>
    <contributor>
      <username>DBaba</username>
      <id>1322148</id>
    </contributor>
    <comment>[[WP:AES|←]]Redirected page to [[Roaring Twenties]]</comment>
    <text xml:space="preserve">#redirect [[Roaring Twenties]]</text>
    <sha1>0gi4dhdbm0pcyvjlm5q0stejqnk3wpy</sha1>
    <model>wikitext</model>
    <format>text/x-wiki</format>
  </revision>
</page>

<page>
  <title>Amietophrynus brauni</title>
  <ns>0</ns>
  <id>12358595</id>
  <revision>
    <id>548599130</id>
    <parentid>540650189</parentid>
[...]

Not easy to paste here because I use tmux and clipboards can be hell if it doesn't fit in one screen, but trust me that the missing article is the expected one and the previous and next ones are correct and intact.

fakedrake commented 10 years ago

Anyway, the reason this is happening is beyond me, but since I managed to find a working solution I will stick with it.