First of all I will try exporting it to XML to get an idea of where the error occurs. The final line mwdumper produced was in the Cranopsis bocourti article; the next article is Amietophrynus brauni.
The xerces Java github repo (clone it, impl is not visible online for some reason) does not yield too much information.
Commands that are not advised if you are running on a slow machine, or if the files you are dealing with are much larger than mine, are prefixed with SLOW. For each one I will sooner or later provide smarter ways of getting the same or similar results.
To find out where the process fails here is what I do:
$ java -jar /scratch/cperivol/wikipedia-mirror/tools/mwdumper.jar --format=xml /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml > /tmp/just-a-copy.xml
The same error as above should be yielded (it is a problem with reading). Then I can find the last two articles that mwdumper spat out with
$ tac /tmp/just-a-copy.xml | grep "<title>" -m 2 | tac
<title>The roaring 20s</title>
<title>Cranopsis bocourti</title> # <- This is the last one
At this point I will do
$ export ORIGINAL_XML=/scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml
so I don't have to type the name each time and so the commands match regardless of the path (better late than never). The tac pipeline above is fast because the file is read from the end backwards.
Then, if I am brave and strong, I can find the article in the original XML file with
SLOW
$ grep "<title>Cranopsis bocourti</title>" -A 200 -B 100 $ORIGINAL_XML | less
The | less is there to browse, and -A 200 -B 100 shows 200 lines after and 100 lines before the match. Everything around it seems to be normal, which is extremely weird. Later I will demonstrate how to inspect the area in a more efficient way.
I want to take a look at the structure of the XML document, specifically at what the parents of title are. I am quite lucky that the generated XML is indented, so let's use that to find all the parent tags of the cut-in-the-middle article. First let's see how deep we are right now:
$ tac /tmp/just-a-copy.xml | grep "^ *<[^/]" -m 1 | tac
<text xml:space="preserve"><!-- This article was auto-generated by [[User:Polbot]]. -->
I count 6 spaces so:
CAUTION: This is pretty expensive. I could be smarter by looking at the head of the file with head $ORIGINAL_XML to determine level 0 and then run the command only for layers 2, 4 and 6. That would be much faster in CPU time and I would prefer it if I had the entire dump in a single file, but right now I am not lacking in resources. So here is me being lazy.
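For reference, that smarter variant would look roughly like this (a sketch, not what I actually ran): eyeball level 0 at the top of the file, then search from the end only for the even indentation levels:
$ head -5 $ORIGINAL_XML
$ for i in 2 4 6; do echo "Level $i:"; tac /tmp/just-a-copy.xml | grep "^ \{$i\}<[^/]" -m 1 -n | tac; done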
SLOW
$ for i in {0..6}; do echo "Level $i:"; tac /tmp/just-a-copy.xml | grep "^ \{$i\}<[^/]" -m 1 -n | tac; done
Level 0:
17564960:<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
Level 1:
Level 2:
38: <page>
Level 3:
Level 4:
35: <revision>
Level 5:
Level 6:
26: <text xml:space="preserve"><!-- This article was auto-generated by [[User:Polbot]]. -->
Looks like it's just pages thrown into one grand element called mediawiki. We could have seen that from the Java source too, but as expensive as this is, it is much faster than digging through the source of a project I know nothing about.
So anyway, the line numbers denote how far each line is from the end of the file. I will try to parse just this page on its own, to see whether this is a local problem or whether mwdumper chokes on it because of its environment. I see that I can actually afford to just throw away the siblings at the top level of the tree, since the enclosing <page> is only 38 lines away from the end and I expect </page> to be similarly close.
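That guess can be confirmed with the same tac trick used above (a quick check, not strictly necessary):
$ tac /tmp/just-a-copy.xml | grep -n "</page>" -m 1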
Now the obvious way of doing something like that is awk, but let's see if we can avoid such an expensive operation. First of all, the original and the generated XML have some not very substantial but scary differences:
$ cmp /tmp/just-a-copy.xml $ORIGINAL_XML
/tmp/just-a-copy.xml /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml differ: byte 2, line 1
That was early. And also there is no real way around it
$ head $ORIGINAL_XML
Wikipedia
http://en.wikipedia.org/wiki/Main_Page
MediaWiki 1.23wmf4
first-letter
Media
Special
$ head /tmp/just-a-copy.xml
Wikipedia
http://en.wikipedia.org/wiki/Main_Page
MediaWiki 1.23wmf4
first-letter
Media
They look similar, but the generated one is missing a couple of XML attributes and has different versions. Let's see if things are anywhere near where we expect them, based on the generated file. We know the generated file is 17564960+1 lines long; if you didn't run it you can find that out with
$ wc -l /tmp/just-a-copy.xml
17564961 /tmp/just-a-copy.xml
Which shouldn't take more than 10-20s in the worst case (for me it's almost instant). So let's see what's on the supposed last line of just-a-copy.xml
$ sed "17564960q;d" $ORIGINAL_XML
[[Willie Jones (American football)|Willie Jones]],
Football: nothing to do with frogs (at least we are not within 20-30 lines of them), so instead of spending too much effort fine-tuning line numbers I will employ a more general method.
First of all, looking at lines is slow; looking at bytes lets the kernel do the seeking, which is always a good idea. So I will first find at which byte the article I am interested in starts.
$ grep -b "<title>Cranopsis bocourti</title>" -m 1 $ORIGINAL_XML
1197420547: <title>Cranopsis bocourti</title>
This may take a little while but you are stuck with it unfortunately. So let's get the bottom half
$ dd if=$ORIGINAL_XML skip=1197420547 ibs=1 | sed -n '/<\/page>/{p;q};p' > /tmp/original_tail.xml
Note that you could play with the ibs and skip values, but even this was instantaneous for me.
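If your dd is from GNU coreutils (an assumption about the toolchain), you can also keep the byte-precise offset while using a larger block size, e.g.:
$ dd if=$ORIGINAL_XML bs=1M skip=1197420547 iflag=skip_bytes | sed -n '/<\/page>/{p;q};p' > /tmp/original_tail.xml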
Now I want the head of the page tag that contains byte 1197420547. Unfortunately dd will not read in reverse. It will, however, let me take a specific portion of the file. 1K is easily manageable by tac and sed even on embedded devices, so I will try with that.
$ dd if=$ORIGINAL_XML count=1000 skip=$((1197420547-1000)) ibs=1 | tac | sed '/<page>/q' | tac > /tmp/original_head.xml
It turns out we needed only about a dozen bytes, since the tag was right above; if it weren't, it would be a good idea to just jump back by whatever amount we find manageable.
So anyway, we have our opening <page> tag from the dd command at the top of the head fragment, so all that remains is to wrap just this page in the enclosing mediawiki tags.
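The /tmp/original_page.xml used in the next command is not created explicitly anywhere above; presumably it is just the head and tail fragments concatenated, something like:
$ cat /tmp/original_head.xml /tmp/original_tail.xml > /tmp/original_page.xml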
$ head -1 $ORIGINAL_XML | cat - /tmp/original_page.xml > /tmp/original_lite.xml; echo "</mediawiki>" >> /tmp/original_lite.xml
And there you have it.
So here is the result from just the page:
$ java -jar /scratch/cperivol/wikipedia-mirror/tools/mwdumper.jar --format=xml /tmp/original_lite.xml
Exception in thread "main" java.lang.NullPointerException
        at org.mediawiki.importer.XmlDumpReader.readTitle(XmlDumpReader.java:317)
        at org.mediawiki.importer.XmlDumpReader.endElement(XmlDumpReader.java:214)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
Nice! Wait, what? A different error. Whoops. Before we continue, let's take a look at the head of the original XML:
$ head -200 $ORIGINAL_XML | less
There is a <siteinfo> tag before the pages. Great. Let's try adding that to our file:
$ (head -1 /tmp/original_lite.xml; sed -n "/<siteinfo>/,/<\/siteinfo>/p;/<\/siteinfo>/q" $ORIGINAL_XML ; tail -n+2 /tmp/original_lite.xml) > /tmp/original_lite.new.xml && mv /tmp/original_lite.new.xml /tmp/original_lite.xml
This just uses sed to pull out the siteinfo block and puts it right after the first line of the file (writing to a temporary file first, because redirecting straight back into /tmp/original_lite.xml would truncate it before head gets to read it). If you run it twice by mistake and have no proper editor at hand (like me):
$ dd if=$ORIGINAL_XML skip=1197420547 ibs=1 | sed -n '/<\/page>/{p;q};p' > /tmp/original_tail.xml
$ dd if=$ORIGINAL_XML count=1000 skip=$((1197420547-1000)) ibs=1 | tac | sed '/<page>/q' | tac > /tmp/original_head.xml
Here is a compact version of creating the original_lite.xml file. Please consider that I am not mad; it's just that while experimenting with stuff I broke original_lite.xml a lot, so this came in handy:
$ (head -1 $ORIGINAL_XML; sed -n "/<siteinfo>/,/<\/siteinfo>/p;/<\/siteinfo>/q" $ORIGINAL_XML ; cat /tmp/original_head.xml ; cat /tmp/original_tail.xml; tail -1 $ORIGINAL_XML ) > /tmp/original_lite.xml
To my horror it actually works. For quick reference, the command is
$ java -jar /scratch/cperivol/wikipedia-mirror/tools/mwdumper.jar --format=xml /tmp/original_lite.xml
and no error is yielded. I will make a script to easily add pages incrementally until something breaks.
I composed all of the above into a script (it looks small because most of the work we did was investigative). Edit: cool new version that takes the article titles as arguments. It will take a while if they do not exist.
#!/bin/bash
#
# Don't really parse it, just do some nasty tricks to make a subset of
# the xml that makes sense

ORIGINAL_XML=/scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml

# Throw page in stdout
function xml_page {
    term="$@ "
    title_offset=$(grep -b -F "$term" -m 1 $ORIGINAL_XML | grep -o "[0-9]*" | head -1)
    if [ ! $title_offset ]; then
        echo "Found '$title_offset' Grep-ing (grep -b -F \"$term\" -m 1 $ORIGINAL_XML | grep -o '[0-9]*')"
        grep -b -F "$term" -m 1 $ORIGINAL_XML | grep -o "[0-9]*"
        exit 0
    fi

    # Do not seek past the beginning of the file
    count=1000
    if [[ $title_offset -lt 1000 ]]; then
        count=$title_offset
    fi

    # Head of the page: read the bytes before the title and cut at <page>
    dd if=$ORIGINAL_XML count=$count skip=$(($title_offset-$count)) ibs=1 | tac | sed '/<page>/q' | tac
    # Tail of the page: read forward from the title until </page>
    dd if=$ORIGINAL_XML skip=$title_offset ibs=1 | sed -n '/<\/page>/{p;q};p'
}

# Put stdin between mediawiki tags (with the siteinfo) and into stdout
function mediawiki_xml {
    (head -1 $ORIGINAL_XML; sed -n "/<siteinfo>/,/<\/siteinfo>/p;/<\/siteinfo>/q" $ORIGINAL_XML ; cat - ; tail -1 $ORIGINAL_XML )
}

(for i; do xml_page "$i"; done) | mediawiki_xml
I write a file with the last 4 processed articles to stdout with:
$ tac /tmp/just-a-copy.xml | grep -F "<title>" | head -4 | tac | sed 's/ *<title>\(.*\)<\/title>/\1/' | xargs -d '\n' bash data/xml-parse.sh
20 articles did not cause any problem... Neither did 500, but the entire thing is 377032 pages, so better luck next time. I don't think, however, that we died of the sheer bulk of information:
$ tac drafts/wikipedia-parts/enwiki-20131202-pages-articles19.xml-p009225002p011124997.fix.xml | grep -F "<title>" | wc -l
476679
So now I have a script that actually removes entire pages, but it uses the standard pipes, so it can be quite slow even with dd (4.8MB/s makes for about 5-6 minutes in a very simple case). I also do not expect to be able to rely on infinite storage. That's why I wrote a small C program that covers regions of a file with spaces; it works in-place, so it should be very much faster.
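The C program itself is not pasted here, but the in-place covering idea can be sketched in shell too, assuming GNU dd and hypothetical $PAGE_START / $PAGE_LEN values like the ones xml-parse.sh reports:
# overwrite $PAGE_LEN bytes starting at byte $PAGE_START with spaces, without truncating the file
$ printf '%*s' "$PAGE_LEN" '' | dd of=$ORIGINAL_XML bs=1 seek="$PAGE_START" conv=notrunc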
I was actually a bit surprised to find out that it made NO difference whether dd copied stuff internally or things were piped through stdout:
( neg_xml_page "Cranopsis bocourti" > /tmp/huge.xml; ) 79.10s user 368.78s system 98% cpu 7:36.15 total
And
( neg_xml_page "Cranopsis bocourti" /tmp/huge.xml; ) 78.51s user 369.86s system 97% cpu 7:37.92 total
Wow! And not only that, the second is much harder to implement when I am trying to chain more than one dd call.
I find it extremely weird that this actually worked. I should feel happy and liberated from the problem I couldn't understand, but now I feel alone, confused and in need of a Mars bar. I also wonder if I should actually try my C program to see if it works as well...
Just to document the commands used:
. data/xml-parse.sh && time (neg_xml_page "Cranopsis bocourti" > /tmp/huge.xml)
java -jar tools/mwdumper.jar --format=xml /tmp/huge.xml > /dev/null
(Obviously change xml to sql:1.5 and /dev/null to a file.)
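I.e. something along the lines of (the output filename is just an example):
java -jar tools/mwdumper.jar --format=sql:1.5 /tmp/huge.xml > /tmp/huge.sql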
To clean things up for part 20 and retest the C code on the large set you can
bzcat -dv /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.bz2 > /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.raw.xml
Remember the original "inevitable" 7-8 minutes to clean out a page. Turns out covering it takes less than a sec:
$ time data/xml-parse.sh $ORIGINAL_XML Cranopsis\ bocourti inplace
Method: inplace
search term: Cranopsis bocourti
title offset: 1197420551
1000+0 records in
1+1 records out
1000 bytes (1.0 kB) copied, 0.000432871 s, 2.3 MB/s
to page start: 13
to page end: 1972
page start: 1197420538
page end: 1197422523, bytes to copy: 2181880699
Using in place covering with /scratch/cperivol/wikipedia-mirror/data/page_remover..
Running: /scratch/cperivol/wikipedia-mirror/data/page_remover /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml 1197420538 1985
Opening /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml at 1197420538 (len: 1985)
data/xml-parse.sh $ORIGINAL_XML Cranopsis\ bocourti inplace  0.32s user 0.14s system 90% cpu 0.508 total
That however did not fix the error...
So let's see if it did what we expected it to do
$ grep -n -F "Cranopsis bocourti " -m 1 $ORIGINAL_XML
19529323:Cranopsis bocourti
and then
$ sed -n "19529123,19529423p" $ORIGINAL_XML | less
Shows that the blanks were actually inserted correctly...
[...]
<page>
<title>The roaring 20s</title>
<ns>0</ns>
<id>12358588</id>
<redirect title="Roaring Twenties" />
<revision>
<id>146066361</id>
<timestamp>2007-07-21T04:40:06Z</timestamp>
<contributor>
<username>DBaba</username>
<id>1322148</id>
</contributor>
<comment>[[WP:AES|←]]Redirected page to [[Roaring Twenties]]</comment>
<text xml:space="preserve">#redirect [[Roaring Twenties]]</text>
<sha1>0gi4dhdbm0pcyvjlm5q0stejqnk3wpy</sha1>
<model>wikitext</model>
<format>text/x-wiki</format>
</revision>
</page>
<page>
<title>Amietophrynus brauni</title>
<ns>0</ns>
<id>12358595</id>
<revision>
<id>548599130</id>
<parentid>540650189</parentid>
[...]
Not easy to paste here because I use tmux and clipboards can be hell if it doesn't fit in one screen, but trust me that the missing article is the expected one and the previous and next ones are correct and intact.
Anyway the reason this is happening is beyond me but since I managed to find a working solution I will stick with it.
The failing command is
Also don't worry too much about running it because the time is