infolab-csail / wikipedia-mirror

Makefiles that will download and set up a local Wikipedia instance.

Serial load in database looks broken #8

Closed fakedrake closed 10 years ago

fakedrake commented 10 years ago

I checked the output this morning and the tmux terminal buffer was filled with this:

ERROR 1114 (HY000) at line 33690: The table 'revision' is full
ERROR 1114 (HY000) at line 33691: The table 'text' is full
ERROR 1114 (HY000) at line 33692: The table 'page' is full
ERROR 1114 (HY000) at line 33695: The table 'text' is full
ERROR 1114 (HY000) at line 33696: The table 'text' is full
ERROR 1114 (HY000) at line 33697: The table 'text' is full
ERROR 1114 (HY000) at line 33698: The table 'revision' is full
ERROR 1114 (HY000) at line 33699: The table 'text' is full
ERROR 1114 (HY000) at line 33700: The table 'page' is full
ERROR 1114 (HY000) at line 33703: The table 'text' is full
ERROR 1114 (HY000) at line 33704: The table 'text' is full
ERROR 1114 (HY000) at line 33705: The table 'revision' is full
ERROR 1114 (HY000) at line 33706: The table 'text' is full
ERROR 1114 (HY000) at line 33707: The table 'page' is full
ERROR 1114 (HY000) at line 33710: The table 'text' is full
ERROR 1114 (HY000) at line 33711: The table 'text' is full
ERROR 1114 (HY000) at line 33712: The table 'revision' is full
ERROR 1114 (HY000) at line 33713: The table 'text' is full
ERROR 1114 (HY000) at line 33714: The table 'page' is full
ERROR 1114 (HY000) at line 33717: The table 'text' is full
ERROR 1114 (HY000) at line 33718: The table 'text' is full
ERROR 1114 (HY000) at line 33719: The table 'revision' is full
ERROR 1114 (HY000) at line 33720: The table 'text' is full
ERROR 1114 (HY000) at line 33721: The table 'page' is full
ERROR 1114 (HY000) at line 33724: The table 'text' is full
ERROR 1114 (HY000) at line 33725: The table 'text' is full
ERROR 1114 (HY000) at line 33726: The table 'revision' is full
ERROR 1114 (HY000) at line 33727: The table 'text' is full
ERROR 1114 (HY000) at line 33728: The table 'page' is full
ERROR 1114 (HY000) at line 33731: The table 'text' is full
ERROR 1114 (HY000) at line 33732: The table 'text' is full
ERROR 1114 (HY000) at line 33733: The table 'revision' is full
ERROR 1114 (HY000) at line 33734: The table 'text' is full
ERROR 1114 (HY000) at line 33735: The table 'page' is full
ERROR 1114 (HY000) at line 33738: The table 'text' is full
ERROR 1114 (HY000) at line 33739: The table 'text' is full
ERROR 1114 (HY000) at line 33740: The table 'revision' is full
ERROR 1114 (HY000) at line 33741: The table 'text' is full
ERROR 1114 (HY000) at line 33742: The table 'page' is full
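
MySQL raises ERROR 1114 either when a MEMORY table hits its size limit or, more commonly for on-disk tables like these, when the filesystem holding the data directory runs out of space. A quick way to tell the two apart (assuming a root login works; output omitted):

$ mysql -u root -N -e "SELECT @@datadir"
$ df -h "$(mysql -u root -N -e 'SELECT @@datadir')"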

However, all the dumps look like they succeeded:

$ ls *loaded
enwiki-20131202-pages-articles10.xml-p000925001p001325000.sql-loaded  enwiki-20131202-pages-articles23.xml-p018225004p020925000.sql-loaded
enwiki-20131202-pages-articles11.xml-p001325001p001825000.sql-loaded  enwiki-20131202-pages-articles24.xml-p020925002p023724999.sql-loaded
enwiki-20131202-pages-articles12.xml-p001825001p002425000.sql-loaded  enwiki-20131202-pages-articles25.xml-p023725001p026624997.sql-loaded
enwiki-20131202-pages-articles13.xml-p002425002p003124997.sql-loaded  enwiki-20131202-pages-articles26.xml-p026625004p029624976.sql-loaded
enwiki-20131202-pages-articles14.xml-p003125001p003924999.sql-loaded  enwiki-20131202-pages-articles27.xml-p029625017p041249406.sql-loaded
enwiki-20131202-pages-articles15.xml-p003925001p004824998.sql-loaded  enwiki-20131202-pages-articles2.xml-p000010002p000024999.sql-loaded
enwiki-20131202-pages-articles16.xml-p004825005p006024996.sql-loaded  enwiki-20131202-pages-articles3.xml-p000025001p000055000.sql-loaded
enwiki-20131202-pages-articles17.xml-p006025001p007524997.sql-loaded  enwiki-20131202-pages-articles4.xml-p000055002p000104998.sql-loaded
enwiki-20131202-pages-articles18.xml-p007525004p009225000.sql-loaded  enwiki-20131202-pages-articles5.xml-p000105002p000184999.sql-loaded
enwiki-20131202-pages-articles19.xml-p009225002p011124997.sql-loaded  enwiki-20131202-pages-articles6.xml-p000185003p000305000.sql-loaded
enwiki-20131202-pages-articles1.xml-p000000010p000010000.sql-loaded   enwiki-20131202-pages-articles7.xml-p000305002p000464996.sql-loaded
enwiki-20131202-pages-articles21.xml-p013325003p015724999.sql-loaded  enwiki-20131202-pages-articles8.xml-p000465001p000665000.sql-loaded
enwiki-20131202-pages-articles22.xml-p015725013p018225000.sql-loaded  enwiki-20131202-pages-articles9.xml-p000665001p000925000.sql-loaded

I tried a couple of articles and they work (see issue #6).

But we are definitely missing stuff...

$ tac enwiki-20131202-pages-articles27.xml-p029625017p041249406.fix.xml | grep -m 1 "<title>" -n
19:    <title>San Diego Boca FC</title>
$ curl -I http://futuna.csail.mit.edu:8080/mediawiki/San_Diego_Boca_FC
HTTP/1.1 404 Not Found
Date: Sun, 04 May 2014 14:49:24 GMT
Server: Apache
X-Frame-Options: SAMEORIGIN
X-Powered-By: PHP/5.4.26
X-Content-Type-Options: nosniff
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Content-Type: text/html; charset=UTF-8

For brevity: San Diego FC

The first milestone for this is to find out exactly how many articles we are missing.
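
A rough way to do that comparison, assuming the fixed XML dumps are still on disk and the wiki database is called wikidb (hypothetical name); every page in a dump carries a <title>, so the two counts should roughly line up:

$ cat enwiki-*.fix.xml | grep -c "<title>"
$ mysql -u root wikidb -N -e "SELECT COUNT(*) FROM page"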

fakedrake commented 10 years ago

Mystery solved: /scratch has run out of space.

$ df /scratch
Filesystem                    1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_system-scratch 639957424 607442764         0 100% /scratch
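
For the record, a quick way to see what is actually eating the space:

$ du -sh /scratch/* | sort -h | tail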

fakedrake commented 10 years ago

The data takes up about 3-4x the pure data size because I keep all steps of the process (zipped XML dumps -> XML dumps -> SQL dumps -> database) in case one of them goes wrong. Since space is obviously an issue (at least in the development environment of wikipedia-mirror), I will definitely need to delete some of them.

SQL dumps are the most expensive to generate, so deleting those is out of the question. XMLs are pretty cheap, so those will probably go. I will try to make it work while keeping only the zipped ones.
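
A minimal sketch of the cleanup, assuming the *-loaded marker convention from the listing above (whether the Makefiles produce other intermediates is not covered here): once a dump's SQL has been loaded, its unzipped XML can go, since it is recoverable from the zipped original.

$ for m in *.sql-loaded; do rm -f "${m%.sql-loaded}.fix.xml"; done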

fakedrake commented 10 years ago

With enough space it works:

$ curl -I http://futuna.csail.mit.edu:8080/mediawiki/San_Diego_Boca_FC
HTTP/1.1 200 OK
Date: Thu, 08 May 2014 00:57:39 GMT
Server: Apache
X-Frame-Options: SAMEORIGIN
Cache-Control: max-age=0, no-cache
Content-Type: text/html; charset=UTF-8

Now that all the data is in the database, all we need is the Wikipedia extensions to do the rendering correctly.
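
As a first step, something like the following for each extension enwiki templates rely on (ParserFunctions is just an example; the extension list and the LocalSettings.php location are assumptions):

$ cd extensions
$ git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/ParserFunctions
$ echo 'require_once "$IP/extensions/ParserFunctions/ParserFunctions.php";' >> ../LocalSettings.php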