gnosygnu / xowa

xowa offline wiki application

commons import . sql error . suspect ('commons.wikimedia.org' , 'wiki.categorylinks'); cmd_ #454

Closed gettimothy closed 5 years ago

gettimothy commented 5 years ago

First, thanks for your software.

I have taken your "dirty" gfs script from http://xowa.org/home/wiki/Dev/Command-line/Dumps and broken it into sections so I can get a sense of things and isolate problems as they occur. I have successfully imported wikidata (!) and am now importing commons.

The section of gfs script I am using is:

//java -jar xowa_maven.jar --cmd_file src/main/resources/cmd_commons.gfs  --app_mode cmd
app.bldr.pause_at_end_('n');
app.scripts.run_file_by_type('xowa_cfg_app');
app.cfg.set_temp('app', 'xowa.app.web.enabled', 'y');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.text', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.html', '0');
app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.file', '0');
app.bldr.cmds {
  // build commons database; this only needs to be done once, whenever commons is updated
  add     ('commons.wikimedia.org' , 'util.cleanup')          {delete_all = 'y';}
  add     ('commons.wikimedia.org' , 'util.download')         {dump_type = 'pages-articles';}
  add     ('commons.wikimedia.org' , 'util.download')         {dump_type = 'categorylinks';}
  add     ('commons.wikimedia.org' , 'util.download')         {dump_type = 'page_props';}
  add     ('commons.wikimedia.org' , 'util.download')         {dump_type = 'image';}

  add     ('commons.wikimedia.org' , 'text.init');
  add     ('commons.wikimedia.org' , 'text.page');
  add     ('commons.wikimedia.org' , 'text.term');
  add     ('commons.wikimedia.org' , 'text.css');
  add     ('commons.wikimedia.org' , 'wiki.page_props');

  add     ('commons.wikimedia.org' , 'wiki.categorylinks');
  add     ('commons.wikimedia.org' , 'wiki.image');
  add     ('commons.wikimedia.org' , 'file.page_regy')        {build_commons = 'y'}
  add     ('commons.wikimedia.org' , 'wiki.page_dump.make');
  add     ('commons.wikimedia.org' , 'wiki.redirect')         {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
  //add     ('commons.wikimedia.org' , 'util.cleanup')          {delete_tmp = 'n'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
}
app.bldr.run;

Download, unzip and database creation/extraction have run and a truncated list of files on my system is:

commons.wikimedia.org-file-core.xowa
commons.wikimedia.org-file-user.xowa
commons.wikimedia.org-text-ns.000.xowa
commons.wikimedia.org-text-ns.003.xowa
commons.wikimedia.org-text-ns.004-db.002.xowa
commons.wikimedia.org-text-ns.004.xowa

commons.wikimedia.org-text-ns.006-db.002.xowa  (??where is 001.xowa??)
...through....
commons.wikimedia.org-text-ns.006-db.030.xowa
commons.wikimedia.org-text-ns.006.xowa
commons.wikimedia.org-text-ns.008.xowa
commons.wikimedia.org-text-ns.010.xowa
commons.wikimedia.org-text-ns.012.xowa
commons.wikimedia.org-text-ns.014.xowa
commons.wikimedia.org-text-ns.100.xowa
commons.wikimedia.org-text-ns.102.xowa
commons.wikimedia.org-text-ns.104.xowa
commons.wikimedia.org-text-ns.106.xowa
commons.wikimedia.org-text-ns.1198.xowa
commons.wikimedia.org-text-ns.2600.xowa
commons.wikimedia.org-text-ns.460.xowa
commons.wikimedia.org-text-ns.486.xowa
commons.wikimedia.org-text-ns.490.xowa
commons.wikimedia.org-text-ns.828.xowa
commons.wikimedia.org-xtn.category.core.xowa
commons.wikimedia.org-xtn.category.link-db.001.xowa
...through...
commons.wikimedia.org-xtn.category.link-db.035.xowa
commonswiki-latest-categorylinks.sql
commonswiki-latest-image.sql.gz
commonswiki-latest-pages-articles.xml
commonswiki-latest-pages-articles.xml.bz2
xowa.temp.category.sqlite3

The error is:

with error while executing script: err=[err 0] <org.sqlite.SQLiteException> [SQLITE_CORRUPT]  The database disk image is malformed (database disk image is malformed)
[err 1] <db> db.engine:exec failed: url=data source=/mnt/tmp/xowa_maven/wiki/commons.wikimedia.org/commons.wikimedia.org-core.xowa;version=3; sql=UPDATE  page
SET     page_cat_db_id = 80
WHERE   page_id IN (SELECT cl_from FROM link_db.cat_link WHERE cl_from = page.page_id);
[err 2] <bldr> unknown error: key=wiki.categorylinks
[err 3] <bldr> unknown error
[trace]:
  org.sqlite.core.DB.newSQLException(DB.java:909)
  org.sqlite.core.DB.newSQLException(DB.java:921)
  org.sqlite.core.DB.throwex(DB.java:886)
  org.sqlite.core.NativeDB._exec_utf8(Native Method)
  org.sqlite.core.NativeDB._exec(NativeDB.java:87)
  org.sqlite.jdbc3.JDBC3Statement.executeUpdate(JDBC3Statement.java:116)
  gplx.dbs.engines.Db_engine_sql_base.Exec_as_int(Db_engine_sql_base.java:11...etc

This has me stumped. I believe the failing task is add ('commons.wikimedia.org' , 'wiki.categorylinks'); because when I woke up to the error, the output preceding it looked like what I would expect that step to produce.

I poked around a bit, and it looks like link_db.cat_link may be an alias for xowa.temp.category.sqlite3, but I don't know this for certain.

I have successfully opened xowa.temp.category.sqlite3 with sqlitebrowser and many millions of records are in it.

If you could point me where to look, I will be happy to debug this for you.

A possible wildcard in this is that I bought an 8 TB external disk for this work and it seems slower than an internal disk. Perhaps there is a latency issue.

Thank you for your time.

t

gnosygnu commented 5 years ago

Hey, thanks for all the kind words.

You can try running the wiki.categorylinks step by itself.

This should generate a fresh set of xtn.category.core / xtn.category.link databases.

If it still fails, send me the output from /xowa/user/anonymous/app/tmp/session as well as the console output.

Finally, you can always try https://www.sqlite.org/pragma.html#pragma_integrity_check
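
For reference, a minimal sketch of running that integrity check from Python's standard sqlite3 module, rather than a definitive procedure. The helper names check_integrity and check_wiki_dir are made up for illustration, and the file patterns are an assumption based on the file list earlier in this thread. On a multi-gigabyte database on a slow external disk the check can take a while.

```python
import sqlite3
from pathlib import Path


def check_integrity(db_path):
    """Run PRAGMA integrity_check on one file and return the result rows.

    A healthy database yields exactly ["ok"]; anything else describes damage.
    """
    # Open read-only through a URI so the check cannot create or modify the file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    except sqlite3.DatabaseError as exc:
        # Badly damaged files can fail before the pragma returns any rows.
        return [f"error: {exc}"]
    finally:
        conn.close()


def check_wiki_dir(wiki_dir, patterns=("*.xowa", "*.sqlite3")):
    """Check every matching file in wiki_dir; return {file name: result rows}."""
    report = {}
    for pattern in patterns:
        for path in sorted(Path(wiki_dir).glob(pattern)):
            report[path.name] = check_integrity(path)
    return report


# Hypothetical usage; substitute your own wiki directory:
# for name, result in check_wiki_dir("/mnt/tmp/xowa_maven/wiki/commons.wikimedia.org").items():
#     print(name, result)
```

Any file whose result is not ["ok"] is a candidate for regeneration.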

Hope this helps, and best of luck.

gettimothy commented 5 years ago

thx, will do.

gettimothy commented 5 years ago

OK, that error went away, but I got a new one.

I got a bunch of missing db messages, as seen in the first line of output below. Then it spent hours trying to index well over 200,000,000 elements (I think it was closer to 300 million), and then it threw the error below:

wiki.db:missing db; tid=xtn.category.link url=/mnt/tmp/xowa_maven/wiki/commons.wikimedia
...
...many of these when I kicked off the script after removing the files per your 
...

(the categorylink indexing was occurring here for several hours and then:)

error while executing script: err=[err 0] <gplx> error while generating catlink dbs: err=[err 0] <java.lang.ClassCastException> class gplx.dbs.engines.noops.Noop_conn_info cannot be cast to class gplx.dbs.engines.sqlite.Sqlite_conn_info (gplx.dbs.engines.noops.Noop_conn_info and gplx.dbs.engines.sqlite.Sqlite_conn_info are in unnamed module of loader 'app')     [trace]:          gplx.dbs.engines.sqlite.Sqlite_conn_info.To_url(Sqlite_conn_info.java:67)       gplx.dbs.Db_attach_itm.<init>(Db_attach_itm.java:28)    gplx.xowa.addons.wikis.ctgs.bldrs.Xob_catlink_wkr.Make_catlink_dbs(Xob_catlink_wkr.java:31)     gplx.xowa.addons.wikis.ctgs.bldrs.Xob_catlink_mgr.On_cmd_end(Xob_catlink_mgr.java:102)      gplx.xowa.addons.wikis.ctgs.bldrs.Xob_catlink_cmd.Cmd_end(Xob_catlink_cmd.java:59)      gplx.xowa.bldrs.Xob_bldr.Run(Xob_bldr.java:192)         gplx.xowa.bldrs.Xob_bldr.Invk(Xob_bldr.java:237)        gplx.langs.gfs.GfsCore_.Exec(GfsCore_.java:31)      gplx.langs.gfs.GfsCore_.Exec(GfsCore_.java:64)          gplx.langs.gfs.GfsCore_.Exec(GfsCore_.java:64)          gplx.langs.gfs.GfsCore.ExecOne_to(GfsCore.java:82)      gplx.xowa.apps.gfs.Xoa_gfs_mgr.Run_str_for(Xoa_gfs_mgr.java:86)         gplx.xowa.apps.gfs.Xoa_gfs_mgr.Run_str_for(Xoa_gfs_mgr.java:77)    gplx.xowa.apps.gfs.Xoa_gfs_mgr.Run_url_for(Xoa_gfs_mgr.java:69)          gplx.xowa.apps.gfs.Xoa_gfs_mgr.Run_url(Xoa_gfs_mgr.java:61)     gplx.xowa.apps.boots.Xoa_boot_mgr.Run_app(Xoa_boot_mgr.java:133)        gplx.xowa.apps.boots.Xoa_boot_mgr.Run(Xoa_boot_mgr.java:38)     gplx.xowa.Xoa_app_.Run(Xoa_app_.java:28)   gplx.xowa.Xowa_main.main(Xowa_main.java:22)
[err 1] <bldr> unknown error: key=wiki.categorylinks
[err 2] <bldr> unknown error
[trace]:
  gplx.xowa.addons.wikis.ctgs.bldrs.Xob_catlink_mgr.On_cmd_end(Xob_catlink_mgr.java:106)
  gplx.xowa.addons.wikis.ctgs.bldrs.Xob_catlink_cmd.Cmd_end(Xob_catlink_cmd.java:59)
  gplx.xowa.bldrs.Xob_bldr.Run(Xob_bldr.java:192)
  gplx.xowa.bldrs.Xob_bldr.Invk(Xob_bldr.java:237)
  gplx.langs.gfs.GfsCore_.Exec(GfsCore_.java:31)
  gplx.langs.gfs.GfsCore_.Exec(GfsCore_.java:64)
  gplx.langs.gfs.GfsCore_.Exec(GfsCore_.java:64)
  gplx.langs.gfs.GfsCore.ExecOne_to(GfsCore.java:82)
  gplx.xowa.apps.gfs.Xoa_gfs_mgr.Run_str_for(Xoa_gfs_mgr.java:86)
  gplx.xowa.apps.gfs.Xoa_gfs_mgr.Run_str_for(Xoa_gfs_mgr.java:77)
  gplx.xowa.apps.gfs.Xoa_gfs_mgr.Run_url_for(Xoa_gfs_mgr.java:69)
  gplx.xowa.apps.gfs.Xoa_gfs_mgr.Run_url(Xoa_gfs_mgr.java:61)
  gplx.xowa.apps.boots.Xoa_boot_mgr.Run_app(Xoa_boot_mgr.java:133)
  gplx.xowa.apps.boots.Xoa_boot_mgr.Run(Xoa_boot_mgr.java:38)
  gplx.xowa.Xoa_app_.Run(Xoa_app_.java:28)
  gplx.xowa.Xowa_main.main(Xowa_main.java:22)

This is just an FYI, as I think I can just hack the wikidatawiki-latest-categorylinks.sql directly into postgres.

The /xowa/user/anonymous/app/tmp/session directory is empty. I poked around in app/tmp/log and xolog, but nothing seemed pertinent.

Thank you for your time.

gnosygnu commented 5 years ago

Oops. I see the problem.

The files you moved / deleted are still registered in the xowa_db table in commons.wikimedia.org-file-core.xowa.

This causes this error (which is harmless):

wiki.db:missing db; tid=xtn.category.link url=/mnt/tmp/xowa_maven/wiki/commons.wikimedia
...
...many of these when I kicked off the script after removing the files per your 
...

But it also causes this error (which fails the operation):

error while executing script: err=[err 0] <gplx> error while generating catlink dbs: err=[err 0] <java.lang.ClassCastException> class gplx.dbs.engines.noops.Noop_conn_info cannot be cast to class gplx.dbs.engines.sqlite.Sqlite_conn_info (gplx.dbs.engines.noops.Noop_conn_info and gplx.dbs.engines.sqlite.Sqlite_conn_info are in unnamed module of loader 'app')     [trace]:          gplx.dbs.engines.sqlite.Sqlite_conn_info.To_url(Sqlite_conn_info.java:67)       gplx.dbs.Db_attach_itm.<init>(Db_attach_itm.java:28)    gplx.xowa.addons.wikis.ctgs.bldrs.Xob_catlink_wkr.Make_catlink_dbs(Xob_catlink_wkr.java:31)

Actually, you don't need to run add ('commons.wikimedia.org' , 'wiki.categorylinks'); My script omits it: http://xowa.org/home/wiki/Dev/Command-line/Dumps#Script:_gnosygnu.27s_actual_English_Wikipedia_script_.28dirty.3B_provided_for_reference_only.29

If you do want to run it, then you can try removing the entries for the first batch of files from the xowa_db table with DELETE FROM xowa_db WHERE db_type IN (6, 7).
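
A minimal sketch of applying that DELETE with Python's standard sqlite3 module, assuming the database file you pass in is the one that holds xowa_db. The function name reset_category_dbs and the .bak backup suffix are made up for illustration; only the table name, the db_type column, and the values 6 and 7 come from the statement above.

```python
import shutil
import sqlite3


def reset_category_dbs(core_db_path, db_types=(6, 7), backup=True):
    """Delete xowa_db rows with the given db_type values; return the row count."""
    if backup:
        # Keep a copy so the registry can be restored if the delete was a mistake.
        shutil.copy2(core_db_path, str(core_db_path) + ".bak")
    placeholders = ", ".join("?" for _ in db_types)
    conn = sqlite3.connect(core_db_path)
    try:
        with conn:  # commits on success, rolls back if the statement raises
            cur = conn.execute(
                f"DELETE FROM xowa_db WHERE db_type IN ({placeholders})",
                tuple(db_types),
            )
            return cur.rowcount
    finally:
        conn.close()


# Hypothetical usage; point this at whichever file holds xowa_db on your system:
# removed = reset_category_dbs("/path/to/wiki/commons.wikimedia.org-file-core.xowa")
# print(removed, "stale entries removed")
```

Run it only while XOWA is not running, and re-run the wiki.categorylinks step afterwards.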

If that fails, you can also restart the whole import from scratch. But IMHO, categories are probably not worth it.

Hope this helps. Thanks!

gettimothy commented 5 years ago

Thank you for the reply.

I was able to import the categories directly into postgres by hacking the MySQL dump file. (I did this on wikidata, and I will probably do it on commons too.)

To summarize, the second run failed because of a "bookkeeping" error in the databases you populate to keep track of which database has what. Resetting that "bookkeeping" would have let it finish.

That is good to know going forward.

thx for your time.