gnosygnu / xowa

xowa offline wiki application

Make: UNIQUE constraint failed: hxtn_page.page_id for enwiki (question: how long do you expect 'creating db index' to take) #526

Open · desb42 opened this issue 5 years ago

desb42 commented 5 years ago

I have just started the rebuild of the 2019-06-01 HTML. The mass.parse phase has gone very well and took 3.5 days with very few #invoke errors.

The process is on to the next phase, and the on-screen line reads:

creating db index (please wait); db=xomp_wkr idx=hxtn_blob__pkey

From the timestamps on the xomp_wkr.sqlite3 files, the mass parse finished at about 01:12 today. It is now 06:35 (approx 5 hours later).

Looking at the Windows 10 Task Manager, I see little CPU usage and little disk usage. I would expect building an index to be intensive in both (especially disk).

First: Is this what you expect? If so, how long should it take (another three days?)

Second: If I were to break in, can I restart it without redoing the mass parse?

desb42 commented 5 years ago

Looking at the logs (see #527), the last SQLite error may be indicative?

mass_parse.fail_end; err=[err 0] <org.sqlite.SQLiteException> [SQLITE_CONSTRAINT_UNIQUE]
  A UNIQUE constraint failed (UNIQUE constraint failed: hxtn_page.page_id, hxtn_page.wkr_id, hxtn_page.data_id)
  [err 1] <db> db.engine:exec failed: url=data
   source=g:\xowa\wiki\en.wikipedia.org\tmp\xomp\xomp_004\xomp_wkr.sqlite3;version=3; 
   sql=CREATE UNIQUE INDEX IF NOT EXISTS hxtn_page__pkey ON hxtn_page (page_id, wkr_id, data_id);
  [trace]:
    org.sqlite.core.DB.newSQLException(DB.java:909)
    org.sqlite.core.DB.newSQLException(DB.java:921)
    org.sqlite.core.DB.throwex(DB.java:886)
    org.sqlite.core.NativeDB._exec_utf8(Native Method)
    org.sqlite.core.NativeDB._exec(NativeDB.java:87)
    org.sqlite.jdbc3.JDBC3Statement.executeUpdate(JDBC3Statement.java:116)
    gplx.dbs.engines.Db_engine_sql_base.Exec_as_int(Db_engine_sql_base.java:48)
    gplx.dbs.engines.Db_engine_sql_base.Meta_idx_create(Db_engine_sql_base.java:69)
    gplx.dbs.Db_conn.Meta_idx_create(Db_conn.java:47)
    gplx.xowa.htmls.hxtns.pages.Hxtn_page_tbl.Stmt_end(Hxtn_page_tbl.java:42)
    gplx.xowa.htmls.hxtns.pages.Hxtn_page_mgr.Insert_end(Hxtn_page_mgr.java:74)
    gplx.xowa.addons.bldrs.mass_parses.parses.wkrs.Xomp_parse_wkr.Exec(Xomp_parse_wkr.java:186)
    gplx.xowa.addons.bldrs.mass_parses.parses.wkrs.Xomp_parse_wkr.Invk(Xomp_parse_wkr.java:213)
    gplx.Gfo_invk_.Invk_by_msg(Gfo_invk_.java:34)
    gplx.core.threads.Thread_adp.run(Thread_adp.java:33)
    java.lang.Thread.run(Thread.java:748)

possibly?

gnosygnu commented 5 years ago

the mass.parse phase has gone very well and took 3.5 days with very few #invoke errors

Nice, good to hear. I'm trying to rebuild www.wikidata.org (keeps failing around Property 66## out of 68##), but will kick off a run sometime this week.

creating db index (please wait); db=xomp_wkr idx=hxtn_blob__pkey
From the timestamps on the xomp_wkr.sqlite3 files, the mass parse finished at about 01:12 today. It is now 06:35 (approx 5 hours later). Looking at the Windows 10 Task Manager, I see little CPU usage and little disk usage. I would expect building an index to be intensive in both (especially disk).

Yeah, something failed. :(

So, in general, there's no way to get progress on a SQLite CREATE INDEX (indicated by "creating db index (please wait)"). Sometimes it takes 1 minute; sometimes it can take 3+ hours. It all depends on how big the database is.

The first test is to check the logs (which you did next).

The simple test is to check whether the last-modified timestamp of the .xowa file is within the last 5 minutes. If CREATE INDEX is running, it will still be adding to / updating the file regularly. If the file has not been updated within 5 minutes, then chances are something failed.

First: Is this what you expect? If so, how long should it take (another three days?)

It should take less than 30 minutes. As per your next comment, something failed.

Second: If I were to break in, can I restart it without redoing the mass parse?

Yup. All the steps were designed to run individually. Once wiki.mass_parse.exec completes, you should be able to run

add     ('en.wikipedia.org' , 'wiki.mass_parse.make');

Looking at the logs (see #527), the last SQLite error may be indicative?

mass_parse.fail_end; err=[err 0] <org.sqlite.SQLiteException> [SQLITE_CONSTRAINT_UNIQUE]
  A UNIQUE constraint failed (UNIQUE constraint failed: hxtn_page.page_id, hxtn_page.wkr_id, hxtn_page.data_id)
  [err 1] <db> db.engine:exec failed: url=data
   source=g:\xowa\wiki\en.wikipedia.org\tmp\xomp\xomp_004\xomp_wkr.sqlite3;version=3; 
   sql=CREATE UNIQUE INDEX IF NOT EXISTS hxtn_page__pkey ON hxtn_page (page_id, wkr_id, data_id);
  [trace]:
    org.sqlite.core.DB.newSQLException(DB.java:909)

Yeah, this isn't good. This is related to TemplateStyles and #482. I had problems with the UNIQUE INDEX but thought I got them all fixed for 2019-05. It's possible I'm still missing something for 2019-06.

At this point, you can try deleting the duplicate rows and re-running wiki.parse_make.

To find duplicate rows (I have to run out, so this SQL is written without testing; back up your file before running!):

-- list each duplicated (page_id, wkr_id, data_id) key
SELECT   hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id
FROM     hxtn_page hx
GROUP BY hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id
HAVING   COUNT(hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id) > 1

To delete:

-- note: as written, this removes every copy of each duplicated key, not just the extras
DELETE FROM hxtn_page
WHERE (page_id || '|' || wkr_id || '|' || data_id)
IN
(
SELECT   hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id
FROM     hxtn_page hx
GROUP BY hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id
HAVING   COUNT(hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id) > 1
)
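
For what it's worth, an equivalent duplicate check that groups on the columns directly (also untested, same caveats) would be:

-- count rows per key; any count above 1 is a duplicate
SELECT   page_id, wkr_id, data_id, COUNT(*)
FROM     hxtn_page
GROUP BY page_id, wkr_id, data_id
HAVING   COUNT(*) > 1
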
desb42 commented 5 years ago

The one duplicate was

34556475|0|57940169

This is the page Typing_environment.

The delete statement needed some brackets to work.

Would it be better to have deleted one of the two entries instead of both?

Having performed the delete (and remembering, with my db browser, to write the changes back), restarting at wiki.parse_make has run to completion (including the download of images). This phase took about 18h. Great!

gnosygnu commented 5 years ago

The one duplicate was 34556475|0|57940169. This is the page Typing_environment.

Cool. I'm surprised that was the only one though. Did you have to use add ('en.wikipedia.org' , 'wiki.mass_parse.resume'); during your build?

Otherwise, I'll keep this issue open until I get a chance to build 2019-06 on my side (and see if the issue is reproducible here).

the delete statement needed some brackets to work

Oops. Wrote that out just before I ran out to catch a train. Updated the comment above.

Would it be better to have deleted one of the two entries instead of both?

Another oops. Mistake on my side. Only 1 should be deleted. If you deleted both, it means that TemplateStyles may not render properly for Typing_environment.
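
For next time, here's a sketch that removes just one of the two copies (untested, and it assumes hxtn_page is an ordinary rowid table, so the built-in rowid is available):

-- delete the higher-rowid copy of the duplicated key, keeping the other
DELETE FROM hxtn_page
WHERE rowid =
(
SELECT MAX(rowid)
FROM   hxtn_page
WHERE  page_id = 34556475 AND wkr_id = 0 AND data_id = 57940169
);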

Having performed the delete (and remembering, with my db browser, to write the changes back), restarting at wiki.parse_make has run to completion (including the download of images). This phase took about 18h.

And yet another oops. I meant it would take 5 mins for the hxtn INDEX. The entire wiki.parse_make will take something like 18h.

Let me know if there's anything else. Otherwise, as said above, will keep this issue open till I run 2019-06 on my side.

Thanks!

desb42 commented 5 years ago

I just spent the last four days building the hdump of enwiki 2019-08-01, and unfortunately the issue is still there: one duplicate, in the page Typing_environment.

gnosygnu commented 5 years ago

Hmm... Honestly don't know what's going on here.

I looked at the code, and can't figure out how duplicates occur for that one page. Thought process is below.

Will think about it a little more, but the next step would be to add a step to delete dupes before applying the UNIQUE INDEX (see the sketch below). Will let you know. Thanks.
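
The dedupe step would look something like this (a sketch only, untested; it assumes hxtn_page is an ordinary rowid table and keeps the lowest-rowid copy of each key):

-- remove extra copies of each (page_id, wkr_id, data_id) key,
-- keeping the copy with the lowest rowid
DELETE FROM hxtn_page
WHERE rowid NOT IN
(
SELECT   MIN(rowid)
FROM     hxtn_page
GROUP BY page_id, wkr_id, data_id
);

-- after which the existing index creation should succeed
CREATE UNIQUE INDEX IF NOT EXISTS hxtn_page__pkey ON hxtn_page (page_id, wkr_id, data_id);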