Open desb42 opened 5 years ago
Looking at the logs (see #527), the last SQLite error may be indicative?
mass_parse.fail_end; err=[err 0] <org.sqlite.SQLiteException> [SQLITE_CONSTRAINT_UNIQUE]
A UNIQUE constraint failed (UNIQUE constraint failed: hxtn_page.page_id, hxtn_page.wkr_id, hxtn_page.data_id)
[err 1] <db> db.engine:exec failed: url=data
source=g:\xowa\wiki\en.wikipedia.org\tmp\xomp\xomp_004\xomp_wkr.sqlite3;version=3;
sql=CREATE UNIQUE INDEX IF NOT EXISTS hxtn_page__pkey ON hxtn_page (page_id, wkr_id, data_id);
[trace]:
org.sqlite.core.DB.newSQLException(DB.java:909)
org.sqlite.core.DB.newSQLException(DB.java:921)
org.sqlite.core.DB.throwex(DB.java:886)
org.sqlite.core.NativeDB._exec_utf8(Native Method)
org.sqlite.core.NativeDB._exec(NativeDB.java:87)
org.sqlite.jdbc3.JDBC3Statement.executeUpdate(JDBC3Statement.java:116)
gplx.dbs.engines.Db_engine_sql_base.Exec_as_int(Db_engine_sql_base.java:48)
gplx.dbs.engines.Db_engine_sql_base.Meta_idx_create(Db_engine_sql_base.java:69)
gplx.dbs.Db_conn.Meta_idx_create(Db_conn.java:47)
gplx.xowa.htmls.hxtns.pages.Hxtn_page_tbl.Stmt_end(Hxtn_page_tbl.java:42)
gplx.xowa.htmls.hxtns.pages.Hxtn_page_mgr.Insert_end(Hxtn_page_mgr.java:74)
gplx.xowa.addons.bldrs.mass_parses.parses.wkrs.Xomp_parse_wkr.Exec(Xomp_parse_wkr.java:186)
gplx.xowa.addons.bldrs.mass_parses.parses.wkrs.Xomp_parse_wkr.Invk(Xomp_parse_wkr.java:213)
gplx.Gfo_invk_.Invk_by_msg(Gfo_invk_.java:34)
gplx.core.threads.Thread_adp.run(Thread_adp.java:33)
java.lang.Thread.run(Thread.java:748)
possibly?
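The class of error in the trace above is easy to reproduce: SQLite refuses to build a UNIQUE index over a table that already contains duplicate key tuples. A minimal sketch (toy three-column schema, not XOWA's actual hxtn_page definition; the sample row is the duplicate reported later in this thread):

```python
import sqlite3

# Toy reproduction: CREATE UNIQUE INDEX raises SQLITE_CONSTRAINT_UNIQUE
# (surfaced as IntegrityError in Python) when duplicates already exist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hxtn_page (page_id INTEGER, wkr_id INTEGER, data_id INTEGER)")
conn.execute("INSERT INTO hxtn_page VALUES (34556475, 0, 57940169)")
conn.execute("INSERT INTO hxtn_page VALUES (34556475, 0, 57940169)")  # duplicate row

msg = ""
try:
    conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS hxtn_page__pkey "
                 "ON hxtn_page (page_id, wkr_id, data_id)")
except sqlite3.IntegrityError as e:  # SQLITE_CONSTRAINT_UNIQUE
    msg = str(e)
print(msg)
```

The error message names the indexed columns, which is exactly the shape of the `UNIQUE constraint failed: hxtn_page.page_id, hxtn_page.wkr_id, hxtn_page.data_id` line in the log.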
the mass.parse phase has gone very well and taken 3.5 days with very few #invoke errors
Nice, good to hear. I'm trying to rebuild www.wikidata.org (keeps failing around Property 66## out of 68##), but will kick off a run sometime this week.
creating db index (please wait); db=xomp_wkr idx=hxtn_blob__pkey
From the timestamps on the xomp_wkr.sqlite3 files, the mass parse finished about 01:12 today. It is now 06:35 (approx 5 hours later).
Looking at Windows 10 Task Manager, I see little CPU usage and little disk usage. I would expect that building an index would be intensive in both (more so disk).
Yeah, something failed. :(
So, in general, there's no way to get progress on SQLite CREATE INDEX (indicated by creating db index (please wait)). Sometimes it takes 1 minute; sometimes it can take 3+ hours. It all depends on how big the database is.
The first test is to check the logs (which you did next).
The simple test is to check whether the last-modified timestamp of the .xowa file is within the last 5 minutes. If CREATE INDEX is running, it will still be adding to / updating the file regularly. If the file has not been updated within 5 minutes, chances are something failed.
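The timestamp heuristic above can be sketched as a few lines of Python (the function name and 5-minute window are my own wording of the rule, not XOWA code):

```python
import os
import tempfile
import time

def looks_stalled(path, window_secs=300):
    """Heuristic from the comment above: if the file has not been modified
    in the last `window_secs` (default 5 minutes), CREATE INDEX has
    probably stopped making progress."""
    return (time.time() - os.path.getmtime(path)) > window_secs

# A freshly written file should not look stalled.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"data")
    path = f.name
print(looks_stalled(path))
```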
First: Is this what you expect? If so, how long should it take (another three days)?
It should take less than 30 minutes. As per your next comment, something failed.
Second: If I were to break in, can I restart it without the mass parse?
Yup. All the steps were designed to run individually. Once 'wiki.mass_parse.exec' completes, you should be able to run:
add ('en.wikipedia.org' , 'wiki.mass_parse.make');
Looking at the logs (see #527), the last SQLite error may be indicative?
mass_parse.fail_end; err=[err 0] <org.sqlite.SQLiteException> [SQLITE_CONSTRAINT_UNIQUE]
A UNIQUE constraint failed (UNIQUE constraint failed: hxtn_page.page_id, hxtn_page.wkr_id, hxtn_page.data_id)
[err 1] <db> db.engine:exec failed: url=data
source=g:\xowa\wiki\en.wikipedia.org\tmp\xomp\xomp_004\xomp_wkr.sqlite3;version=3;
sql=CREATE UNIQUE INDEX IF NOT EXISTS hxtn_page__pkey ON hxtn_page (page_id, wkr_id, data_id);
[trace]:
org.sqlite.core.DB.newSQLException(DB.java:909)
Yeah, this isn't good. This is related to TemplateStyles and #482. I had problems with the UNIQUE INDEX, but thought I'd gotten them all fixed for 2019-05. It's possible I'm still missing something for 2019-06.
At this point, you can try deleting the duplicate rows and re-running wiki.parse_make.
To find duplicate rows (I have to run out: SQL written without testing; back up your file before running!):
SELECT hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id
FROM hxtn_page hx
GROUP BY hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id
HAVING COUNT(hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id) > 1
To delete:
DELETE FROM hxtn_page
WHERE (page_id || '|' || wkr_id || '|' || data_id)
IN
(
SELECT hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id
FROM hxtn_page hx
GROUP BY hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id
HAVING COUNT(hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id) > 1
)
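The duplicate-finding query above can be exercised against a toy copy of the table via Python's sqlite3 (simplified three-column schema; sample rows are invented except the duplicate reported later in this thread):

```python
import sqlite3

# Build a toy hxtn_page with one unique row and one duplicated key tuple,
# then run the GROUP BY / HAVING query from the comment above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hxtn_page (page_id INTEGER, wkr_id INTEGER, data_id INTEGER)")
conn.executemany("INSERT INTO hxtn_page VALUES (?, ?, ?)",
                 [(1, 0, 10),               # unique row
                  (34556475, 0, 57940169),  # the reported duplicate...
                  (34556475, 0, 57940169)]) # ...appears twice
dupes = conn.execute("""
    SELECT hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id
    FROM hxtn_page hx
    GROUP BY hx.page_id || '|' || hx.wkr_id || '|' || hx.data_id
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)
```

The query reports one duplicated key, '34556475|0|57940169', matching the result described in the next comment.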
The one duplicate was
34556475|0|57940169
This is the page Typing_environment.
The delete statement needed some brackets to work.
Would it be better to have deleted one of the two entries instead of both?
Having performed the delete (and remembering with my db browser to write the changes back), restarting at wiki.parse_make has run to completion (including download of images). This phase took about 18h.
Great!
The one duplicate was 34556475|0|57940169; this is the page Typing_environment.
Cool. I'm surprised that was the only one, though. Did you have to use add ('en.wikipedia.org' , 'wiki.mass_parse.resume'); during your build?
Otherwise, I'll keep this issue open until I get a chance to build 2019-06 on my side (and see if the issue is reproducible here)
the delete statement needed some brackets to work
Oops. Wrote that out just before I ran out to catch a train. Updated the comment above.
Would it be better to have deleted one of the two entries instead of both?
Another oops. Mistake on my side. Only 1 should be deleted. If you deleted both, it means that TemplateStyles may not render properly for Typing_environment.
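A delete that keeps exactly one row per duplicate group can lean on SQLite's implicit rowid (a sketch, assuming hxtn_page is an ordinary rowid table, which the CREATE TABLE in the toy setup below is; which surviving row to keep is arbitrary here, lowest rowid):

```python
import sqlite3

# Toy table: the duplicate key tuple appears twice, plus one unique row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hxtn_page (page_id INTEGER, wkr_id INTEGER, data_id INTEGER)")
conn.executemany("INSERT INTO hxtn_page VALUES (?, ?, ?)",
                 [(34556475, 0, 57940169),
                  (34556475, 0, 57940169),  # duplicate to be removed
                  (1, 0, 10)])

# Keep the row with the lowest rowid in each (page_id, wkr_id, data_id)
# group; delete every other copy.
conn.execute("""
    DELETE FROM hxtn_page
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM hxtn_page
        GROUP BY page_id, wkr_id, data_id
    )
""")
rows = conn.execute(
    "SELECT page_id, wkr_id, data_id FROM hxtn_page ORDER BY page_id").fetchall()
print(rows)
```

After the delete, each key tuple survives exactly once, so the subsequent CREATE UNIQUE INDEX would succeed.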
Having performed the delete (and remembering with my db browser to write the changes back), restarting at wiki.parse_make has run to completion (including download of images). This phase took about 18h.
And yet another oops. I meant it would take 5 mins for the hxtn INDEX. The entire wiki.parse_make will take something like 18h.
Let me know if there's anything else. Otherwise, as said above, will keep this issue open till I run 2019-06 on my side.
Thanks!
Just spent the last four days building the hdump of enwiki 2019-08-01, and unfortunately the issue is still there: one duplicate, in the page Typing_environment.
Hmm... Honestly don't know what's going on here.
I looked at the code, and can't figure out how duplicates occur for that one page. Thought process is below.
Will think about it a little more, but next step would be to add a step to delete dupes before applying the UNIQUE INDEX. Will let you know. Thanks.
I have just started the rebuild of the 2019-06-01 html. The mass.parse phase has gone very well and taken 3.5 days with very few #invoke errors.
The process is onto the next phase, and on-screen is the line:
From the timestamps on the xomp_wkr.sqlite3 files, the mass parse finished about 01:12 today. It is now 06:35 (approx 5 hours later).
Looking at Windows 10 Task Manager, I see little CPU usage and little disk usage. I would expect that building an index would be intensive in both (more so disk).
First: Is this what you expect? If so, how long should it take (another three days)?
Second: If I were to break in, can I restart it without the mass parse?