gnosygnu / xowa

xowa offline wiki application

Luaj: Regex not supported for unicode characters: ä (yet another infinite loop (enwiki)) #502

Closed desb42 closed 5 years ago

desb42 commented 5 years ago

I have just spent the last 10 days rebuilding enwiki 2019-06-01. Unfortunately, it looks like 4 of the 8 wkr processes have not completed. My CPU is running at ~80%. The xomp.sqlite3 stats are:

0* 20190619_121255    853000
1  20190616_035155    169999
2  20190624_085035   1791906
3* 20190615_165908     42000
4* 20190617_072934    354999
5* 20190620_104504    823000
6  20190624_090028   2497825
7  20190624_085115   2327204

The rows marked with an asterisk are the databases that are inaccessible (while the build is running).

I have also added more namespace ids to the generate process:

{cfg {ns_ids = '0|4|8|12|14|100';}}

I am going to crash out now and see what I can dig up.

gnosygnu commented 5 years ago

Ugh, that stinks.

A few immediate questions:


I'm working on #483 but with 2019-05 (instead of your 2019-06).

desb42 commented 5 years ago

How much memory are you running at?

I am using -Xmx15000m and it seems to be OK

I have used wiki.mass_parse.resume to consolidate the build files and have found that there are 18401 rows with a page_status of 0.

I have been firing all these pages at my web server and so far have found that en.wikipedia.org/wiki/Mummenschanz is causing a loop (this is the 2563rd entry). When I break into my debug version, the stack has a lot of Match_state.match lines. I will continue my investigations.

desb42 commented 5 years ago

Looking at the stack trace of en.wikipedia.org/wiki/Mummenschanz, I notice that the work is taking place inside Module:Authority_control.

I think the function p.tlsLink( id ) is the culprit. It is creating a very long regex. If I comment out the call to the regex, the page displays.

gnosygnu commented 5 years ago

Cool. Thanks for the example. I'm pulling down 2019-06 now. I'll add this to the weekend backlog

desb42 commented 5 years ago

Progressing through the list (9641st), en.wikipedia.org/wiki/Huldreich_Georg_Früh has the same regex problem.

desb42 commented 5 years ago

And another one at 13769: en.wikipedia.org/wiki/Zurich_University_of_the_Arts

gnosygnu commented 5 years ago

Thanks for the breakdown and examples. As always, incredibly helpful.

"I think the function p.tlsLink( id ) is the culprit. It is creating a very long regex."

This regex makes me so sad...

local class = "[%a%d_',%.%-%(%)%*/]"
local regex = "^%u"..string.rep(class, 3)..string.rep(class.."?", 56).."$"
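To get a feel for why this pattern hurts when an id fails to match, here is a rough Python estimate. It assumes a naive backtracking matcher that tries "consume" before "skip" for each of the 56 optional `class?` items and does no memoization. It also assumes the id looks like the page title with underscores. `failing_paths` is a hypothetical helper for illustration, not XOWA or Luaj code.

```python
from math import comb

OPTIONAL_SLOTS = 56  # the string.rep(class.."?", 56) part of the pattern


def failing_paths(matchable_chars, slots=OPTIONAL_SLOTS):
    """Count consume/skip assignments a naive matcher can try before giving up.

    matchable_chars is how many characters after the 4 mandatory ones
    (%u plus 3 x class) still match the class before the first character
    that does not match.
    """
    limit = min(matchable_chars, slots)
    # For every j, choose which j of the optional slots consume a character.
    return sum(comb(slots, j) for j in range(limit + 1))


# Assumed id "Huldreich_Georg_Früh": "Huld" is mandatory, then 14 ASCII
# characters ("reich_Georg_Fr") precede the "ü" that the class rejects.
print(failing_paths(14))
```

Under these assumptions the count for a 14-character prefix is already in the trillions, which would be consistent with a page that appears to loop forever.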

That said, this is a bug in XOWA and Match_state related to an earlier issue: #413. In short, ä is not recognized as an alphabetic letter. It should be considered alphabetic. Other details below.
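A small Python sketch of the mismatch, not the XOWA fix itself. The two character classes are approximate translations of the Lua class (one ASCII-only, one Unicode-aware), and the 3 mandatory plus 56 optional items are folded into a single bounded quantifier `{3,59}` so that the demo itself does not backtrack heavily. The sample id is assumed to be the page title with underscores.

```python
import re

# Approximate translations of the Lua class "[%a%d_',%.%-%(%)%*/]".
ASCII_CLASS = r"[A-Za-z0-9_',.\-()*/]"  # %a treated as ASCII letters only
UNICODE_CLASS = r"[\w',.\-()*/]"        # \w covers Unicode letters, digits, "_"


def build(char_class):
    # ^%u, then 3 mandatory and up to 56 optional class characters, then $.
    return re.compile(r"^[A-Z]" + char_class + r"{3,59}$")


ascii_only = build(ASCII_CLASS)
unicode_aware = build(UNICODE_CLASS)

sample = "Huldreich_Georg_Fr\u00fch"  # "Huldreich_Georg_Früh"
print(bool(ascii_only.match(sample)))     # the ASCII class rejects "ü"
print(bool(unicode_aware.match(sample)))  # the Unicode class accepts "ü"
```

With the ASCII-only class the id can never match, so the matcher is forced to exhaust every alternative before failing. Treating ä and ü as letters lets the pattern succeed on the first attempt.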

I'll have a fix for this later today / tomorrow.


#413 involved redirecting Scribunto regex calls from Java to Luaj.

Another way to look at this is with these pseudo call-stacks:

So #413 involved switching the regex engine from Java to Luaj.

Unfortunately, Luaj is based on Lua (C), which assumes ASCII for all string handling (I believe this is one of the reasons why MediaWiki created UString).
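To illustrate the ASCII limitation, here is a Python sketch that mimics a byte-oriented, ASCII-only letter check. It assumes the text reaches the matcher as UTF-8 bytes. `ascii_isalpha` is a hypothetical stand-in for a C-style `isalpha` in the default locale, not a Luaj function.

```python
def ascii_isalpha(byte):
    """Hypothetical stand-in for an ASCII-only isalpha on a single byte."""
    return (0x41 <= byte <= 0x5A) or (0x61 <= byte <= 0x7A)


letter = "\u00e4"  # "ä"
encoded = letter.encode("utf-8")

print(encoded)                              # b'\xc3\xa4'
print([ascii_isalpha(b) for b in encoded])  # [False, False]
print(letter.isalpha())                     # True
```

Seen as bytes, ä is two values above the ASCII range, and neither counts as a letter. A Unicode-aware check on the whole character does classify it as alphabetic.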

gnosygnu commented 5 years ago

Fixed with commits above. Tested with urls below.

Will kick off another build for 2019-06 in the next week or so.

Thanks!