Closed desb42 closed 5 years ago
Ugh, that stinks.
A few immediate questions:
I'm working on #483 but with 2019-05 (instead of your 2019-06).
How much memory are you running at?
I am using -Xmx15000m and it seems to be OK
I have used wiki.mass_parse.resume to consolidate the build files and have found that there are 18401 rows with a page_status of 0.
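For anyone wanting to pull a similar list, here is a minimal sketch of that kind of query using Python's sqlite3 module. The table name `page_queue` and its columns are hypothetical stand-ins, since the real xomp schema is not shown in this thread, and reading page_status 0 as "not yet processed" is an assumption.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with a hypothetical page_queue table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_queue ("
        "page_id INTEGER PRIMARY KEY, page_status INTEGER NOT NULL)"
    )
    # Assumption: status 0 means the page has not been processed yet.
    conn.executemany(
        "INSERT INTO page_queue (page_id, page_status) VALUES (?, ?)",
        [(1, 0), (2, 1), (3, 0), (4, 1)],
    )
    return conn


def pending_page_ids(conn: sqlite3.Connection) -> list[int]:
    """Return the ids of all rows whose page_status is 0, in id order."""
    rows = conn.execute(
        "SELECT page_id FROM page_queue WHERE page_status = 0 ORDER BY page_id"
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = make_demo_db()
    print(pending_page_ids(demo))
```

The resulting ids could then be mapped to titles and requested one at a time, which is roughly the process described in the next comment.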
I have been firing all these pages at my web server, and so far have found that en.wikipedia.org/wiki/Mummenschanz is causing a loop (this is the 2563rd entry). When I break into my debug version, the stack has a lot of Match_state.match lines. I will continue my investigations.
Looking at the stack trace of en.wikipedia.org/wiki/Mummenschanz, I notice that the work is taking place inside Module:Authority_control.
I think the function p.tlsLink( id ) is the culprit. It is creating a very long regex. If I comment out the call to the regex, the page displays.
Cool. Thanks for the example. I'm pulling down 2019-06 now. I'll add this to the weekend backlog
Progressing through the list, the 9641st entry, en.wikipedia.org/wiki/Huldreich_Georg_Früh, has the same regex problem.
And another one at 13769: en.wikipedia.org/wiki/Zurich_University_of_the_Arts
Thanks for the breakdown and examples. As always, incredibly helpful.
I think the function p.tlsLink( id ) is the culprit. It is creating a very long regex.
This regex makes me so sad...
local class = "[%a%d_',%.%-%(%)%*/]"
local regex = "^%u"..string.rep(class, 3)..string.rep(class.."?", 56).."$"
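To see why this pattern can look like an infinite loop, here is a rough sketch in Python. It assumes a naive backtracking matcher (which is how I understand Lua-style pattern matching to work) and an input where some characters match the optional slots and the next character does not. The Python character classes are an approximate ASCII-only translation of the Lua ones, and both function names are made up for illustration.

```python
import re
from math import comb

# Approximate, ASCII-only translation of the Lua pattern quoted above.
CLASS = r"[A-Za-z0-9_',.\-()*/]"
PATTERN = re.compile("^[A-Z]" + CLASS * 3 + (CLASS + "?") * 56 + "$")


def dead_ends_by_formula(optional_slots: int, matching_prefix: int) -> int:
    """Failed attempts when `matching_prefix` characters can be consumed by the
    optional slots and the character after them matches nothing."""
    return sum(comb(optional_slots, k) for k in range(matching_prefix + 1))


def dead_ends_by_simulation(optional_slots: int, matching_prefix: int) -> int:
    """Count the same failures by walking every consume/skip choice."""

    def walk(slot: int, consumed: int) -> int:
        if slot == optional_slots:
            return 1  # reached `$` with input left over: one failed attempt
        paths = 0
        if consumed < matching_prefix:
            paths += walk(slot + 1, consumed + 1)  # greedy branch: consume
        paths += walk(slot + 1, consumed)  # fallback branch: skip this `?`
        return paths

    return walk(0, 0)


if __name__ == "__main__":
    # A short all-ASCII id matches without any backtracking.
    print(bool(PATTERN.match("Mummenschanz")))
    # Small case where brute force and the formula can be compared.
    print(dead_ends_by_simulation(6, 2), dead_ends_by_formula(6, 2))
    # The real pattern has 56 optional slots.
    print(dead_ends_by_formula(56, 20))
```

With 56 optional slots the count grows towards 2^56, so a single unrecognized character late in the string is enough to keep the matcher busy for a very long time.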
That said, this is a bug in XOWA and Match_state related to an earlier issue: #413. In short, ä is not recognized as an alphabetic letter. It should be considered alphabetic. Other details below.
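As a rough illustration of the difference (in Python, not XOWA's actual code), a byte-wise ASCII test rejects ä while a Unicode-aware test accepts it. The two helper names are made up for this sketch.

```python
def is_alpha_ascii_bytes(text: str) -> bool:
    """Byte-wise, ASCII-only test: every UTF-8 byte must be an ASCII letter."""
    return text.encode("utf-8").isalpha()


def is_alpha_unicode(text: str) -> bool:
    """Unicode-aware test: every code point must be a letter."""
    return text.isalpha()


if __name__ == "__main__":
    for sample in ("a", "ä", "Früh"):
        print(sample, is_alpha_ascii_bytes(sample), is_alpha_unicode(sample))
```

In UTF-8, ä is the two bytes 0xC3 0xA4, and neither is an ASCII letter, so a matcher that classifies bytes one at a time never sees it as alphabetic.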
I'll have a fix for this later today / tomorrow.
%b() which finds all nested parentheses. Another way to look at this is with these pseudo call-stacks:
So #413 involved switching the regex engine from Java to Luaj.
Unfortunately, Luaj is based on Lua (C), which is ASCII for all string handling (I believe this is one of the reasons why MediaWiki created UString). That means a is considered a letter, but something like ä is not.
Fixed with commits above. Tested with urls below.
Will kick off another build for 2019-06 in the next week or so.
Thanks!
I have just spent the last 10 days rebuilding enwiki 2019-06-01. Unfortunately, it looks like 4 of 8 wkr processes have not completed. My CPU is running at ~80%. The xomp.sqlite3 stats are:
The rows marked with an asterisk are the databases that are inaccessible (while the build is running).
I have also added more namespace ids to the generate process.
I am going to crash out now and see what I can dig up.