Closed desb42 closed 4 years ago
Thanks for the amazing detail, and sorry for the delay. My email has been weird again.
Will pull down 2020-05 enwiki today, and try to reproduce. Probably comment again sometime tomorrow.
Thanks!
I looked at this more, and this will be difficult.
I'm not entirely sure yet, but I believe the problem is with the pattern in tlsLink
. It is a very long string of ?
matches which leads to a lot of backtracking. The PHP parser handles this "optimally", but the same pattern / regex chokes the Lua and vanilla Luaj pattern processor as well as the Java regex engine. See reduced test samples below, as well as p.test_727
at https://en.wikipedia.org/wiki/Module:Sandbox/Gnosygnu.
As such, the only fixes are:
JRegex
which hasn't been developed since 2002 and seems like it has many bugs: https://sourceforge.net/p/jregex/bugs/I'm leaning towards the PHP Regex Engine in Java
, but need to think about this more.
@Test public void Regex() {
String src = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz";
String pat = "[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]??[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?$";
Varargs args = LuaValue.varargsOf(new LuaValue[] {LuaValue.valueOf(src), LuaValue.valueOf(pat), LuaValue.valueOf("1")});
Varargs actl = new StringLib.gsub().invoke(args);
// 2m 31s
System.out.println(actl);
}
@Test public void Regex() {
String src = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz";
String pat = "[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]??[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?$";
Pattern pattern = Pattern.compile(pat);
Matcher matcher = pattern.matcher(src);
// 3m 52s
System.out.println(matcher.matches());
}
<?php
$string = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz';
$pattern = "[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?$";
$replacement = '1';
echo preg_replace($pattern, $replacement, $string);
echo 'done';
?>
I took a quick look at the PHP regex and it seems that it uses https://github.com/kkos/oniguruma
which may well be PCRE
I took a quick look at the PHP regex and it seems that it uses https://github.com/kkos/oniguruma
which may well be PCRE
Sorry, missed this comment (email still weird)
Yeah, I think it's PCRE. See https://www.mediawiki.org/wiki/Updating_to_PCRE_8.33_or_Higher : The extension Scribunto requires PCRE 8.33 or higher.
I added a hack timeout above which limits calls to 5 seconds. I'll figure out a better way to generalize this in the coming days, but for now, there is no alternative.
I'm still not sure which way to go. I'm leaning back to JNI as porting PCRE is a gargantuan effort and it looks like PCRE itself is moving to PCRE2 (which Scribunto may move to in the future)
I have just merged the timeout change into my version and there seems to be a small issue.
The page 'Lausanne_Conservatory' does terminate (with an error about scribunto timeout).
However, as I run in a debugging environment most of the time (and xowa-http), the list of active threads should (once a page generation request has completed) show only one thread
'thread:xowa.http_server.server' running
Every time I request 'Lausanne_Conservatory', I end up with an additional thread
'scribunto' running
The actual 'scribunto' thread(s) do(es) not terminate
Every time I request 'Lausanne_Conservatory', I end up with an additional thread 'scribunto' running The actual 'scribunto' thread(s) do(es) not terminate
Yeah, I think the problem are the thread-locks throughout Xow_page_cache. For example, I commented this one out b/c it was failing in debug mode: https://github.com/gnosygnu/xowa/blob/27756f805626204edb92a3f8921fac72c3793deb/400_xowa/src/gplx/xowa/wikis/caches/Xow_page_cache.java#L64
I'll add another hack sometime tomorrow morning so as not to use thread.interrupt
. Something like Globals.Interrupt = true
with the MatchState
checking this.
The alternative is to remove some of the thread-locks in Xow_page_cache, but that requires some rearchitecting. Also, I'm not sure why they get triggered (it looks like exceptions force the code to go throught the SWT thread pool, and I couldn't figure out why)
Breaking into the 'scributo' thread, it looks to be in the same interminable loop as 'Lausanne_Conservatory' was before this change. So I am not sure how it would tidy itself up
I have tried to do a number of html builds of enwikivoyage and am getting a lot of 'invoke failed:' and 'thread.failed:' on many pages, each run produces a different set of results
This hints to me that there is some threading interference - and given that a new thread has been introduced, I would guess the 'scribunto' thread as the likely cause
As the change is effectively restricted to Scrib_invoke_func.java, I am going to revert the code and check the html build
As the change is effectively restricted to Scrib_invoke_func.java, I am going to revert the code and check the html build
Yeah, sorry, the code looks like it won't work. I'm still trying to figure out the issue here, as I think it's related to SWT's thread usage.
Added another hack above to use Thread.stop
instead of Thread.interrupt
. Still looking at a longer term fix....
A few things:
Thread.stop
: Although Thread.stop
is discouraged, I don't see any real risk with calling it (each thread works on its own state). I'll reevaluate this after the next enwiki build, but hoping to leave as isThanks again @desb42 for the initial discovery and the thorough breakdown!
{ "name":"app"
/*
// uncomment below to set app-level default
`, "scribunto":{"timeout":5000, "sleep":0, "regexEngine":"luaj"}`
*/
, "wikis":
[
/*
`keys` can list multiple wiki domains
*/
{ "keys":["en.wikipedia.org"]
/*
// uncomment below to set wiki-level default
`, "scribunto":{"timeout":5000, "sleep":0, "regexEngine":"luaj"}`
*/
, "scribunto.modules":
[
{ "keys":["Authority_control"]
/*
// uncomment below to set module-level default
`, "scribunto":{"timeout":5000, "sleep":0, "regexEngine":"luaj"}`
*/
, "funcs":
[
/*
specify config at `wiki.module.function` level
*/
{ "keys":["authorityControl", "tlsLink"]
, "scribunto":{"timeout":5000, "sleep":0, "regexEngine":"luaj"}
/*
specify config at `wiki.module.function.page` level
*/
, "pages":
[
]
}
]
/*
specify config at `wiki.module.page` level
*/
, "pages":
[
{ "keys":[]
, "funcs":
[
]
}
]
}
]
}
]
}
Marking this resolved as it worked okay during the full build of 2020-06. Actually fixing the long regex is tracked by this issue: https://github.com/gnosygnu/xowa/issues/749
Just been trying to build enwiki 2020-05-01 and after 4 days found that the process had not terminated.
In this case, tmp/xowa/xomp_006/xomp_wkr.sqlite3 did not close
I have determined that on accessing
en.wikipedia.org/wiki/Lausanne_Conservatory
within xowa-gui this also does not completeLooking further the error seems to be related to Authority_control
The specific is that this page has a wikidata page that contains a particular property that is causing this issue Proprerty:P1362 - Theaterlexikon der Schweiz ID
Within Module:Authority_control there are the lines:
It seems that the ustring.match is where the unending loop occurs
I also check so see where else this property is used and one such example is
en.wikipedia.org/wiki/Richard_Wagner
This page displays OKMy thinking is that it has something to do with the length of the source string In the case of LausanneConservatory the source string is `SPAD-_Section_professionnelle_d'art_dramatique_du_Conservatoire_de_Lausanne,_Lausanne_VD` I think it has something to do with this length being greater than 56 (in the code)
Looking at mediawiki the error
The TLS id SPAD_-_Section_professionnelle_d'art_dramatique_du_Conservatoire_de_Lausanne,_Lausanne_VD is not valid.
occurs within the Authority control section of the page