gnosygnu / xowa

xowa offline wiki application
Other
374 stars 41 forks source link

endless loop in en.wikipedia.org/wiki/Lausanne_Conservatory #737

Closed desb42 closed 4 years ago

desb42 commented 4 years ago

Just been trying to build enwiki 2020-05-01 and after 4 days found that the process had not terminated.

In this case, tmp/xowa/xomp_006/xomp_wkr.sqlite3 did not close

I have determined that on accessing en.wikipedia.org/wiki/Lausanne_Conservatory within xowa-gui this also does not complete

Looking further the error seems to be related to Authority_control

The specific is that this page has a wikidata page that contains a particular property that is causing this issue Proprerty:P1362 - Theaterlexikon der Schweiz ID

Within Module:Authority_control there are the lines:

function p.tlsLink( id )
    local id2 = id:gsub(' +', '_')
    --P1362's format regex: \p{Lu}[\p{L}\d_',\.\-\(\)\*/–]{3,59} (e.g. Abcd)
    local class = "[%a%d_',%.%-%(%)%*/–]"
    local regex = "^%u"..string.rep(class, 3)..string.rep(class.."?", 56).."$"
    if not mw.ustring.match( id2, regex ) then
        return false
    end
    return '[http://tls.theaterwissenschaft.ch/wiki/'..id2..' '..id..']'..p.getCatForId( 'TLS' ) --no https as of 9/2019
end

It seems that the ustring.match is where the unending loop occurs

I also check so see where else this property is used and one such example is en.wikipedia.org/wiki/Richard_Wagner This page displays OK

My thinking is that it has something to do with the length of the source string In the case of LausanneConservatory the source string is `SPAD-_Section_professionnelle_d'art_dramatique_du_Conservatoire_de_Lausanne,_Lausanne_VD` I think it has something to do with this length being greater than 56 (in the code)

Looking at mediawiki the error The TLS id SPAD_-_Section_professionnelle_d'art_dramatique_du_Conservatoire_de_Lausanne,_Lausanne_VD is not valid. occurs within the Authority control section of the page

gnosygnu commented 4 years ago

Thanks for the amazing detail, and sorry for the delay. My email has been weird again.

Will pull down 2020-05 enwiki today, and try to reproduce. Probably comment again sometime tomorrow.

Thanks!

gnosygnu commented 4 years ago

I looked at this more, and this will be difficult.

I'm not entirely sure yet, but I believe the problem is with the pattern in tlsLink. It is a very long string of ? matches which leads to a lot of backtracking. The PHP parser handles this "optimally", but the same pattern / regex chokes the Lua and vanilla Luaj pattern processor as well as the Java regex engine. See reduced test samples below, as well as p.test_727 at https://en.wikipedia.org/wiki/Module:Sandbox/Gnosygnu.

As such, the only fixes are:

I'm leaning towards the PHP Regex Engine in Java, but need to think about this more.


LuaJ sample: 35 expressions -> 2m 31s

@Test public void Regex() {
  String src = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz";
  String pat = "[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]??[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?[%a]?$";
  Varargs args = LuaValue.varargsOf(new LuaValue[] {LuaValue.valueOf(src), LuaValue.valueOf(pat), LuaValue.valueOf("1")});
  Varargs actl = new StringLib.gsub().invoke(args);
  // 2m 31s
  System.out.println(actl);
}

Java sample: 35 expressions -> 3m 52s

@Test public void Regex() {
  String src = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz";
  String pat = "[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]??[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?[\\p{L}]?$";
  Pattern pattern = Pattern.compile(pat);
  Matcher matcher = pattern.matcher(src);
  // 3m 52s
  System.out.println(matcher.matches());
}

PHP sample: 56 expressions -> 1s

<?php
$string = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz';
$pattern = "[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?[\p{L}]?$";
$replacement = '1';
echo preg_replace($pattern, $replacement, $string);
echo 'done';
?>
desb42 commented 4 years ago

I took a quick look at the PHP regex and it seems that it uses https://github.com/kkos/oniguruma

which may well be PCRE

gnosygnu commented 4 years ago

I took a quick look at the PHP regex and it seems that it uses https://github.com/kkos/oniguruma

which may well be PCRE

Sorry, missed this comment (email still weird)

Yeah, I think it's PCRE. See https://www.mediawiki.org/wiki/Updating_to_PCRE_8.33_or_Higher : The extension Scribunto requires PCRE 8.33 or higher.

I added a hack timeout above which limits calls to 5 seconds. I'll figure out a better way to generalize this in the coming days, but for now, there is no alternative.

I'm still not sure which way to go. I'm leaning back to JNI as porting PCRE is a gargantuan effort and it looks like PCRE itself is moving to PCRE2 (which Scribunto may move to in the future)

desb42 commented 4 years ago

I have just merged the timeout change into my version and there seems to be a small issue.

The page 'Lausanne_Conservatory' does terminate (with an error about scribunto timeout).

However, as I run in a debugging environment most of the time (and xowa-http), the list of active threads should (once a page generation request has completed) show only one thread 'thread:xowa.http_server.server' running

Every time I request 'Lausanne_Conservatory', I end up with an additional thread 'scribunto' running The actual 'scribunto' thread(s) do(es) not terminate

gnosygnu commented 4 years ago

Every time I request 'Lausanne_Conservatory', I end up with an additional thread 'scribunto' running The actual 'scribunto' thread(s) do(es) not terminate

Yeah, I think the problem are the thread-locks throughout Xow_page_cache. For example, I commented this one out b/c it was failing in debug mode: https://github.com/gnosygnu/xowa/blob/27756f805626204edb92a3f8921fac72c3793deb/400_xowa/src/gplx/xowa/wikis/caches/Xow_page_cache.java#L64

I'll add another hack sometime tomorrow morning so as not to use thread.interrupt. Something like Globals.Interrupt = true with the MatchState checking this.

The alternative is to remove some of the thread-locks in Xow_page_cache, but that requires some rearchitecting. Also, I'm not sure why they get triggered (it looks like exceptions force the code to go throught the SWT thread pool, and I couldn't figure out why)

desb42 commented 4 years ago

Breaking into the 'scributo' thread, it looks to be in the same interminable loop as 'Lausanne_Conservatory' was before this change. So I am not sure how it would tidy itself up

desb42 commented 4 years ago

I have tried to do a number of html builds of enwikivoyage and am getting a lot of 'invoke failed:' and 'thread.failed:' on many pages, each run produces a different set of results

This hints to me that there is some threading interference - and given that a new thread has been introduced, I would guess the 'scribunto' thread as the likely cause

As the change is effectively restricted to Scrib_invoke_func.java, I am going to revert the code and check the html build

gnosygnu commented 4 years ago

As the change is effectively restricted to Scrib_invoke_func.java, I am going to revert the code and check the html build

Yeah, sorry, the code looks like it won't work. I'm still trying to figure out the issue here, as I think it's related to SWT's thread usage.

gnosygnu commented 4 years ago

Added another hack above to use Thread.stop instead of Thread.interrupt. Still looking at a longer term fix....

gnosygnu commented 4 years ago

A few things:

Thanks again @desb42 for the initial discovery and the thorough breakdown!


{ "name":"app"
/*
// uncomment below to set app-level default
`, "scribunto":{"timeout":5000, "sleep":0, "regexEngine":"luaj"}`
*/
, "wikis":
  [
    /*
    `keys` can list multiple wiki domains
    */
    { "keys":["en.wikipedia.org"]
    /*
    // uncomment below to set wiki-level default
    `, "scribunto":{"timeout":5000, "sleep":0, "regexEngine":"luaj"}`
    */
    , "scribunto.modules":
      [
        { "keys":["Authority_control"]
        /*
        // uncomment below to set module-level default
        `, "scribunto":{"timeout":5000, "sleep":0, "regexEngine":"luaj"}`
        */
        , "funcs":
          [
            /*
            specify config at `wiki.module.function` level
            */
            { "keys":["authorityControl", "tlsLink"]
            , "scribunto":{"timeout":5000, "sleep":0, "regexEngine":"luaj"}
            /*
            specify config at `wiki.module.function.page` level
            */
            , "pages":
              [
              ]
            }
          ]
        /*
        specify config at `wiki.module.page` level
        */
        , "pages":
          [
            { "keys":[]
            , "funcs":
              [
              ]
            }
          ]
        }
      ]
    }
  ]
}
gnosygnu commented 4 years ago

Marking this resolved as it worked okay during the full build of 2020-06. Actually fixing the long regex is tracked by this issue: https://github.com/gnosygnu/xowa/issues/749