compile '\L<...>' with 'i' flag was very slow

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
python -m timeit -n 1 -s "import regex,os; x=[os.urandom(10) for i in 
xrange(5000)]" "regex.compile(r'\L<x>', regex.I, x=x)"
1 loops, best of 3: 1.03 sec per loop

python -m timeit -n 1 -s "import regex,os; x=[os.urandom(10) for i in 
xrange(5000)]" "regex.compile(r'\L<x>', x=x)"
1 loops, best of 3: 6.26 msec per loop

What is the expected output? What do you see instead?
-

What version of the product are you using? On what operating system?
regex-2013_10_22-py2.7-win32

Please provide any additional information below.
i found that _regex_core.py try to compile string_set as BRANCH, 
and never generate STRING_SET_IGN, 
but _regex.c indeed have "STRING_SET_IGN". seems strange.

more, using \L<x> to avoid regex.compile cost is not working.
for example, i'd like following r"\b\L<word>\b" just compile once and reuse 4 
times.

for word in (KEYWORDS, OPERATORS, CONSTS, NAMES):
     word_list = regex.findall(r"\b\L<word>\b", text, regex.I, word=word)

Original issue reported on code.google.com by Lyricconch on 23 Oct 2013 at 9:13

GoogleCodeExporter commented 9 years ago

If it's case-sensitive (no IGNORECASE flag), the words can be put into a set 
and just looked-up when matching. Creating a set is fast.

However, if it's case-insensitive (IGNORECASE flag), it's not possible to use a 
set because each letter in a word could be uppercase, lowercase, or titlecase, 
and including all of the possible combinations could make it a very big set. 
Therefore it falls back to listing the alternatives.

I'll see if I can improve its performance.

As for your final example, that's not one regex, but four:

1. regex.findall(r"\b\L<word>\b", text, regex.I, word=KEYWORDS) compiles a 
regex that matches all of the keywords.

2. regex.findall(r"\b\L<word>\b", text, regex.I, word=OPERATORS) compiles a 
regex that matches all of the operators.

3. regex.findall(r"\b\L<word>\b", text, regex.I, word=CONSTS) compiles a regex 
that matches all of the constants.

4. regex.findall(r"\b\L<word>\b", text, regex.I, word=NAMES) compiles a regex 
that matches all of the names.

The \L<...> pattern is just an alternative way of writing 
"word1|word2|word3|...", except that:

1. You don't have to worry about the order in which the words are listed.

2. If it's case-insensitive, it's possible to implement it using a set instead.

Original comment by re...@mrabarnett.plus.com on 23 Oct 2013 at 5:04

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

i could see fuzzy fall back to BRANCH. 
but why case-insensitive either?

when with IGNORECASE:
for example, make all items in string-set lower-case'ed.
when matching, make the look-up text lower-case'ed.
compare to case-sensitive, the only difference is 
one more to_lower() call before the look-up, 
can this cause performance issue?

Original comment by Lyricconch on 23 Oct 2013 at 6:02

GoogleCodeExporter commented 9 years ago

i try to read the source, 
it seems that case-insensitive string-set has been already implemented.

string_set_match_ign_fwdrev at _regex.c#7220
more:
begin at line 7303:

        for (i = 0; i < len; i++) {
            Py_UCS4 ch;

            ch = simple_case_fold(char_at(text, text_pos + offset + i));
            set_char_at(folded, i, ch);
        }

        status = string_set_contains_ign(state, string_set, folded, 0, len,
          folded_charsize);

        if (status == 1)
            /* Advance past the match. */
            state->text_pos += inc_len;

this do exact what the last post mentioned.

Original comment by Lyricconch on 23 Oct 2013 at 6:23

GoogleCodeExporter commented 9 years ago

While writing code, you have to decide how to tackle the various details that 
arise.

Considering how much time has passed since I wrote that part, I think I'll take 
the opportunity to look at it again and see whether I still think I made the 
correct decision or should change it.

Original comment by re...@mrabarnett.plus.com on 23 Oct 2013 at 6:30

GoogleCodeExporter commented 9 years ago

yes when i read the source, i found that there are 4 compiled-pattern.

but... without fuzzy or "i"(which cause to be a BRANCH instead of STRING_SET), 
the compiled-pattern are all the same except 2 field
(pattern.name_lists and pattern.name_list_index).

i think the unattached pattern worth to be cached, 
and pattern with KEYWORDS(and so on...) attached should not, since:
1. compiling the regex is heavyweight, but binding the string_set is not.
cache should focus on the "compile".
2. pattern with difference string_set can be reused, as above.

Original comment by Lyricconch on 23 Oct 2013 at 6:49

GoogleCodeExporter commented 9 years ago

I've fixed the speed issue in regex 2013-10-23.

I'll need to think some more about whether I can avoid re-parsing a pattern if 
only the named lists change. I suspect that in practice it doesn't occur that 
often, and other regex implementations, which don't have it, have to build and 
parse the pattern every time anyway.

Original comment by re...@mrabarnett.plus.com on 23 Oct 2013 at 9:50

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

line 7484 and line 7616 
may? still have some problem(missing 'case 1'):

    switch (folded_charsize) {
    case 2:
        set_char_at = bytes2_set_char_at;
        point_to = bytes2_point_to;
        break;
    case 4:
        set_char_at = bytes4_set_char_at;
        point_to = bytes4_point_to;
        break;
    default:
        return 0;
    }

Original comment by Lyricconch on 23 Oct 2013 at 10:13

GoogleCodeExporter commented 9 years ago

You're correct. Also added another test to detect it.

Fixed in regex 2013-10-24.

Original comment by re...@mrabarnett.plus.com on 24 Oct 2013 at 12:05

jamadden / mrab-regex-hg

compile '\L<...>' with 'i' flag was very slow #96