Closed GoogleCodeExporter closed 9 years ago
If it's case-sensitive (no IGNORECASE flag), the words can be put into a set
and just looked-up when matching. Creating a set is fast.
However, if it's case-insensitive (IGNORECASE flag), it's not possible to use a
set because each letter in a word could be uppercase, lowercase, or titlecase,
and including all of the possible combinations could make it a very big set.
Therefore it falls back to listing the alternatives.
I'll see if I can improve its performance.
As for your final example, that's not one regex, but four:
1. regex.findall(r"\b\L<word>\b", text, regex.I, word=KEYWORDS) compiles a
regex that matches all of the keywords.
2. regex.findall(r"\b\L<word>\b", text, regex.I, word=OPERATORS) compiles a
regex that matches all of the operators.
3. regex.findall(r"\b\L<word>\b", text, regex.I, word=CONSTS) compiles a regex
that matches all of the constants.
4. regex.findall(r"\b\L<word>\b", text, regex.I, word=NAMES) compiles a regex
that matches all of the names.
The \L<...> pattern is just an alternative way of writing
"word1|word2|word3|...", except that:
1. You don't have to worry about the order in which the words are listed.
2. If it's case-insensitive, it's possible to implement it using a set instead.
Original comment by re...@mrabarnett.plus.com
on 23 Oct 2013 at 5:04
i could see fuzzy fall back to BRANCH.
but why case-insensitive either?
when with IGNORECASE:
for example, make all items in string-set lower-case'ed.
when matching, make the look-up text lower-case'ed.
compare to case-sensitive, the only difference is
one more to_lower() call before the look-up,
can this cause performance issue?
Original comment by Lyricconch
on 23 Oct 2013 at 6:02
i try to read the source,
it seems that case-insensitive string-set has been already implemented.
string_set_match_ign_fwdrev at _regex.c#7220
more:
begin at line 7303:
for (i = 0; i < len; i++) {
Py_UCS4 ch;
ch = simple_case_fold(char_at(text, text_pos + offset + i));
set_char_at(folded, i, ch);
}
status = string_set_contains_ign(state, string_set, folded, 0, len,
folded_charsize);
if (status == 1)
/* Advance past the match. */
state->text_pos += inc_len;
this do exact what the last post mentioned.
Original comment by Lyricconch
on 23 Oct 2013 at 6:23
While writing code, you have to decide how to tackle the various details that
arise.
Considering how much time has passed since I wrote that part, I think I'll take
the opportunity to look at it again and see whether I still think I made the
correct decision or should change it.
Original comment by re...@mrabarnett.plus.com
on 23 Oct 2013 at 6:30
yes when i read the source, i found that there are 4 compiled-pattern.
but... without fuzzy or "i"(which cause to be a BRANCH instead of STRING_SET),
the compiled-pattern are all the same except 2 field
(pattern.name_lists and pattern.name_list_index).
i think the unattached pattern worth to be cached,
and pattern with KEYWORDS(and so on...) attached should not, since:
1. compiling the regex is heavyweight, but binding the string_set is not.
cache should focus on the "compile".
2. pattern with difference string_set can be reused, as above.
Original comment by Lyricconch
on 23 Oct 2013 at 6:49
I've fixed the speed issue in regex 2013-10-23.
I'll need to think some more about whether I can avoid re-parsing a pattern if
only the named lists change. I suspect that in practice it doesn't occur that
often, and other regex implementations, which don't have it, have to build and
parse the pattern every time anyway.
Original comment by re...@mrabarnett.plus.com
on 23 Oct 2013 at 9:50
line 7484 and line 7616
may? still have some problem(missing 'case 1'):
switch (folded_charsize) {
case 2:
set_char_at = bytes2_set_char_at;
point_to = bytes2_point_to;
break;
case 4:
set_char_at = bytes4_set_char_at;
point_to = bytes4_point_to;
break;
default:
return 0;
}
Original comment by Lyricconch
on 23 Oct 2013 at 10:13
You're correct. Also added another test to detect it.
Fixed in regex 2013-10-24.
Original comment by re...@mrabarnett.plus.com
on 24 Oct 2013 at 12:05
Original issue reported on code.google.com by
Lyricconch
on 23 Oct 2013 at 9:13