Closed vr8hub closed 8 months ago
Correct, we want to match two consecutive periods, but not when they're part of an ellipsis. I think [^\\.]\\.\\.[^\\.] will work instead of a lookbehind + lookahead, which will be very slow.
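A minimal sketch of how the character-class pattern behaves, assuming Python's re module and the raw-string form of the regex (single backslashes). The helper name class_matches is made up for illustration. Because the two classes consume the neighbouring characters, the match includes them, and the pattern cannot match when the periods sit at the very start or end of the string.

```python
import re

# Raw-string form of the character-class pattern suggested above.
CLASS_PATTERN = re.compile(r"[^.]\.\.[^.]")


def class_matches(text):
    """Return every substring matched by the character-class pattern."""
    return [m.group(0) for m in CLASS_PATTERN.finditer(text)]


print(class_matches("Wait.. what"))   # ['t.. ']  neighbours are part of the match
print(class_matches("Wait... what"))  # []        an ellipsis is skipped
print(class_matches("Wait.."))        # []        nothing follows the periods
```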
I've made the change with the above, but it's the opposite, actually. The lookaround assertions take 30 steps to process three sentences, one with two, one with three, and one with four periods. The character-class version above takes 206 steps. (Step counts from regex101.com.)
I would not rely on that site for speed tests, because I presume their engine is implemented in JS, but we're using either Python's or lxml's regex engines, which are different. In my experience lookahead/behind is much, much slower. You could try a short stress test on the actual toolset to see.
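One way such a stress test might look, as a sketch only: it assumes Python's re engine (not lxml's), and SAMPLE, count_matches, and time_pattern are hypothetical names, with the sample text invented for the purpose. No timings are quoted here because they depend on the machine and Python version.

```python
import re
import timeit

CLASS_PATTERN = re.compile(r"[^.]\.\.[^.]")
LOOKAROUND_PATTERN = re.compile(r"(?<!\.)\.\.(?!\.)")

# Hypothetical sample: sentences with two, three, and four periods, repeated.
SAMPLE = "One.. two. Three... four. Five.... six. " * 2000


def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


def time_pattern(pattern, text, repeat=5):
    """Return the best wall-clock time (seconds) of several full scans."""
    return min(timeit.repeat(lambda: count_matches(pattern, text),
                             number=1, repeat=repeat))


if __name__ == "__main__":
    for name, pattern in [("character class", CLASS_PATTERN),
                          ("lookaround", LOOKAROUND_PATTERN)]:
        print(name, count_matches(pattern, SAMPLE), time_pattern(pattern, SAMPLE))
```

Both patterns should report the same match count on this sample, so any difference in the timings reflects the engine rather than the result.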
We've already passed my level of care. :) But …
Lookaheads can be slower, but it depends greatly on the regex. They can also be faster. And a regex with a lookahead can be inefficient for reasons that have nothing to do with the lookahead.
But we're still past my level of care :), and I submitted the PR using the above.
The regex here is \\.\\.[^\\.], but I can't figure out why the negative character class. Is the intent not to match three in a row? (As an aside, periods don't have to be escaped in a character class.)
If that is the intent, the regex as written doesn't work: it doesn't match the first two of three, but it does match the last two, since the character following the third one isn't a period. Same for any number in a row more than two; it will always match the last two (even if the following character is EOL).
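The behaviour described above can be checked with a short sketch, assuming Python's re module and the raw-string form of the regex. The helper name spans is hypothetical. It also assumes "EOL" means a newline character: a negated class matches a newline, but it cannot match when nothing at all follows the final period.

```python
import re

# Raw-string form of the regex under discussion.
CURRENT_PATTERN = re.compile(r"\.\.[^.]")


def spans(text):
    """Return (start index, matched text) for each match."""
    return [(m.start(), m.group(0)) for m in CURRENT_PATTERN.finditer(text)]


print(spans("a.. b"))    # [(1, '.. ')]   two periods: matched as intended
print(spans("a... b"))   # [(2, '.. ')]   three periods: the last two match
print(spans("a.... b"))  # [(3, '.. ')]   four periods: again the last two
print(spans("a...\n"))   # [(2, '..\n')]  a newline counts as "not a period"
print(spans("a..."))     # []             end of string: nothing left to consume
```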
To match two and not three (or more), both a negative lookbehind and a negative lookahead are needed, e.g. (?<!\\.)\\.{2}+(?!\\.). That finds two where neither the preceding nor following character is a period, thus preventing the last two from matching since they are preceded by one. But, as noted, it lets through any run longer than two, not just a run of three.
If allowing three was not the intent, then the negative class can be eliminated entirely, leaving just \\.\\.
Let me know, and I'll change it to whichever it should be.
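For reference, a sketch of the two-but-not-three lookaround pattern, assuming Python's re module. It is written with {2} rather than {2}+: the trailing + is a possessive quantifier, which Python's re only accepts from version 3.11 onward, and it makes no difference to what a fixed count of two matches. The names EXACTLY_TWO and find_double_periods are hypothetical.

```python
import re

# Two periods with no period immediately before or after them.
# Uses {2} instead of {2}+ so the pattern also compiles before Python 3.11.
EXACTLY_TWO = re.compile(r"(?<!\.)\.{2}(?!\.)")


def find_double_periods(text):
    """Return the start index of every run of exactly two periods."""
    return [m.start() for m in EXACTLY_TWO.finditer(text)]


print(find_double_periods("a.. b"))    # [1]
print(find_double_periods("a... b"))   # []   ellipsis is left alone
print(find_double_periods("a.... b"))  # []   so is any longer run
print(find_double_periods(".."))       # [0]  works at the string boundaries
```

Unlike a character-class version, the lookarounds consume nothing, so the match is just the two periods and it still works at the start or end of the text.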