standardebooks / tools

The Standard Ebooks toolset for producing our ebook files.
Other
1.43k stars 127 forks source link

y-016 #676

Closed vr8hub closed 8 months ago

vr8hub commented 8 months ago

The regex here is \\.\\.[^\\.]. But I can't figure out why the negative character class. Is the intent not to match three in a row? (As an aside, periods don't have to be escaped in a character class.)

If that is the intent, the regex as written doesn't work—it doesn't match the first two of three, but it does match the last two, since the character following the third one isn't a period. Same for any number in a row more than two; it will always match the last two (even if the following character is EOL).

To match two and not three (or more), both a negative lookbehind and a negative lookahead are needed, e.g. (?<!\\.)\\.{2}+(?!\\.). That finds two where neither the preceding nor following character is a period, thus preventing the last two from matching since they are preceded by one. But as noted it allows any number more than two, not just three.

If that was not the intent (to allow three), then the negative class can be eliminated entirely, just leaving \\.\\.

Let me know, and I'll change it to whichever it should be.

acabal commented 8 months ago

Correct, we want to match two consecutive periods but not if they're an ellipses. I think [^\\.]\\.\\.[^\\.] will work instead of a lookbehind + lookahead, which will be very slow.

vr8hub commented 8 months ago

I've made the change with the above, but it's the opposite, actually. The assertions take 30 steps to process three sentences, one with two, one with three, and one with four periods. The above takes 206 steps. (From regex101.com)

acabal commented 8 months ago

I would not rely on that site for speed tests, because I presume their engine is implement in JS, but we're using either Python's or lxml's regex engines which are different. In my experience lookahead/behind is much, much slower. You could try doing a short stress on the actual toolset to see.

vr8hub commented 8 months ago

We've already passed my level of care. :) But …

  1. You specify what engine you want on regex101 when you do your testing. I wouldn't trust it for performance, but # of steps is accurate for the regex engine being tested (in this case, e.g., python is more efficient than PCRE for the above). More steps may only mean fractions of a millisecond, so IRL it may not matter. But it's still indicative of how efficient the regex is for a particular engine.
  2. Moving away from regex101, an assertion only comes into play once a match is made. In this case, the above has to test every character that is not a period, then see if the next character is a period. By the time it gets to an actual period, it's looked at every character at least twice. The lookahead only has to look for a period, then it applies the lookbehind to see that the previous character isn't one, so it's only looked at the characters once.

Lookaheads can be slower, but it depends greatly on the regex. They can also be faster. And a regex with a lookahead can be inefficient for reasons that have nothing to do with the lookahead.

But we're still passed my level of care :), and I submitted the PR using the above.