standardebooks / web

The source code for the Standard Ebooks website.
https://standardebooks.org
Creative Commons Zero v1.0 Universal

Shorten regex to find probable british2american errors #350

Closed: vr8hub closed this 4 months ago

vr8hub commented 4 months ago

The regex for finding issues after british2american is too restrictive; I've never had it find anything, but I always have errors after running it. The problem is what comes after the em-dash.

It looks for an ldquo followed by text and then an rsquo and em-dash; that part is fine. When this pattern occurs, it's almost always a break in a piece of dialog where the rsquo should have become an rdquo, e.g. “He said something’—some kind of aside—“and then he said something else.”

But the rest of the regex, after the em-dash, has a negative lookahead assertion for an rdquo. An rdquo is almost always present there, so the regex doesn't match any of these cases. In something I'm working on now, the Step-by-Step regex didn't find anything, but the one in this PR found six valid errors.
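To illustrate the difference, here is a minimal Python sketch. Both patterns are hypothetical stand-ins written for this example; they are not the regex from the Step-by-Step guide nor the one in this PR, and the sample lines are invented. The assumption is only that the restrictive version refuses to match when an rdquo appears later on the line, while the shortened version stops at the em-dash.

```python
import re

# Hypothetical patterns for illustration only (not the actual guide or PR regexes).
# Shortened: ldquo, some text with no double quotes, then rsquo + em-dash.
SHORTENED = re.compile(r"“[^“”]+?’—")
# Restrictive: same prefix, plus a negative lookahead that rejects the match
# if any rdquo appears later on the line.
RESTRICTIVE = re.compile(r"“[^“”]+?’—(?![^”]*”)")


def find_candidates(pattern: re.Pattern, lines: list[str]) -> list[str]:
    """Return every substring in `lines` matched by `pattern`, in order."""
    hits = []
    for line in lines:
        hits.extend(match.group(0) for match in pattern.finditer(line))
    return hits


# Invented sample lines: interrupted dialog where the rsquo before the
# em-dash should have been converted to an rdquo.
SAMPLE = [
    "“He said something’—some kind of aside—“and then he said something else.”",
    "“I don’t know’—he shrugged—“but I’ll find out.”",
]

if __name__ == "__main__":
    print("restrictive:", find_candidates(RESTRICTIVE, SAMPLE))
    print("shortened:  ", find_candidates(SHORTENED, SAMPLE))
```

On these samples the restrictive pattern finds nothing, because the resumed dialog always ends with an rdquo, while the shortened pattern flags both lines for review.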

It's possible for the shortened regex to show a false positive, but in my experience they're few and far between. This is one of the cases where, IMO, we want to find as much as possible, so highlighting the occasional false positive that can be ignored is preferable to missing a bunch of actual errors, which is the current situation.

(As another possible improvement, maybe we want to make this an se interactive-replace string they can just cut-and-paste, so they can work through them interactively?)

acabal commented 4 months ago

Thanks! Yes, this would be a good place for an se interactive-replace invocation.