Question: Why an interpolated regex does not work

st-clair-clarke commented 1 year ago

I have the following that works:

export const phoneNorthAmerica = XRegExp.tag('nx')`
      ^\(?(?<area>      [2-9][0-8][0-9]  )\)?[-. ]?
          (?<exchange>  [2-9][0-9]{2}    )[-. ]?
          (?<station>   [0-9]{4}         )$
   `

The following does NOT work. Can you say why?

// the regex string above minus the anchors ^ and $
export const phoneNorthAmericaExpr = `
      \(?(?<area>      [2-9][0-8][0-9]  )\)?[-. ]?
          (?<exchange>  [2-9][0-9]{2}    )[-. ]?
          (?<station>   [0-9]{4}         )
   `

// The following interpolates the phoneNorthAmericaExpr, adding back the anchors ^ and $
// However, it does not work
export const phoneNorthAmerica = XRegExp.tag('nx')`
      ^${phoneNorthAmericaExpr}$
   `

Thanks

slevithan commented 1 year ago

Two issues:

The first is that, like in your issue #354, you forgot to either use String.raw or escape your backslashes in your string phoneNorthAmericaExpr.

The second is that, when you interpolate a string (rather than a regex) into an XRegExp.tag pattern, the string's special characters are escaped so that they are matched as literal characters. It is (literally) passed through XRegExp.escape. This is documented, albeit not very prominently. The docs for XRegExp.tag say "interpolated strings have their special characters escaped".
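Both issues can be reproduced with plain JavaScript. The sketch below uses a shortened pattern rather than the full phone regex, and `escapeForRegex` is a hypothetical, simplified stand-in for the escaping step, not XRegExp's actual implementation.

```javascript
// Issue 1: an untagged template literal drops the backslash from "\(" and "\)".
const cooked = `\(?[0-9]{3}\)?`;
const raw = String.raw`\(?[0-9]{3}\)?`;

console.log(cooked); // (?[0-9]{3})?
console.log(raw);    // \(?[0-9]{3}\)?

// The cooked string is no longer a valid pattern at all.
let cookedError = null;
try {
  new RegExp(cooked);
} catch (err) {
  cookedError = err; // SyntaxError: "(?[" is not a valid group
}

// Issue 2: escaping a string turns every metacharacter into a literal.
// Hypothetical helper for illustration only (assumption, not the library's code).
const escapeForRegex = (text) => text.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');

const asLiteral = new RegExp(`^${escapeForRegex(raw)}$`);
const asPattern = new RegExp(`^${raw}$`);

console.log(asLiteral.test('(555)')); // false
console.log(asLiteral.test(raw));     // true: only the pattern's own text matches
console.log(asPattern.test('(555)')); // true
```

So even with the backslashes preserved, an escaped string only matches its own literal text, which is why the interpolated version fails.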

The following would work without the unexpected gotcha, since it interpolates a regex rather than a string:

const regex = XRegExp.tag('x')`^ ${/.../} $`;

// or
const regex1 = XRegExp('...');
const regex2 = /.../;
const regex3 = XRegExp.tag('x')`^ ${regex1} ${regex2} $`;

So to avoid this issue, I'd recommend interpolating regexes (which you can create as native or XRegExp regexes) instead of strings with XRegExp.tag. Alternatively, you can use XRegExp.build (with strings and/or regexes) since build composes the strings and/or regexes into one pattern (rather than interpolating) and therefore doesn't escape strings.

Aside: There is other documented but potentially unexpected behavior when interpolating. From the docs: "Interpolated patterns are treated as atomic units when quantified, interpolated strings have their special characters escaped, a leading ^ and trailing unescaped $ are stripped from interpolated regexes if both are present, and any backreferences within an interpolated regex are rewritten to work within the overall pattern."

These behaviors all have their reasons and IMO are all desirable if understood and remembered. E.g., de-anchoring subpatterns allows embedding independently useful anchored regexes (and it almost never makes sense to interpolate a fully anchored pattern; you can double the anchors if you really need them). If I recall, the reasoning behind escaping interpolated strings is that errant special characters in a string, such as a trailing unescaped backslash or a closing square bracket, can break or change the meaning of the surrounding regex.

But I've come to the opinion that the behavior of XRegExp.tag when interpolating is too magical/unexpected and it would be better to remove this behavior (allowing users to more easily shoot themselves in the foot, but at least not be surprised by unexpected/undesired behavior). The exception is that the behavior of rewriting backreferences for the overall pattern should stay since that is clearly good/useful/powerful.

Alas, changing the behavior of interpolated strings (and/or other behaviors mentioned above) would be a breaking change so is not likely to happen anytime soon, at least not without help from contributors via PRs. However, I've never seen people run into problems with the super-edge-case behavior/enhancements of embedded regexes, so really it's just interpolated strings being escaped that is a significant potential gotcha.
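The concern about errant special characters can be illustrated with native string concatenation, which does no escaping at all. This is only a sketch of the hazard; `compose` is a hypothetical helper and none of this is XRegExp code.

```javascript
// Naive composition: the fragment is pasted into the pattern unescaped.
const compose = (fragment) => new RegExp('^' + fragment + '$');

// A trailing backslash swallows the closing anchor: the pattern becomes ^abc\$
const trailing = compose('abc\\');
console.log(trailing.source);        // ^abc\$
console.log(trailing.test('abc$'));  // true: \$ now means a literal dollar sign
console.log(trailing.test('abc\\')); // false: the intended text no longer matches

// A stray closing bracket ends a character class early: the pattern becomes [a]b]
const inClass = new RegExp('[' + 'a]b' + ']');
console.log(inClass.test('ab]')); // true: "a", then the literal text "b]"
console.log(inClass.test('b'));   // false: "b" is not treated as a class member
```

In both cases the fragment silently changes what the surrounding pattern means, which is the failure mode that escaping interpolated strings guards against.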

st-clair-clarke commented 1 year ago

Thanks a million. I am going through your book now and have not read everything yet, so there will be some gotchas. The reason for this one is that the regex for validating a phone number and the regex for searching for a phone number in a document or other text are quite similar, as you pointed out in the book: the ^ and $ anchors are replaced with \b. I was trying NOT to repeat the common part of the regex.
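As a rough illustration of that use case with native RegExp only (no XRegExp, so no named groups or free-spacing), the shared part can be written once and wrapped two ways. The names `core`, `validate`, and `search` are made up for this sketch.

```javascript
// Shared part of the pattern, kept raw so the backslashes survive.
const core = String.raw`\(?[2-9][0-8][0-9]\)?[-. ]?[2-9][0-9]{2}[-. ]?[0-9]{4}`;

// Validation: the whole input must be a phone number.
const validate = new RegExp(`^${core}$`);

// Searching: the number may appear anywhere, delimited by word boundaries.
// String.raw matters here too; in a plain template literal "\b" is a backspace.
const search = new RegExp(String.raw`\b${core}\b`);

const text = 'Call 212-555-0123 today';
console.log(validate.test('212-555-0123')); // true
console.log(validate.test(text));           // false
console.log(search.exec(text)[0]);          // 212-555-0123
```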

I appreciate your help. Thanks again.

slevithan commented 1 year ago

Yeah, that's a great use case.

Alas, either use XRegExp.build (which doesn't automatically take raw strings so it's slightly less pretty than .tag) or store the reusable part of the pattern as a regex (rather than a string) if you want to use .tag.

Sorry for the unexpected surprise with escaped interpolated strings. :) Like I said above, I realize now that it's not intuitive.

st-clair-clarke commented 1 year ago

Thanks. I don't see it as a surprise though. I put it all down to my lack of understanding of regex at the moment. I am getting better at it though. Thanks for your help.

st-clair-clarke commented 1 year ago

With your advice above, I have done some refactoring and will use this to deal with the dual regex needed when there is a difference in validation and searching docs. I do like the concept of tags though.

import XRegExp from 'xregexp'
import { match } from 'ts-pattern'

export const phoneNorthAmericanRegex = ( searchType: string): RegExp | string => {
   const areaCode = XRegExp.tag('nx')`     (?<area>      [2-9][0-8][0-9]  )`
   const exchangeCode = XRegExp.tag('nx')` (?<exchange>  [2-9][0-9]{2}    )`
   const stationCode = XRegExp.tag('nx')`  (?<station>   [0-9]{4}         )`

   return match(searchType)
      .returnType<RegExp | string>()
      .with('validation', () => {
         return XRegExp.tag('nx')`
          ^\(? ${areaCode}     \)?[-. ]?
               ${exchangeCode} [-. ]?
               ${stationCode}  $`
      })
      .with('docs', () => {
         return XRegExp.tag('nx')`
          \b\(? ${areaCode}     \)?[-. ]?
               ${exchangeCode} [-. ]?
               ${stationCode}  \b`
      })
      .otherwise(
         () =>
            `Non-existent North America phone number with search option '${searchType}'`,
      )
}

slevithan commented 1 year ago

Cool. I like it.

Another option might be to have more generic functions like withValidation and withBoundaries. Ex:

const withValidation = (regex) => {
  return XRegExp.tag(regex.flags)`^(?:${regex})$`;
}
const withBoundaries = (regex) => {
  return XRegExp.tag(regex.flags)`\b(?:${regex})\b`;
}

I've included the non-capturing groups (?: ... ) so that this still works correctly if given regexes with top-level alternation like /ab|cd/. But it was just to be able to explain that point; the non-capturing groups aren't actually needed since XRegExp.tag wraps interpolated regexes automatically! In other words, you only need ^${regex}$ and \b${regex}\b. (From the docs I quoted above: "Interpolated patterns are treated as atomic units when quantified". That also means you can do things like XRegExp.tag('x')` . ${regex} ?` and the entire embedded ab|cd would be optional; not just the letter d.)
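To see what the non-capturing group protects against when nothing wraps the embedded pattern automatically, here is a sketch using native RegExp and `regex.source`. The helper names `anchorUngrouped` and `anchorGrouped` are hypothetical and are not part of XRegExp.

```javascript
// Native composition does not wrap the embedded pattern for you.
const anchorUngrouped = (regex) => new RegExp(`^${regex.source}$`, regex.flags);
const anchorGrouped = (regex) => new RegExp(`^(?:${regex.source})$`, regex.flags);

const alternation = /ab|cd/;

// Without the group the result is ^ab|cd$, i.e. "starts with ab" OR "ends with cd".
console.log(anchorUngrouped(alternation).test('abX')); // true
console.log(anchorUngrouped(alternation).test('Xcd')); // true

// With the group the anchors apply to the whole alternation.
console.log(anchorGrouped(alternation).test('abX')); // false
console.log(anchorGrouped(alternation).test('cd'));  // true
```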

The code above relies on RegExp.prototype.flags, which doesn't exist in ancient browsers. You can get flags without that but it would be uglier.
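One possible fallback, sketched here as an assumption rather than anything XRegExp provides, is to rebuild the flag string from the individual boolean properties. `getFlags` is a hypothetical name, and the sketch only covers the g, i, m, u, and y flags.

```javascript
// Rebuild a flags string without RegExp.prototype.flags.
// Properties missing in an old engine are undefined and simply skipped.
const getFlags = (regex) => {
  let flags = '';
  if (regex.global) flags += 'g';
  if (regex.ignoreCase) flags += 'i';
  if (regex.multiline) flags += 'm';
  if (regex.unicode) flags += 'u';
  if (regex.sticky) flags += 'y';
  return flags;
};

console.log(getFlags(/a/gi)); // gi
console.log(getFlags(/a/));   // (empty string)
```

The result could then stand in for `regex.flags` in helpers like the ones above.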

Not saying this approach is better than what you shared. Just throwing more ideas out there.

Edit: Removed flag x from my example functions so that the flag doesn't conflict with provided regexes which might include meaningful whitespace.

st-clair-clarke commented 1 year ago

I like it. It would be closer to what I originally thought about NOT repeating myself. Let me give it a try and see how it comes out. Ancient browsers are not a problem for what I am doing. So no fear.

st-clair-clarke commented 1 year ago

After refactoring to include your latest suggestion:

export const phoneNorthAmericanRegex = ( searchType = 'validation'): RegExp | string => {
   const areaCode = XRegExp.tag('nx')`     (?<area>      [2-9][0-8][0-9]  )`
   const exchangeCode = XRegExp.tag('nx')` (?<exchange>  [2-9][0-9]{2}    )`
   const stationCode = XRegExp.tag('nx')`  (?<station>   [0-9]{4}         )`

   const baseRegex = XRegExp.tag('nx')`
            \(? ${areaCode}     \)?[-. ]?
                ${exchangeCode} [-. ]?
                ${stationCode}  `

   return match(searchType)
       .returnType<RegExp | string>()
       .with('validation', () => {
          return XRegExp.tag('nx')`${withValidation(baseRegex)}`
       })
       .with('docs', () => {
          return XRegExp.tag('nx')`${withBoundaries(baseRegex)}`
       })
       .otherwise(
           () =>
               `Non-existent North America phone number with search option '${searchType}'`,
       )
}

Great suggestion. Thanks.

slevithan commented 1 year ago

Cool, glad it was helpful. An observation though is that if you used my example functions without change, you don't need to wrap the regexes a final time with XRegExp.tag since the withValidation and withBoundaries functions already return regexes.

st-clair-clarke commented 1 year ago

Yep! You are correct. Made the adjustment. Thanks

slevithan / xregexp

Question: Why an interpolated regex does not work #355