no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License
817 stars, 65 forks

Match negation group? #131

Closed nhnicwaller closed 4 years ago

nhnicwaller commented 4 years ago

I'm using Moo to construct a lexer for ER7, the better-known legacy encoding scheme for HL7 v2 messages. Everything is going swimmingly, except for one tiny detail.

One complicating factor with ER7 is that the message header can reconfigure the delimiters used for the rest of the message. A message using the default delimiters:

MSH|^~\&|Field 1|p1^p2^p3^p4|

A message with non-standard delimiters:

MSH|*~\&|Field 1|p1*p2*p3*p4|

Of course, converting these delimiters into normalized tokens is what Moo is great at!

export function compileLexer(delim): moo.Lexer {
  return moo.compile({
    fieldDivider: delim.fieldSeparator,
    repeatField: delim.repeatSeparator,
    componentDivider: delim.componentSeparator,
    subcomponentDivider: delim.subcomponentSeparator,

That works great, until I want to detect a span of plain text between delimiter tokens. Currently I'm just hardcoding this pattern based on the default delimiters used in HL7. 🙃

    text: /[^|~^&\\\r\n]+/,

Clearly the text span should also respect the delimiters used in the message. But I can't safely drop random characters into a regular expression character class here, because some characters (i.e. ^, -, ], \) have special meanings! I know that it is possible for me to escape arbitrary characters into this character class, but at that point it seems like something Moo should be handling for me.

"You might be able to go faster still by writing your lexer by hand rather than using RegExps, but that's icky. Oh, and it avoids parsing RegExps by itself. Because that would be horrible."

I guess what I'm looking for is a rule roughly like this.

    text: {not: ['^', '|', '~', '^', '&'], quantifier: '*?'},

Am I on the right track? Is there already a better way to do this with Moo that I'm not aware of?

tjvr commented 4 years ago

"I can't safely drop random characters into a regular expression character class here, because some characters have special meanings!"

I don't think it's unreasonable to construct a RegExp by escaping the delimiter; indeed, this is what Moo does internally when you provide it a string literal. You could copy the reEscape helper that Moo uses to do this.
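For anyone reading along, here is a minimal sketch of that approach. The reEscape body is modelled on Moo's internal helper as I understand it, so treat the exact character set as an assumption and check the Moo source before copying it. textPattern is a hypothetical name, and it assumes every delimiter is a single character, as in ER7.

```typescript
// Escape every character that is special in a RegExp, including the ones
// that matter inside a character class: ^ - ] \
// Assumption: modelled on Moo's internal helper, not copied from it.
function reEscape(s: string): string {
  return s.replace(/[-\/\\^$*+?.()|[\]{}]/g, "\\$&");
}

// Hypothetical helper: build a "plain text" pattern that stops at any
// delimiter or line break. Assumes single-character delimiters.
function textPattern(delimiters: string[]): RegExp {
  const excluded = delimiters.map(reEscape).join("");
  return new RegExp(`[^${excluded}\\r\\n]+`);
}

// Delimiters from the non-standard example message: | * ~ \ &
const text = textPattern(["|", "*", "~", "\\", "&"]);
console.log(text.exec("p1*p2")?.[0]); // prints "p1"
```

The resulting RegExp could then be used as the text rule in the object passed to moo.compile, alongside the string-literal delimiter rules.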

I can't immediately think of a better way to do it. @nathan, am I missing something? :-)

nathan commented 4 years ago

I agree with @tjvr: reEscape seems like the best and simplest way to accomplish this:

text: new RegExp(`[^|~${reEscape(delim.componentSeparator)}&\\\\\r\n]+`)

(If that's too ugly for you, template literals can come to the rescue.)
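One way to read that parenthetical is as a tagged template literal, which removes the doubled backslashes. The sketch below is only an illustration under that assumption: re is a hypothetical tag, not part of Moo, and reEscape is again modelled on Moo's internal helper rather than copied from it.

```typescript
// Assumption: modelled on Moo's internal helper, not copied from it.
function reEscape(s: string): string {
  return s.replace(/[-\/\\^$*+?.()|[\]{}]/g, "\\$&");
}

// Hypothetical tag: the literal parts of the template are used verbatim as
// RegExp source (via strings.raw), and every interpolated value is escaped.
function re(strings: TemplateStringsArray, ...values: string[]): RegExp {
  let source = strings.raw[0];
  values.forEach((value, i) => {
    source += reEscape(value) + strings.raw[i + 1];
  });
  return new RegExp(source);
}

// Component separator taken from the non-standard example message.
const componentSeparator = "*";
const text = re`[^|~${componentSeparator}&\\\r\n]+`;
console.log(text.exec("p1*p2")?.[0]); // prints "p1"
```

Because the tag reads the raw template text, the pattern is written the way it would appear in a RegExp literal, and only the interpolated delimiter goes through reEscape.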

nhnicwaller commented 4 years ago

Okay, thanks for sharing your thoughts! That reEscape helper should do the trick here.