maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0
2.83k stars 112 forks source link

Feature suggestion: Named sub-regexes #109

Closed Evrey closed 4 years ago

Evrey commented 4 years ago

As can be seen in #108 regexes can become quite complicated and long. The regex crate does support this feature of named character sets like [[:digit:]] aliasing [0-9]. However, not all of these are useful for everyone, and for some (or even many?) there's a lack of certain named sets. A set alone may not even be enough, but more complicated patterns may repeatedly come up here and there.

I propose a hack-y feature for logos that would allow users to define reüsable named sub-regexes. In my case, I have this simple rule appearing what feels like a million times:

(?x:(
  \u{000A} # LF
| \u{000B} # VT
| \u{000C} # FF
| (\u{000D} \u{000A}) # CRLF
| \u{0085} # NEL
| \u{2028} # LS
| \u{2029} # PS
))

Just a Unicode-aware and generous line break, with special rules like only allowing CR as combined CRLF.

The feature as I envision it would look something like this:

#[derive(Logos)]
#[logos(named_regex("nl", r#"\u{000A} | \u{000B} | …"#))]
enum Token {
    #[error]
    Error,

    #[regex = r#"//.*{{:nl:}}"#]
    LineComment,
    …
}

So #[logos(named_regex)] defines a custom rule, given the name nl, followed by an arbitrary regex. And whenever logos_derive sees a #[regex = ""] (Edit: or #[trivia]) rule, it would scan for an occurrence of pretty much {{:custom_rule_name:}}, where custom_rule_name is the name given to nl. If found, replace that sub string with (rule) (note the braces for hygiene), where rule are the line breaks in my case.

This would also allow for shorter aliases than the default ones. For example, I never use [[:xdigit:]], as it's just as long as [0-9a-fA-F]. I could alias that to just {{:0x:}} and do that for any number base I want.

Some suggested implementation details:

maciejhirsz commented 4 years ago

This sounds like a good feature to me, though it has to take a back seat given a backlog of other things I want to do or straighten up.

CAD97 commented 4 years ago

There already exists a syntax for named sub-patterns, so we should reuse one of them. PCRE uses (?&name) or (?P>name) to recurse into a named subpattern (or if it was defined in a (?(DEFINE)) block, just to call it). Defining would have to be logos-specific, but referencing the named sub-patterns should use one of those already used syntaxes.

More importantly, logos definitely shouldn't be doing any regex (pre) processing before sticking the regex into regex-syntax, because doing so is just asking for accidental oddities and mismatches with the regex crate.

(tree-sitter does regex preprocessing to get regex to behave more like javascript syntax in respect to brackets. It's a nightmare.)

So the first step is getting regex-syntax to recognize at least one of those syntaxes (both currently fail as an unrecognized flag). At least at the AST level, there's probably no objection to regex-syntax supporting these syntaxes.

And as the requisite bikeshed:

#[derive(Logos)]
#[logos(
    subpattern(nl = r#"\u{000A} | \u{000B} | …"#)
)]
enum Token {
    #[error]
    Error,

    #[regex = r#"//.*(?&nl)"#]
    LineComment,
    …
}

(oh, and I should note that the syntax for POSIX ranges is not [[:name:]] but rather [:name:]; it's just that the notation needs to occur in a character class [] to be used, so you'll most often see it as [[:ascii:]]. But something like [[:ascii:]--\r\n] is valid for the character class of "all ascii except \r or \n".)

CAD97 commented 4 years ago

Alright; regex-syntax (rightly) doesn't want to support syntax that isn't used by regex (https://github.com/rust-lang/regex/issues/666). We can probably handle (?&name) syntax properly with a preprocessing step, as impure as that may be.

maciejhirsz commented 4 years ago

@CAD97 I believe the suggestion here is to pre-process the pattern with regex syntax to make sure it's actual valid regex, but then when it comes to sticking it into the #[regex] just do a string replace.

That way we should be able to avoid all weirdness, but also provide errors pointing at the correct origin.

Evrey commented 4 years ago

Indeed that was the idea. Whether the syntax is {{:pat:}} or (?&pat) doesn't matter to me. If (?&pat) exists out there, pick that one.

CAD97 commented 4 years ago

The problem is that (?&name) or (?P>name) or {{:name:}} or whatever substitution syntax we choose isn't valid regex syntax for the regex crate, and it's important that it isn't, to avoid clashing with valid regex syntax.

The subpattern would be pre-checked, yes, but the string given to the #[regex] attribute would have to have (?&name) replaced with (?:(subpattern)) before being parsed with regex-syntax. I have had bad experiences with trying to preprocess regex before sticking it into the actual regex engine, so I'm somewhat naturally adverse to it.

However, I do think that because (?&name) is such a simple and noncustomizable syntax element that preprocessing it shouldn't be too difficult.

So, sketching the process that would be done:

This is about as principled as we can handle this unless we want to (potentially fork regex-syntax and) maintain our own regex parser. (I wouldn't be against writing and maintaining a "fancy-regex-syntax" or "pcre-syntax" crate that is only concerned with the parsing-to-ast step, but I definitely don't have the bandwidth to start that right now.)

maciejhirsz commented 4 years ago

Yep, that's basically what I thought should be the approach here.

To catch Logos-specific errors (stuff that regex-syntax is happy with, but Logos isn't) it'll be enough to convert regex-syntax' Hir into Logos' Mir, and then throw it away. It should be reasonably cheap.