maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0

How to match Rust raw string literal? #351

Closed lightsing closed 10 months ago

lightsing commented 10 months ago

By definition the lexer rule is:

RAW_STRING_LITERAL :
   r RAW_STRING_CONTENT SUFFIX?

RAW_STRING_CONTENT :
      " ( ~ IsolatedCR )* (non-greedy) "
   | # RAW_STRING_CONTENT #

This is not a regular grammar, but I want to emit a token for anything that may be a raw string (it may have unpaired #), as with this regex: r#*"(?:\r\n|[^\r])*?"#*. However, logos currently does not support non-greedy matching:

error: #[regex]: non-greedy parsing is currently unsupported.

How can I match such token?

jeertmans commented 10 months ago

Hello!

Indeed, the issue here is that your pattern includes the non-greedy quantifier *?, which Logos does not support at the moment.

I guess a solution might be to match only the start of a raw string, e.g., r#*". Once that matches, you can use a callback to find the first matching end with the same number of #.

lightsing commented 10 months ago

Thanks! I ended up with something like this; posting it here in case someone ends up searching for the same thing:

/// This callback fires when the lexer encounters the opening of a raw string literal,
/// matched by the regex `r#*"`.
pub(crate) fn raw(
    lexer: &mut logos::Lexer<'input, Token<'input>>,
) -> Result<Self, LexicalError> {
    static RAW_STRING_RE: OnceLock<Regex> = OnceLock::new();
    let re = RAW_STRING_RE.get_or_init(|| {
        Regex::new(r##"^r(?<left_hash>#*)"(?<content>(?:\r\n|[^\r])*?)"(?<right_hash>#*)"##)
            .unwrap()
    });
    let span = lexer.span();
    let input = &lexer.source()[span.start..];
    let captures = match re.captures(input) {
        Some(captures) => captures,
        None => return Err(LexicalError::IncompleteRawStringLiteral(span.into())),
    };
    let whole = captures.get(0).unwrap().as_str();
    let content = captures.name("content").map(|m| m.as_str()).unwrap_or("");
    let left_hash = captures
        .name("left_hash")
        .map(|m| m.as_str().len())
        .unwrap_or(0);
    let right_hash = captures
        .name("right_hash")
        .map(|m| m.as_str().len())
        .unwrap_or(0);
    if left_hash != right_hash {
        return Err(LexicalError::UnmatchedRawStringDelimiter {
            span: (span.start..span.start + whole.len()).into(),
            left: left_hash,
            right: right_hash,
        });
    }
    lexer.bump(whole.len() - span.len());
    Ok(Self::Raw {
        content,
        level: left_hash,
    })
}

jeertmans commented 10 months ago

Thanks @lightsing! Could you also put the enum declaration with its variants so we have a complete example?

However, regarding your code, could you not just search for the same number of #? I am not sure about the syntax rules for raw strings, but I guess r###" must be terminated by exactly "###?

If so, then the callback could be:

/// This callback fires when the lexer encounters the opening of a raw string literal,
/// matched by the regex `r#*"`.
pub(crate) fn raw(
    lexer: &mut logos::Lexer<'input, Token<'input>>,
) -> Result<Self, LexicalError> {
    let span = lexer.span();
    let count = span.end - span.start - 2; // Number of '#': the match is `r`, the hashes, then `"`
    let remaining = &lexer.source()[span.end..];
    // Closing delimiter: a `"` followed by the same number of '#'.
    let mut pattern = String::with_capacity(count + 1);
    pattern.push('"');
    for _ in 0..count {
        pattern.push('#');
    }

    let end = match remaining.find(pattern.as_str()) {
        Some(end) => end,
        None => return Err(LexicalError::IncompleteRawStringLiteral(span.into())),
    };

    lexer.bump(end + count + 1);

    Ok(Self::Raw {
        ...
    })
}

lightsing commented 10 months ago

Thank you @jeertmans, but there is also a rule that rejects an isolated CR (a \r not followed by \n) inside a raw string. Reference: https://rustwiki.org/en/reference/tokens.html#raw-string-literals
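The isolated-CR rule on its own can be checked without a regex. The following is a minimal sketch using only the standard library; `find_isolated_cr` is a hypothetical helper name, not part of logos or of the code in this thread.

```rust
/// Returns the byte offset of the first isolated CR in `content`,
/// i.e. a `\r` that is not immediately followed by `\n`, or `None` if there is none.
/// (Hypothetical helper for illustration only.)
fn find_isolated_cr(content: &str) -> Option<usize> {
    let bytes = content.as_bytes();
    // `\r` and `\n` are ASCII, so scanning bytes is safe for UTF-8 input.
    (0..bytes.len()).find(|&i| bytes[i] == b'\r' && bytes.get(i + 1) != Some(&b'\n'))
}
```

A `\r` at the very end of the content also counts as isolated, because nothing follows it.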

Full code:

use std::sync::OnceLock;

use regex::Regex;

#[derive(Debug, Copy, Clone, PartialEq)]
pub enum StringLiteral<'input> {
    Plain(&'input str),
    Raw { content: &'input str, level: usize },
}

impl<'input> StringLiteral<'input> {
    pub(crate) fn plain(lexer: &mut logos::Lexer<'input, Token<'input>>) -> Self {
        let matched = lexer.slice();
        Self::Plain(&matched[1..matched.len() - 1])
    }

    /// This callback fires when the lexer encounters the opening of a raw string literal,
    /// matched by the regex `r#*"`.
    pub(crate) fn raw(
        lexer: &mut logos::Lexer<'input, Token<'input>>,
    ) -> Result<Self, LexicalError> {
        static RAW_STRING_RE: OnceLock<Regex> = OnceLock::new();
        let re = RAW_STRING_RE.get_or_init(|| {
            Regex::new(r##"^r(?<left_hash>#*)"(?<content>(?:\r\n|[^\r])*?)"(?<right_hash>#*)"##)
                .unwrap()
        });
        let span = lexer.span();
        let input = &lexer.source()[span.start..];
        let captures = match re.captures(input) {
            Some(captures) => captures,
            None => return Err(LexicalError::IncompleteRawStringLiteral(span.into())),
        };
        let whole = captures.get(0).unwrap().as_str();
        let content = captures.name("content").map(|m| m.as_str()).unwrap_or("");
        let left_hash = captures
            .name("left_hash")
            .map(|m| m.as_str().len())
            .unwrap_or(0);
        let right_hash = captures
            .name("right_hash")
            .map(|m| m.as_str().len())
            .unwrap_or(0);
        if left_hash != right_hash {
            return Err(LexicalError::UnmatchedRawStringDelimiter {
                span: (span.start..span.start + whole.len()).into(),
                left: left_hash,
                right: right_hash,
            });
        }
        lexer.bump(whole.len() - span.len());
        Ok(Self::Raw {
            content,
            level: left_hash,
        })
    }
}

jeertmans commented 10 months ago

Oh ok I get it :)

If you are worried about performance, I still think there is a better way that does not need to compile a regex at runtime, but I don't have much time to dig into it right now 😅

Anyway, thanks for the example!
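For reference, a regex-free scan along those lines could be sketched as follows. This is a minimal sketch using only the standard library, not code from either commenter: `scan_raw_string` and `RawError` are hypothetical names, and the function assumes `opener` is exactly the text matched by `r#*"`. It searches for a `"` followed by the same number of `#`, then rejects any isolated CR in the content.

```rust
/// Hypothetical error type for this sketch; not part of logos or the code above.
#[derive(Debug, PartialEq)]
enum RawError {
    /// No `"` followed by the expected number of `#` was found.
    Unterminated,
    /// A `\r` not followed by `\n` at this byte offset inside the content.
    IsolatedCr(usize),
}

/// `opener` is the text matched by `r#*"`; `rest` is the input right after it.
/// On success, returns the content and the number of bytes of `rest` to consume
/// (content plus closing delimiter).
fn scan_raw_string<'a>(opener: &str, rest: &'a str) -> Result<(&'a str, usize), RawError> {
    // Drop the leading `r` and the trailing `"` to get the number of '#'.
    let level = opener.len().saturating_sub(2);
    // Closing delimiter: a `"` followed by `level` hashes.
    let closer = format!("\"{}", "#".repeat(level));
    let end = rest.find(closer.as_str()).ok_or(RawError::Unterminated)?;
    let content = &rest[..end];
    // Reject an isolated CR; `\r` is ASCII, so a byte scan is UTF-8 safe.
    let bytes = content.as_bytes();
    if let Some(i) =
        (0..bytes.len()).find(|&i| bytes[i] == b'\r' && bytes.get(i + 1) != Some(&b'\n'))
    {
        return Err(RawError::IsolatedCr(i));
    }
    Ok((content, end + closer.len()))
}
```

In a logos callback, `opener` would presumably be `lexer.slice()` and `rest` would be `lexer.remainder()`, with `lexer.bump(consumed)` called on success. Unlike the regex version, this sketch accepts a `"` inside the content when the literal uses `#` delimiters, and it does not report a mismatched hash count as a separate error.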