Closed by lightsing 10 months ago
Hello!
Indeed, the issue here is that your pattern includes a `?`, which makes the repetition non-greedy, and that is not supported by Logos at the moment.
I guess a solution might be to match only the start of a raw string, e.g., `r#*"`. Once that has matched, you use a callback to find the first matching end with the same number of `#`.
Thanks! I ended up with something like this; posting it here in case someone ends up searching for the same thing:
/// This callback is fired when the lexer encounters a raw string literal starting symbol
/// matching the regex `r#*"`.
pub(crate) fn raw(
    lexer: &mut logos::Lexer<'input, Token<'input>>,
) -> Result<Self, LexicalError> {
    static RAW_STRING_RE: OnceLock<Regex> = OnceLock::new();
    let re = RAW_STRING_RE.get_or_init(|| {
        Regex::new(r##"^r(?<left_hash>#*)"(?<content>(?:\r\n|[^\r])*?)"(?<right_hash>#*)"##)
            .unwrap()
    });

    let span = lexer.span();
    let input = &lexer.source()[span.start..];
    let captures = match re.captures(input) {
        Some(captures) => captures,
        None => return Err(LexicalError::IncompleteRawStringLiteral(span.into())),
    };

    let whole = captures.get(0).unwrap().as_str();
    let content = captures.name("content").map(|m| m.as_str()).unwrap_or("");
    let left_hash = captures
        .name("left_hash")
        .map(|m| m.as_str().len())
        .unwrap_or(0);
    let right_hash = captures
        .name("right_hash")
        .map(|m| m.as_str().len())
        .unwrap_or(0);

    if left_hash != right_hash {
        return Err(LexicalError::UnmatchedRawStringDelimiter {
            span: (span.start..span.start + whole.len()).into(),
            left: left_hash,
            right: right_hash,
        });
    }

    // The opener is already part of the token, so only bump past the rest.
    lexer.bump(whole.len() - span.len());

    Ok(Self::Raw {
        content,
        level: left_hash,
    })
}
Thanks @lightsing! Could you also post the enum declaration with its variants so we have a complete example?
However, regarding your code, could you not just search for the same number of `#`? I am not sure about the syntax rules for raw strings, but I guess `r###"` must be terminated by exactly `"###`?
If so, then the callback could be:
/// This callback is fired when the lexer encounters a raw string literal starting symbol
/// matching the regex `r#*"`.
pub(crate) fn raw(
    lexer: &mut logos::Lexer<'input, Token<'input>>,
) -> Result<Self, LexicalError> {
    let span = lexer.span();
    // The matched opener is `r`, then the hashes, then `"`.
    let count = span.end - span.start - 2; // Number of '#'
    let remaining = &lexer.source()[span.end..];

    // Closing delimiter: a quote followed by `count` hashes.
    let mut pattern = String::with_capacity(count + 1);
    pattern.push('"');
    for _ in 0..count {
        pattern.push('#');
    }

    let end = match remaining.find(&pattern) {
        Some(end) => end,
        None => return Err(LexicalError::IncompleteRawStringLiteral(span.into())),
    };

    // Consume the content and the closing delimiter.
    lexer.bump(end + count + 1);

    Ok(Self::Raw {
        ...
    })
}
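To make the delimiter search above easier to try outside of a lexer, here is a minimal sketch that works on a plain `&str` using only the standard library. The function name `scan_raw_string` and its return shape are made up for illustration and are not part of Logos; it assumes the input begins with the opener `r#*"` and it does not check for isolated carriage returns.

```rust
/// Hypothetical helper (not part of Logos): given text that starts with a
/// raw-string opener `r#*"`, return `(content, level, total_len)`, where
/// `level` is the number of hashes and `total_len` is the byte length of the
/// whole literal. Returns `None` if there is no opener or no closing delimiter.
fn scan_raw_string(src: &str) -> Option<(&str, usize, usize)> {
    let rest = src.strip_prefix('r')?;
    // Count the '#' characters of the opener ('#' is ASCII, so bytes == chars).
    let level = rest.bytes().take_while(|&b| b == b'#').count();
    let body = rest[level..].strip_prefix('"')?;
    // The closing delimiter is a quote followed by exactly `level` hashes.
    let closing = format!("\"{}", "#".repeat(level));
    let end = body.find(&closing)?;
    // 'r' + hashes + opening quote + content + closing delimiter.
    let total_len = 1 + level + 1 + end + closing.len();
    Some((&body[..end], level, total_len))
}
```

With this sketch, `scan_raw_string("r#\"a\"b\"# tail")` yields the content `a"b`, level 1, and a total length of 8 bytes: the inner quote is skipped because it is not followed by a `#`.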
Thank you @jeertmans, but there is also a rule that rejects an isolated CR (a `\r` not followed by `\n`) appearing in a raw string.
Reference: https://rustwiki.org/en/reference/tokens.html#raw-string-literals
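For readers who want to see the isolated-CR rule on its own, the following is a small sketch of such a check using only the standard library. The name `find_isolated_cr` is hypothetical; it simply reports the byte offset of the first `\r` that is not immediately followed by `\n`.

```rust
/// Hypothetical helper: return the byte offset of the first isolated CR
/// (a `\r` that is not immediately followed by `\n`), or `None` if every
/// `\r` in `content` is part of a `\r\n` pair.
fn find_isolated_cr(content: &str) -> Option<usize> {
    let bytes = content.as_bytes();
    bytes.iter().enumerate().find_map(|(i, &b)| {
        // A CR at the very end of the input also counts as isolated.
        (b == b'\r' && bytes.get(i + 1) != Some(&b'\n')).then_some(i)
    })
}
```

A check like this could be run on the content slice after the closing delimiter has been located, so that a regex is not needed just to enforce this rule.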
Full code:
use std::sync::OnceLock;

use regex::Regex;

#[derive(Debug, Copy, Clone, PartialEq)]
pub enum StringLiteral<'input> {
    Plain(&'input str),
    Raw { content: &'input str, level: usize },
}

impl<'input> StringLiteral<'input> {
    pub(crate) fn plain(lexer: &mut logos::Lexer<'input, Token<'input>>) -> Self {
        // Strip the surrounding quotes from the matched slice.
        let matched = lexer.slice();
        Self::Plain(&matched[1..matched.len() - 1])
    }

    /// This callback is fired when the lexer encounters a raw string literal starting symbol
    /// matching the regex `r#*"`.
    pub(crate) fn raw(
        lexer: &mut logos::Lexer<'input, Token<'input>>,
    ) -> Result<Self, LexicalError> {
        static RAW_STRING_RE: OnceLock<Regex> = OnceLock::new();
        let re = RAW_STRING_RE.get_or_init(|| {
            Regex::new(r##"^r(?<left_hash>#*)"(?<content>(?:\r\n|[^\r])*?)"(?<right_hash>#*)"##)
                .unwrap()
        });

        let span = lexer.span();
        let input = &lexer.source()[span.start..];
        let captures = match re.captures(input) {
            Some(captures) => captures,
            None => return Err(LexicalError::IncompleteRawStringLiteral(span.into())),
        };

        let whole = captures.get(0).unwrap().as_str();
        let content = captures.name("content").map(|m| m.as_str()).unwrap_or("");
        let left_hash = captures
            .name("left_hash")
            .map(|m| m.as_str().len())
            .unwrap_or(0);
        let right_hash = captures
            .name("right_hash")
            .map(|m| m.as_str().len())
            .unwrap_or(0);

        if left_hash != right_hash {
            return Err(LexicalError::UnmatchedRawStringDelimiter {
                span: (span.start..span.start + whole.len()).into(),
                left: left_hash,
                right: right_hash,
            });
        }

        // The opener is already part of the token, so only bump past the rest.
        lexer.bump(whole.len() - span.len());

        Ok(Self::Raw {
            content,
            level: left_hash,
        })
    }
}
Oh ok, I get it :)
If you are worried about performance, I still think there is a better way that does not need to compile a regex at runtime, but I don't have much time to dig into it right now 😅
Anyway, thanks for the example!
By definition the lexer rule is:
Although this is not a regular grammar, I want to emit a token that may be a raw string (it can have unpaired `#`), like this regex: `r#*"(?:\r\n|[^\r])*?"#*`. However, Logos currently does not support non-greedy matching. How can I match such a token?