lalrpop / lalrpop

LR(1) parser generator for Rust
http://lalrpop.github.io/lalrpop/
Apache License 2.0
3.04k stars 291 forks source link

Escape sequences don't seem to be handled properly #63

Open Fiedzia opened 8 years ago

Fiedzia commented 8 years ago

This works:

Phrase: String =  <a:r#"""#> <s:r#"[^"]*"#> <b:r#"""#> => s.to_owned();

This doesn't;

Phrase: String =  <a:r"\""> <s:r#"[^"]*"#> <b:r"\""> => s.to_owned();

lalrpop escapes \" into \", which is not what I'd expect

nikomatsakis commented 8 years ago

Looks wrong, thanks!

nixpulvis commented 8 years ago

I'm a bit confused, both of these examples don't compile for me. The first fails because of an ambiguity in <s:r#"[^"]*"#> and the second doesn't accept the malformed string with unexpected token: "> <s:r#"

Fiedzia commented 8 years ago

Here is exact part from my code, it definitely compiles (using build step, and stable rust):

Phrase: String =  <s:r#""[^"]*""#> => s.chars().skip(1).take(s.len()-2).collect::<String>();
ahmedcharles commented 6 years ago

Lalrpop seems to have string literals and regex literals. String literals are like Rust normal string literals, which support escape sequences and regex literals are like Rust raw string literals, which don't support escape sequences.

There does seem to be a bug in that regexes without a # in them seem to behave differently than ones with #'s.

jimblandy commented 6 years ago

The handling of escapes in string literal tokens is broken as well. The following grammar accepts strings of two backslashes, not one:

grammar;

pub Expr: String = {
    <slash: "\\"> => slash.to_owned()
}

Internally, Tokenizer::string_literal parses that as Tok::StringLiteral(r#"\\"#) (in Rust syntax), which gets passed to lexer::re::parse_literal, which then tries to escape it again, resulting in the regular expression r#"\\\\"#.

It seems to me that the payload of Tok::StringLiteral should be the represented text (i.e., with escape sequences replaced), not the representation of the text. Doing the same for Tok::RegexLiteral would fix the case reported in the top comment.

[Edit:]

It's clear from the rest of the definition of Tok that variants are supposed to be borrowing slices from the input string, not owning their own contents. This means that escape sequences need to be expanded at a later phase.

jimblandy commented 6 years ago

Leaving string literals aside and considering only regular expressions:

As far as I can tell, LALRPOP is behaving as designed, and @nikomatsakis was hasty when he agreed this was a bug.

The conversation reflects a bit of confusion about LALRPOP's regular expression literals. Let me try to put it down:

A regular expression literal is written rHASHES"EXPR"HASHES, where HASHES is some string of zero or more # characters (the same string on both ends), and EXPR does not contain the string "HASHES. The regular expression denoted is exactly EXPR; backslashes and double-quote characters have no special meaning to the tokenizer here.

According to that definition, the regular expressions in this issue are as follows:

As far as I can tell, LALRPOP does behave this way.

@ahmedcharles says:

There does seem to be a bug in that regexes without a # in them seem to behave differently than ones with #'s.

but I am not able to see any difference.

I think the source of confusion may be that LALRPOP's syntax is generally quite close to Rust's, but whereas Rust offers the option of escaped or raw string literals, LALRPOP strings are always escaped, and LALRPOP regexps are always raw. (The hijacking of the r prefix was perhaps not an ideal choice.)

So I think this issue should be closed.

String literals, on the other hand, are definitely broken. I'll file a separate issue for those, so as not to usurp @Fiedzia's issue.