softdevteam / grmtools

Rust grammar tool libraries and binaries
Other
507 stars 31 forks source link

Use the offset at EOF for LexErrorKind::PrematureEnd. #340

Closed ratmice closed 2 years ago

ratmice commented 2 years ago

Previously this used the character before EOF. While there is currently no way to produce this error such that it ends in a multibyte character. This makes the behavior match yacc.

ratmice commented 2 years ago

While there is currently no way to produce this error such that it ends in a multibyte character.

It looks like this might be a misstatement, just beginning to analyze the results of a fuzzing run where this appears to have failed and I believe it might be possible to end in a multibyte whitespace character presumably because of is_whitespace().

ratmice commented 2 years ago

@ltratt mind chiming in on whether we should be accepting multibyte whitespace characters in fbcc378 or not?

ltratt commented 2 years ago

I didn't know there was such a thing as multibyte whitespace! Since we do aim to support unicode properly, I think the right thing to do is to deal with multibyte whitespace properly, but I can be persuaded otherwise.

ratmice commented 2 years ago

Okay, I believe that the extent of that might be just changing the parse_ws function to use

if !c.is_whitespace() {
 break 
}

But I'll have to see what the callers think of that change.

ratmice commented 2 years ago

Contrary to the above, I believe it would be worthwhile to just follow https://www.unicode.org/reports/tr31/ which is intended for parsing/lexing of unicode and the Pattern_Whitespace property defined in, https://www.unicode.org/Public/13.0.0/ucd/PropList.txt

In addition to e.g. 'is_whitespace' we would need to identify the SpaceSeparator Zs class subset of Whitespace or Pattern_Whitespace.

For Pattern_Whitespace this SpaceSeparator looks to be just the typical ASCII space character. Pattern_Whitespace does contain some multibyte characters still, e.g. LineSeparator and ParagraphSeparator so we'd still encounter this issue with those.

There is an issue in unicode-xid, pointing to some rustc code which might be helpful, (Or I could try and solve it there, if you feel like having it as a dependency would be ok, it looks like we actually already depend on it for the serde feature). https://github.com/unicode-rs/unicode-xid/issues/1

Anyhow it seems to me, this would be the relevant unicode guidelines

ltratt commented 2 years ago

Unicode makes my head hurt, so I'm happy to go with what you think best (provided it doesn't break anything obvious, and we have some new tests to make sure we don't introduce new oddities)!

ratmice commented 2 years ago

So, I think one of the reasons it makes sense not to use is_whitespace() is things like Ogham Space Mark, which is to be read right to left (or bottom to top I would guess?), so it is kind of strange to allow it in something that isn't really right-to-left etc.

Anyhow, editor support for all these non-ASCII characters is pretty terrible, i'm really not sure how I would actually feel about people using them. But even if we decide to not use the \p{Pattern_white_Space} things, it at least should be fairly simple now to change them to whatever we decide on and handle it uniformly across the code base.

The regex's could be a bit nicer, but I don't think there is a \p{gc=Space_Separator} etc in rusts regex, I at least couldn't find anything... :shrug:

ltratt commented 2 years ago

I had to look up Ogham -- I had no idea it existed!

It feels to me like this PR probably does enough as-is to make the situation meaningfully better. If anyone does try to write lex/yacc files in the same manner as 4th-6th century Irish, we can revisit this. [Via the wonders of wikipedia I now know that I can find such writing in Cornwall...]

ltratt commented 2 years ago

I think this is ready for squashing? If so, please go ahead.

ratmice commented 2 years ago

Before I squash,

I would like to understand why in that last testcase I added, the first one gets InvalidStartStateName, while the second one gets UnknownDeclaration, I think it would be ideal if they both got InvalidStartStateName. but i'll have to look at it tomorrow.

ltratt commented 2 years ago

Please squash.

ratmice commented 2 years ago

Squashed.

ltratt commented 2 years ago

bors r+

bors[bot] commented 2 years ago

Build succeeded: