haskell / alex

A lexical analyser generator for Haskell
https://hackage.haskell.org/package/alex
BSD 3-Clause "New" or "Revised" License
297 stars 82 forks source link

[ #71 ] warn about nullable regexs in the absence of start codes #155

Closed andreasabel closed 4 years ago

andreasabel commented 4 years ago

Changed alex_scan_tkn to return AlexNone instead of AlexLastAcc or AlexLastSkip if length of newly processed input is still 0.

This fixes the problem that Alex would produce an infinite sequence of tokens of length 0 when the DFA was stuck in a state accepting a nullable token.

Now it will either produce EOF (if at end of input) or a lexer error (if not at the end of input). This is the intuitively expected behavior.

andreasabel commented 4 years ago

This fix requires the length to be computed correctly, see #119 (PR #156).

simonmar commented 4 years ago

If you define a token that can have length zero, returning an infinite sequence of those at the end of the input is correct, no? Shouldn't we just emit a warning and do what the user asked for?

andreasabel commented 4 years ago

Since Alex rejected nullable tokens before by simply crashing, this is the first time that nullable tokens are considered, and we are free to define their semantics. (Also, there is no need to rush the issue.)

My proposal is that all the tokens Alex ever produces will match at least 1 character of the input. I think this the intuitive expectation simply because the alternative, returning tokens that do not advance the input is useless. Unlike in parsers, where some non-terminals match the empty input, there is no way to request from the lexer "I would need this null token now, can you deliver it?" The lexer works in an autonomous way, giving the token that matches the longest piece of input; thus, it will only return a null token if nothing else is possible, ie, if it is stuck. And then, infinitely many null tokens. I cannot imagine a situation where this is useful behavior, can you?

On the contrary, allowing Alex to return null tokens destroys the following properties:

  1. The length of the output is bounded by the length of the input.
  2. This holds also for any prefix of the input (provided output is produced).
  3. The lexer terminates on finite input.

Additionally to these conceptual arguments, one could make some field studies of what other lexers produce in this situation. I'd be surprised if any lexer out there produces an infinite stream of nullable tokens, and my initial experiments using BNFC generated parsers do not suggest so. The rule (ocamllex, flex, JLex) seems to give parse errors, ANTLR is the outlier with delivering an imho incorrect positive result (a token containing the string ""). But to be absolutely sure, I would have to tighten my tests to involve lexers only.

simonmar commented 4 years ago

I'm fine with requiring all tokens to match at least one byte/character, but then we should make it a static error for there to be nullable tokens, rather than just ignoring them.

andreasabel commented 4 years ago

we should make it a static error for there to be nullable tokens, rather than just ignoring them.

For consideration, the non-nullable version of

  a*b*c*d*

is

  a+b*c*d* | a*b+c*d* | a*b*c+d* | a*b*c*d+

which is a linear blowup.

Insisting on non-nullability of the RE rather than just to actual tokens produced that match the RE could place an unnecessary burden on the user. I observed ANTLR warning about nullable REs, I haven't observed any lexer generator giving an outright error.

simonmar commented 4 years ago

That makes sense. Can we give a warning then?

andreasabel commented 4 years ago

Yes, I think giving warnings is the appropriate thing.
I am currently a bit laden, I'll look into this when I cleared some things up.

andreasabel commented 4 years ago

[ #71 ] Emit warnings for nullable tokens.

The parser for .x files will now generate warnings for rules whose regular expression is nullable, i.e., matches the empty input.

Alex does not emit tokens that consume no input, thus, a warning is appropriate.

Warnings are kept as a stack in the state of the parse monad, and printed upon successful completion of the parser in Main.

Example (excerpt of issue_71.x):

  $whitespace  = [\ \n\t]
  @whitespaces = $whitespace*

  :-

  @whitespaces { \ _ _ -> Whitespaces }
               ^ Warning here!

Warning: issue_71.x:24:14: Regular expression [..]* matches the empty string, but will be interpreted as not matching the empty string.

Since the parser does not generate abstract syntax with position information, it is hard to give the exact warning location, i.e., the location of the offending regular expression. To keep the changes minimal, we record the location of the token after the regular expression, i.e., the location where the code part begins. This approximate location should be good enough to jump to the regex in question.

Another problem is that the exact text of the regular expression is not printed in the warning, only what Show RExp gives us. This could be fixed if we had the exact location information of the regular expression; we could then cut the regular expression out of the input string. Alternatively, the parser could be modified to return a concrete syntax for the regular expression from which its original text could be recovered. The abstract reg.ex. would then be built from the concrete one in a second step.

At this point, I abstain from such invasive changes to Alex for the sake of improving this rare warning.

Bikeshedding on the warning text welcome:

Regular expression [..]* matches the empty string, but will be interpreted as not matching the empty string.

andreasabel commented 4 years ago

I just saw the empty token in Alex' own lexer: https://github.com/simonmar/alex/blob/6c4db72c3bb3419c84740e2e9ae112ed0abd572e/src/Scan.x#L73-L76 Does my PR break the logic implemented here? (Having a bad feeling about it, since I had some spooky boot-strapping failures for Alex now and then. Travis might actually not test if the just generated Alex can be used to bootstrap Alex; it bootstraps with an released Alex only.) I think skip needs to be replaced by a null action that simply does nothing.

UPDATE: I think I panicked for no reason. Ignore my comment. I did not understand what skip is doing.

andreasabel commented 4 years ago

Please do not merge yet.

First, I want to make sure travis also tests bootstrapping with the just generated Alex.

simonmar commented 4 years ago

So what happens with the afterstartcodes token in Alex now? Does it never match the empty stream? Wouldn't that break the logic?

andreasabel commented 4 years ago

So what happens with the afterstartcodes token in Alex now? Does it never match the empty stream? Wouldn't that break the logic?

Yeah, that is what I am worried about, but I haven't reached full understanding yet.

andreasabel commented 4 years ago

It seems that my reasoning about nullable tokens is incorrect once we have startcodes that change upon receiving an empty token. Since this changes the state of the lexer, producing one empty token does not mean that from then on, only empty tokens will be produced.

Thus, it is also questionable whether a warning should be emitted. Maybe a warning could be emitted unless startcodes are involved? (However, the motivation to act any further on issue #71 has diminished with the observation that empty tokens do serve a role in the lexer.)

The question is also what the user should do about the warning. It could be ignored, but some like warning-freeness. Then, how to get rid of the warning, e.g. in case one uses a regex like

  a*b*c*d*

The most convenient way would be a regex-operator that removes the empty word. I have seen such an operator used in theory, but in practice, it does not seem standard. Well, maybe the user that likes warning-freeness has to bite the bullet and manually unfold to the lengthy

a+b*c*d* | a*b+c*d* | a*b*c+d* | a*b*c*d+

So, the conclusion of my musings: What could be salvaged from this PR?

simonmar commented 4 years ago

I imagine the warning would only make sense if we can prove that the lexer will produce a continuous stream of empty tokens at some point, i.e. it won't terminate. It's not clear to me exactly what conditions are needed for this though.

andreasabel commented 4 years ago

Mechanism to issue a warning on nullable tokens if no startcodes are used.

This is what is implemented now. The warning is:

Warning: issue_71.x:24:14: Regular expression [..]* matches the empty string.

I imagine the warning would only make sense if we can prove that the lexer will produce a continuous stream of empty tokens at some point, i.e. it won't terminate.

Indeed.

It's not clear to me exactly what conditions are needed for this though.

If the lexer does not use start codes, it state does not change when emitting a null token, since it does not advance the input. Thus, it will make the same decision and again emit a null token, and again and again...

simonmar commented 4 years ago

Thanks!