ocaml-community / sedlex

An OCaml lexer generator for Unicode
MIT License
235 stars 43 forks source link

Supporting buffers that can ignore mark requests #81

Open ifazk opened 5 years ago

ifazk commented 5 years ago

It can sometimes be useful for buffers ignore a mark action. For example, a lexbuffer can decide that it would be inappropriate to mark at the current position because it is not a grapheme cluster boundary.

Is this a use case that the maintainers of this project would be willing to support?

ifazk commented 5 years ago

On a more technical note, as far as I understand, we would only need to make a small change to call_state in ppx_sedlex.ml. There, instead of returning a best final state, we mark the best final state and immediately backtrack. I'm trying this out in a personal fork, but I need to do a bit more testing before I can send a pull request.

pmetzger commented 5 years ago

Let us know. That said, it seems peculiar that a regular expression would mark somewhere that you're not allowed to mark. Marking is below the level of the UI after all.

ifazk commented 5 years ago

Here's an example of a situation where a regular expression would mark where you don't want it to mark. Suppose you have the regular expression leg and the utf8 encoded string for "leg̈" (in ocaml syntax that would be "leg\u{0308}"), a specially designed buffer can ignore the mark after the g because it knows that g is not at a character boundary.

pmetzger commented 5 years ago

I don't understand your example. Is the issue that you don't want to match leg when you have leg̈?

ifazk commented 5 years ago

Yes, that my issue. I don't want a match to be successful in the middle of a user-perceived character. Most people would consider to be a single character instead of two characters consisting of g and \u{0308}.

pmetzger commented 5 years ago

Okay, so that's not "supporting buffers that can ignore mark requests", that's noting a possible bug in the way the regexp engine works. Have you read the Unicode Consortium TR 18 document? I haven't looked at it recently, but what does it claim the correct behavior is supposed to be?

pmetzger commented 5 years ago

(A quick scan of TR18 seems to indicate that what you want is an implementation of 2.2.1 Grapheme Cluster Mode but I'm not sure that's precisely what you're asking for.)

ifazk commented 5 years ago

What I'm precisely asking for: I want backtracking to be the only way to take a semantic action.

I want sedlex to always mark when it is in a final DFA state, and always backtrack when it can no longer transition to new DFA states. The only case where we don't currently do this is when we are at a final DFA state and there are no transitions to new DFA states. In this case, we directly take a semantic action, instead of marking and then backtracking (see the code that #83 changes).

Always using backtrack to perform semantic actions, allows custom Sedlexing buffers to control the positions at which semantic actions are allowed to take place. In the case where we are at a final DFA state and there are no more transitions out of this state, a custom Sedlexing buffer can decide whether a semantic action should take place at this position, and if not, it backtrack to the last position it would have allowed a semantic action to take place.

ifazk commented 5 years ago

My personal use case for this: I have a custom Sedlexing buffer which is aware of Grapheme Cluster boundaries, and only stores marked state at code point positions which are at boundaries. If we try to mark at a position which is not a boundary, it just ignores the Sedlexing.mark.

This does not give us Grapheme Cluster Mode. According to the last paragraph of 2.2.1, in Grapheme Cluster Mode, we should treat /(x\u{308})/ and /(x)(\u{308})/ differently, and /(x)(\u{308})/ should never match. I'm just using sedlex as the regex engine, and it makes no distinction between /(x\u{308})/ and /(x)(\u{308})/.


Supporting any of RL2.2 Extended Grapheme Clusters, RL2.3 Default Word Boundaries or Grapheme Cluster Mode is beyond the scope of sedlex. All of these require the regex engine itself to understand grapheme clusters, words or boundaries, and I don't think it is worth the engineering effort to teach these to sedlex.

I think always using backtracking for semantic actions is a middle ground where using custom Sedlexing buffers we can only allow semantic actions to happen at boundaries.

pmetzger commented 5 years ago

Okay, given that what you wan't isn't Grapheme Cluster Mode, I have to confess, and perhaps it's just that I'm thick, that I don't understand what it is that you want to do, even having read the above a few times. Perhaps you could give an example of what the resulting syntax expressing your improvement would be in a real sedlex specification?

ifazk commented 5 years ago

Thanks for being so patient with me! I'm going to try to explain myself again.

Suppose I have the following code:

match%sedlex buf with
| "let" -> Token.LET
| "=" -> Token.EQ
| eof -> Token.EOF
| _ -> raise (Token.error @@ Sedlexing.positions buf)

and the input stream "let\u{308}".

For the above, sedlex will not call mark or backtrack at t and instead just directly take a semantic action and return Token.LET.

Here, instead of directly taking a semantic action at t, I want sedlex to call mark and then backtrack and then take the appropriate semantic action according to backtrack.

In my custom buffer, if we try to call mark at the t, my custom buffer won't actually store anything, since t does not occur before a grapheme cluster boundary. This way, when sedlex calls backtrack, I raise an error instead of returning Token.LET.

If we had the input stream "let=" instead, since t occurs at a boundary, my custom buffer will store the state, and when backtrack is called, will return Token.LET.

Instead of having sedlex worry about boundaries, I'm trying to move the burden to buffers. Currently, while a custom buffer can control what mark and backtrack do, it cannot prevent semantic actions from happening at all non-boundary codepoints, because if there are no transitions out of a DFA state, sedlex directly takes a semantic action instead of going through the buffer first.