haskell / alex

A lexical analyser generator for Haskell
https://hackage.haskell.org/package/alex
BSD 3-Clause "New" or "Revised" License
297 stars 82 forks source link

Feature: allow the user to capture groups on regex #148

Open german1608 opened 4 years ago

german1608 commented 4 years ago

Currently, when alex executes a token action, the whole match is captured. For example:

%wrapper "basic" -- the behaviour is similar on the other wrappers
tokens :-
  \"a\"  { \s -> Token a }

{
data Token = Token String
}
-- After reading, alex would return [Token "\"a\""] for input = "\"a\""

Would be amazing that alex match syntax allows to capture groups, like other regexes:

%wrapper "basic" -- the behaviour is similar on the other wrappers
tokens :-
  \"(a)\"  { \s -> Token s }
-- After reading, alex would return [Token "a"] for the same input

Alex could also handle several groups in the same token action, so it could extract the groups as a list of strings or something alike.

simonmar commented 4 years ago

Yes, I would really like to have this functionality. As I recall it was non-trivial to implement it (especially if we want to do it without a performance overhead if you don't use the functionality), but it would be very useful to have it.

german1608 commented 4 years ago

We could try to have separate wrappers for that, something like:

%wrapper "basic-group-captor"
%wrapper "posn-group-captor"
%wrapper "monad-group-captor"
%wrapper "monadUserState-group-captor"

Or adding another directive, like:

%capturegroups

I think the last approach is better than the first one.

andreasabel commented 4 years ago

@german1608: is the automata-theoretic implementation of groups worked out some where? I found only the question: https://stackoverflow.com/questions/28941425/capture-groups-using-dfa-based-linear-time-regular-expressions-possible , and the answer pointed out that groups introduce nondeterminism, e.g. matching

a(b)|(a)b

against ab would give you non-deterministically a or b. This is maybe a can of worms we do not want to open.

Without drilling up the automata generation of Alex, one could imagine the following light-weight approach:

%wrapper "basic-regex-groups" 
tokens :-
  \"(a)\"  { \ r s -> Token (extractGroup r s) }
-- After reading, alex would return [Token "a"] for the same input

What Alex would do here, is:

This might not be as comfortable as it could be, but a rather lightweight and generic extension of Alex. In terms of efficiency, there is of course some duplication of work, but just extracting groups in a regexp that is known to match should be cheaper that figuring out which token to extract from the input in the first place.

andreasabel commented 1 year ago

@german1608: is the automata-theoretic implementation of groups worked out some where?

There is TDFAs (Tagged Deterministic Finite Automata), e.g. https://github.com/haskell-hvr/regex-tdfa .