Open german1608 opened 4 years ago
Yes, I would really like to have this functionality. As I recall it was non-trivial to implement it (especially if we want to do it without a performance overhead if you don't use the functionality), but it would be very useful to have it.
We could try to have separate wrappers for that, something like:
%wrapper "basic-group-captor"
%wrapper "posn-group-captor"
%wrapper "monad-group-captor"
%wrapper "monadUserState-group-captor"
Or adding another directive, like:
%capturegroups
I think the last approach is better than the first one.
@german1608: is the automata-theoretic implementation of groups worked out some where? I found only the question: https://stackoverflow.com/questions/28941425/capture-groups-using-dfa-based-linear-time-regular-expressions-possible , and the answer pointed out that groups introduce nondeterminism, e.g. matching
a(b)|(a)b
against ab
would give you non-deterministically a
or b
. This is maybe a can of worms we do not want to open.
Without drilling up the automata generation of Alex, one could imagine the following light-weight approach:
%wrapper "basic-regex-groups"
tokens :-
\"(a)\" { \ r s -> Token (extractGroup r s) }
-- After reading, alex would return [Token "a"] for the same input
What Alex would do here, is:
r
that accepted the token s
, such that you can extract the group content from s
yourself. (A user-defined extractGroup
could utilize a third-party library after converting r
into the format required by that library.)r
would be provided as element of a type RExp
defined in the basic-regex-groups
wrapper, basically as an abstract syntax tree, see https://github.com/simonmar/alex/blob/6c4db72c3bb3419c84740e2e9ae112ed0abd572e/src/AbsSyn.hs#L198-L205.This might not be as comfortable as it could be, but a rather lightweight and generic extension of Alex. In terms of efficiency, there is of course some duplication of work, but just extracting groups in a regexp that is known to match should be cheaper that figuring out which token to extract from the input in the first place.
@german1608: is the automata-theoretic implementation of groups worked out some where?
There is TDFAs (Tagged Deterministic Finite Automata), e.g. https://github.com/haskell-hvr/regex-tdfa .
Currently, when alex executes a token action, the whole match is captured. For example:
Would be amazing that alex match syntax allows to capture groups, like other regexes:
Alex could also handle several groups in the same token action, so it could extract the groups as a list of strings or something alike.