Closed Drup closed 6 years ago
Yes, this PPX would use a revisit too. My contribution will be scattered this week, but I should have more time after that. The current library could use a full RE parser, which hopefully will be usable for both tyre and the current Re.Pcre
based patterns by keeping the syntax close enough.
About composition for tyre, do you have an idea what kind of value x
should capture for a match of (?<x>T(?&int){2}:(?&float))
? For a classic RE, it would of course capture whole group as a string, but for tyre the question is whether it would not be more natural to ignore the literal parts and bind x
to an (int * int) * float
or int list * float
tuple.
Don't worry, in the future means "in the coming weeks" ;)
As for your example: Ignoring the literal parts is easy and so it'll definitely do one of those two types. The current version doesn't have bounded repetitions, precisely because it's not clear which version you want! int list
is easier to implement, but I suppose int * int
is more useful. I was considering providing a tuple module for these kind of things. Unfortunately you can't really implement tuple : int -> 'a t -> _ t
.
By the way, this is buggy in the current version of ppx_regexp
let test3 s =
(match%pcre s with
| {|(?<a>[0-9]*)|(?<b>[a-z]*)|} -> assert (a ^ b = s)
| s' -> assert (s = s'))
Tyre handles that by returning something of type [> `Left of int | `Right of string]
. The fact that you can give different names to the groups in the various branches is slightly bothersome.
Yep, I realized that while commenting on your earlier post. It's one reason I need a proper parser. For %pcre
it should just return an option, though. This semantic is not as nice as tyre, but I think it's a fairly faithful view of the built-in capabilities of re.
On a minor note, I think it would be good with a real either
(or sum
) type if we could guess what the stdlib might include in a future version.
Another trick question about the tyre variant: How should x
be bound in a nested aggregation like (((?<x>[^;\n]+);)*\n)*
? Should this be allowed at all? I think a sound interpretation is that x
represents a suitable projection operator mapped though the surrounding aggregations. On the other hand, this is already accessible be explicit mapping if the capture is done at the topmost level.
Another trick question about the tyre variant: How should x be bound in a nested aggregation like
(((?<x>[^;\n]+);)*\n)*?
# Tyre.(rep (rep (posix "[^;\n]+") <* str";") <* str "\n") ;;
- : string Tyre.gen Tyre.gen Tyre.t
a <*b
matches a
, then b
, returns the result of the left and drops the result from the right.
This is barely tricky, it's what I build Tyre for :D
Yes, but the trick part was about how to intepret the variable binding under the two star operators. Note also that there could have been a second variable binding parallel to (?<x>...)
or at a different level. If we only allow capturing the full result into a single variable, then the correspondence with tyre route is straight forward, but I can see tree levels of support for binding variables:
In the first case, the variable name [corr.: does not] need to be inside the regexp source.
Initially, I would suggest the following restriction: There can not be nested groups. That should solve these kind of issues, since it means Regexs inside groups are regular PCREs that can be given to Re
. Constructs that are not inside a group are re-interpreted as Tyre operators.
Later, we could consider relaxing this restriction with a rather easy-to-add Tyre
function with_str : 'a t -> ('a * string) t
.
I added you as a collaborator, so that we can commit directly. Let's work on a separate branch tyre
, as I link to keep master
in good shape. I'll commit the non-tyre parts to the same branch, as they will likely be needed. Esp. I think we can share the same parse tree and parser, though I'll have some redundancy to support both the current and new flavours, maybe including positions to allow fine-grained error reporting.
I'm not quite sure what you had in mind yet for the tyre syntax, including the use of with_str
. My though after your last comment would be to allow nested groups, which would produce tuples, lists, and options for tyre, but with only top-level named capture. E.g. (?<x>a*(b(c+)d+(?&int)?)*))f
would bind x
to (string * int option) list
because:
a*
, b
, and f
are not directly grouped, so they will not be captured, even though the first is variable.c+
is captured and will produce the string
part of the tuple.(?&int)
is captured as the second part of the tuple, the ?
makes it optional.*
turns the tuples into a list.There will be some details to resolve if we go with this plan.
In addition to function%tyre
and match%tyre
, we might want to support constructing new converting regular expressions with something like
let%tyre qname = fun {|(?<x0>\w+)(?<xs>(\.(\w+))+)|} -> x0 :: xs
Notes on this:
%tyre
on the let
to indicate that it is meant to be referred to as (?&qname)
in another regular expression, reserving function%tyre
and match%tyre
for ready-to-call functions and expressions.(\.(w+))
-group could have used non-capturing syntax, since the capture is implied by the inner group. We could enforce either-or here.function
, but that will require some elaborate route composition and matching up groups, so we can check later if it's feasible.Are we on the same page? Or what are your thoughts?
hmmm, your semantics for capturing might work. I'm always highly annoyed by the fact that parens-for-priority and parens-for-groups are overloaded. I was going to take the opposite stance, which is that only named groups are captured, including when nested. Other parens are only used for priority, not capturing.
More concretely, your example would be written a*(b(?<foo>c+)d+(?<bar>(?&int)?))*f
and would return an element of type < foo : string ; bar : int option> list
. Either that, or some form of hmap. This allows to nest things easily. We could consider having toplevel names bound directly to variables instead of using something structural. I'm not completely sure yet.
As for defining new regular expressions. Just [%tyre {|....|}]
will do. With let%tyre foo = {|...|}
as a shortcut. foo
will be a real identifier of type Tyre.t
anyway.
I agree! It's better if the short-syntax grouping is non-capturing. I was going to suggest (?<>...)
or even (+...)
for positional capture, but I like the idea of using objects for naming, as well. I can see at least Tyre.conv
should allow support both for objects and tuples by conversion from pairs, so we can support both (eventually).
Another thing which bothered me after my suggestion was the fact that capturing grouping behaves different whether it contains nested captures. My suggestion was that, if not, then the whole content extracts the matched string as usual, otherwise it extract the properly typed tuple of subgroups. That was the point of the second last bullet, if not expressed clearly. I think this will be resolved by using the now non-capturing default groups for aggregation.
I think this will be fairly tidy when we impose a few sensible constraints. In particular a group used for aggregation should contain either named-only or unnamed-only immediate subgroups. There are thus a few uses of such groups:
(re1(?<v1>...)...reN(?<vN>)reNp1)
creates an object.(re1(+...)...reN(+...)re)
creates an N-tuple.(re1(+...)re2)
is just a special case of the above since the 1-tuple in ocaml is the value itself. This allows delimiting the operand of ?
, +
, *
, and {...}
where needed.(?&...)
and is allowed in the same contexts as (+...)
above, and the latter is just a (?&string)
from the type point of view.The (+...)
syntax is still tentative. I am not aware of any conflicts, but we could use (?X...)
for some X
to spend less of the available operator namespace.
About [%tyre {|...|}]
: Yes, this would suffice for what can be expressed by predefined conversions. I tried to also support Tyre.conv
but realise that it would require two callbacks, and thus a different and non-obvious syntax.
I think the parse tree might look something like
type t =
| Re of Re.t
| Seq of t list
| Alt of t list
| Repeat of int * int option * t
| Capture of t
| Capture_as of string * t
| Case_sense of t
| Case_blind of t
In case we want to distinguish ?
from {0,1}
, we'll also add an Opt
constructor. I started writing a parser; it should be easy to adjust the details later.
I would remove the Re
case (we will detect when it's appropriate to lower to Re
) and you need Call of string loc
Also, you definitely need an Opt
constructor, the semantics of Opt and repetition is slightly different regarding lazyness.
The Re
case is only used to avoid representing and parsing various non-recursive constructs including character sets including character classes, assertions including a list of escape sequences, and also for string literals. I think it's a convenient simplification to start with at least, we can expand the parse tree if we need to.
I'll add Call
and Opt
.
I pushed a preliminary version of the parser to the tyre branch.
I had a quick look. it looks like a decent start but there are still lot's of things missing and having something mildly spec-compliant is going to be quite some work. I must say I would just have started off from the parser in Re.Perl
and tweaked that until it outputs our AST instead. ;)
What do you want to do for alternatives?
If we have one named grouped per alternative, for instance ..(?<foo>..)..|..(?<bar>..)..|..(?<baz>..)..
, we can return [ `foo of ... | `bar of ... | `baz of ...]
. I'm not sure what to do about the unnamed case, or if we have more than one named group.
I link the idea of one named group per branch for alternatives. We should allow capitalized names here, maybe even mandate them to distinguish them from objects. In particular there will be an ambiguity for a single name, e.g. (a(?<x>...)b)+
. For unnamed alternatives, I propose [`Inj1 of t1 | ... | `InjN of tN]
.
I noticed I left out alternatives in the parser. Will add that in the evening. I don't intent to implement the full PCRE specs, which Re.Perl
does not either, but yes, keeping parts implemented close to the spec would be good to avoid surprising users. A few deviations was intended though, like rejecting some operator characters occurring as literal characters. I think those should be escaped. As long as there is an error, I don't see this as an issue. On the other hand, accepting same syntax with different semantics than PCRE would be bad.
Let's just say that on the subset of feature that we support, I would like the parser to be completely spec-compliant with pcre. Of course, we will not support all features (backreferences ...) and we change the semantics of grouping, but the rest should really be the same. Ideally, we would recognize things we don't handle and explicitely raise an error saying we don't handle it.
I just pushed a bunch of spagetti that turn Regexp.t
into Tyre code. It's completely untested, but the code for sequences and the interaction with named and unnamed groups is there.
I'm not sure what to do with the Re.t
. I think it's best to either actually parse it ourself and emit Re
code or to just get a string and emit Re.Perl.re "..."
By the way, if we want mildly decent error messages and merlin support, we are going to need locations for everything. :p
Your approach looks good.
I would think you could use Tyre.regexp
cast to unit
for the Re
constructor, since it only contains strings, chars, assertions, and csets? But in any case I think it was wrong to convert it at that point, since Regexp.t
is meant as a source-level tree, so I'll make some adjustments. That may also simplify be helpful when testing the parser.
Yes, we'll need locations.
Casting to unit
would be problematic since, as far as I can see, the parser will emit things of the form Capture (Re x)
. In any case, if a regular expression is not captured, Tyre.prefix
/suffix
will make Tyre
completely ignore the associated results/convs.
I added some code that tries to simplify regexps as much as possible by collasping things into the Re
case. This should make things simpler down the line.
It must remain string
of course. I can see you handled in the last commit, should we still change to something like
diff --git a/regexp.mli b/regexp.mli
index a626a95..c22712f 100644
--- a/regexp.mli
+++ b/regexp.mli
@@ -14,8 +14,18 @@
* along with this library. If not, see <http://www.gnu.org/licenses/>.
*)
+type cset = string
+(** For the time being this is the raw PCRE code including delimiters. *)
+
+type assertion =
+ [ `Bos | `Eos ]
+
type t =
- | Re of Re.t
+ | Assert of assertion
+ | Escape of string (* Raw PCRE escape including '\'. *)
+ | Cset of cset
+ | Char of char
+ | Str of string
| Seq of t list
| Alt of t list
| Opt of t
or do you think it's okay as is? Or maybe just change Re of Re.t
to Re_perl of string
?
About locations, I was initially thinking of using just storing offsets from the start of the regexp, but maybe it's simpler to pass the locations to the parser and compute the final locations there.
To add locations, this should do:
type 'a loc = 'a Location.loc
type t = node loc
and node =
| Re of Re.t
| Seq of t list
| Alt of t list
| Opt of t
| Repeat of int loc * int option loc * t
| Capture of t
| Capture_as of string loc * t
| Call of Longident.t loc
The location of the Repeat
range can be inferred from the locations of the arguments.
Update: Proper locations should to be done in the parser, since we need to handle newlines. ~This will probably clutter the code quite a bit.~ I'm also not sure if Lexing.pos_cnum
should count Unicode codepoints or bytes. I guess the latter, since UTF-8 encoding of source code is not officially supported.
Yes, That's exactly what I expected for locations. And yeah, I would do the location computations in the parser. As for utf8 code point ... be consistent with OCaml. :p
For now, I would actually say Re of 'a
and the parser returns a string
. The transformation will probably turn it into a Parsetree.expression
.
So, I added the parameter. Currently, the parser returns a Re.t Regexp.t
. I'll let you turn that in to a string loc Regexp.t
(or whatever else you consider appropriate).
The Regexp.t
parameter is now string, and I also made an untested adjustment of the tyre module to return expressions. And there are locations, with minor adjustments from the above.
Had a quick look at your changes. It looks great. Thanks! I implemented alternatives and put the basic boilerplate to start some testing.
First bug report: ab*
parses as (ab)*
:)
Achievement unlocked:
Fatal error: exception File "typing/typecore.ml", line 2862, characters 6-12: Assertion failed
Well, it's starting to look like something.
# [%tyre "(+ab)|(+[foo])*(?&Tyre.int)"]
- : [ `Alt0 of string | `Alt1 of string Tyre.gen * int ] Tyre.t
# [%tyre "(?<name>ab)[a-z](?<id>(?&Tyre.int))*"]
- : < id : int Tyre.gen; name : string > Tyre.t
Nice, you got it hooked up to a real PPX.
Yes, I realise when you point it out that postfix operators will apply to full string items. And I've forgotten .
! I'm working on qcheck based testing for Regexp
.
I'm working on qcheck based testing for Regexp.
Nice, that would be really great!
I added syntax for routing and I improved the handling of named pattern. Pretty happy with the result.
# let thing = [%tyre {|(?<foo>(?&Tyre.pos_int))|bar(?<bar>[:alpha:]+)|}] ;;
val thing : [ `bar of string | `foo of int ] Tyre.t
# function%tyre "potato(?&thing)" as x -> Ok x | "nope" as s -> Error s;;
- : ([ `bar of string | `foo of int ], string) result Tyre.re
Proposition for the parser: Parse (?&foo:bar)
as (?<foo>(?&bar))
.
Seems the rudimentary parts are in place then! BTW, do you use #directory
/#require
, #ppx
, or have you found a way to get the syntax loaded directly with dune utop?
I second (?&foo:bar)
, and I'll also relax the identifier restriction so that your example can be typed [`Bar of string | `Foo of int]
.
I just do the old school way:
jbuilder build tests/main.exe
jbuilder utop . -- -ppx _build/default/tests/main.exe -dsource
the main
file will be useful for top level tests at some point as well.
I added (?&foo:bar)
, (?#comment)
and made some fixes and adjustments to the RE parser. I think it's fairly feature-complete now.
I also updated %pcre
to use the new parser, fixing capture under alternatives. I'll will leave %pcre
at that for now.
Should patterns be anchored? (Or left-anchored as in Python?)
Nice.
No don't anchor. First off, you can compose them, so you would need to choose where to anchor carefully. Furthermore, I think it's just not useful to anchor by default. People can anchor manually easily enough.
Okay, I think I agree, just wanted your opinion, since I think it would have been a natural choice for string-matching routes if not based on regular expressions.
The first version of Tyre actually had this behavior of anchoring by default. It was not a good idea. :)
Before the release we should have a test_ppx_regexp_tyre.ml
, similar (and better!) than the test_ppx_regexp_pcre.ml
, which gives us an opportunity to hand craft some corner cases. I'll be off to some paid work first.
What do you want to do, packaging wise ? Have ppx_regexp
with both pcre and tyre, or two packages, ppx_regexp
and ppx_tyre
(which would avoid the dependency to tyre for the normal pcre thing) ? Also, do you intend to keep both, or deprecate pcre for tyre ?
I'm not sure having both in the long run really make sense. Ideally, the tyre version would do everything the pcre version do. Of course, we still have to tweak the ergonomics of the tyre version to make sure that's the case. :)
Though question about packaging.
Your other question is simpler; I indent to keep the pcre variant. My own main use case for PPX based regular expressions has been to parse log files and similar quick & dirty data mining tasks. Typically the data is loosely structured and a simple RE suffice to extract a few fields. The speed is more important than the fine details. (Though next time there is a list involved, I can use tyre instead of String.split_on_char
-post-processing the match. In any case, the tyre dependency is acceptable.)
One might still image wanting to use REs for building simple parsers for packaged software, in which case minimising dependencies is preferable. But here, the tyre variant is probably the more likely choice anyway?
That is, I think both options are okay.
I presume we'll keep one repo in either case, to avoid independent maintenance of regexp.ml
and unit tests. Different opam packages would allow separate versioning and release, e.g. in case you want to de releases for ppx_tyre
.
That brings us to another question: You probably want to keep the write access to the repo? Should we try to get it into an org like ocaml-ppx?
Your other question is simpler; I indent to keep the pcre variant. [...] The speed is more important than the fine details.
Alright, but since you mention it I have to react: the cost added by tyre is often negligible compared to the cost of regex matching, especially for simple tuples and sums. I also made some experiments.
I presume we'll keep one repo in either case
Absolutely.
Different opam packages would allow separate versioning and release
It would also avoid breaking current compat and perhaps provide more flexibility in the future. It might also be advantageous from a purely advertising point of view. :p
You probably want to keep the write access to the repo? Should we try to get it into an org like ocaml-ppx?
That's up to you. I'm fine with everything staying under your name (as long as I have the commit bit ofc).
Let's make two packages then. Ideally, I would like to rename the other one to ppx_pcre
for consistency. I'll think about that, since one can't quite delete opam packages.
And I'll keep the repo under my name for now.
On second though, a different release schedule would mean having named version tags and separate archive names, which I think will require some hacking on release scripts. So, I feel like backing out of that right now, though a lightweight alternative would be to do common tar releases and update the either or both opam packages for each tarball release. That means skipping over versions, but saves unnecessary recompilation if only one of the flavours was changed.
About the ergonomics, a secondary motivation for %pcre
was to have it lift RE compilation out of the function for efficiency. That allows using match%pcre
deep within a function without worrying about excessive recompilation. I made a separate ppx_static
just before writing this PPX, but didn't publish it since RE compilation was the main use-case.
I believe dune-release
should handle desynchronized releases just fine. I would advise you to keep the same package name (and keep compat in general). The functionality for the non-tyre part doesn't change so fundamentally.
About the ergonomics
That's why I don't think I'll provide match%tyre
, only function%tyre
, so that people think about where they'll place it. In any case, in re
, the compilation doesn't do all that much. Determinization is done online lazily. The real important thing is to re-use the same compiled regex, which we do.
Yes, I came to the conclusion of keeping the name too based on the question where the PPXes would end up if more Re.*
flavours were supported.
Are we okay with releasing ppx_tyre
in its present form? Maybe schedule a release for Monday?
If you want to do the ppx_tyre
releases, it would be worth investigating the proper way of doing desynchronized releases. (I use topkg-care now, so updating the scripts to use dune-release should be straight forward, but I'm just not sure what the best practise is. E.g. would one use tags like ppx_tyre/v0.4.0
or similar and how would it affect dune-release and GitHub releases.)
I would prefer to add tests (and maybe to try to improve error messages/locations) before release. Give a few more days.
Let's stick to synchronize releases for now, since it'll be simpler. We'll see how it goes later! :)
I'm currently considering taking a second look at tyre, and that involves writing a ppx for it.
ppx_regexp
does, as far as I'm concerned, many things right and I agree with the goal to make it accessible. So I'm considering simply extending it. That involves two things:_ list
for groups under repetitions and`Left of _ | `Right of _
for groups under alternatives. See if the result is sufficiently usable, and maybe modify Tyre so that it is.Now, I recently got a decent idea for 2: Reuse the syntax for recursion into named groups from PCRE2.
(?&potato)
would simply use the (typed) regular expression namedpotato
. This would also allow to have more primitive datatypes, for example(?&int)
.What do you think? Do you want to join forces? :)