paurkedal / ppx_regexp

Matching Regular Expressions with OCaml Patterns
GNU Lesser General Public License v3.0
57 stars 3 forks source link

Working with tyre #2

Closed Drup closed 6 years ago

Drup commented 6 years ago

I'm currently considering taking a second look at tyre, and that involves writing a ppx for it. ppx_regexp does, as far as I'm concerned, many things right and I agree with the goal to make it accessible. So I'm considering simply extending it. That involves two things:

  1. Modifying the typing rules to return _ list for groups under repetitions and `Left of _ | `Right of _ for groups under alternatives. See if the result is sufficiently usable, and maybe modify Tyre so that it is.
  2. Figuring out composition.

Now, I recently got a decent idea for 2: Reuse the syntax for recursion into named groups from PCRE2. (?&potato) would simply use the (typed) regular expression named potato. This would also allow to have more primitive datatypes, for example (?&int).

What do you think? Do you want to join forces? :)

paurkedal commented 6 years ago

Yes, this PPX would use a revisit too. My contribution will be scattered this week, but I should have more time after that. The current library could use a full RE parser, which hopefully will be usable for both tyre and the current Re.Pcre based patterns by keeping the syntax close enough.

About composition for tyre, do you have an idea what kind of value x should capture for a match of (?<x>T(?&int){2}:(?&float))? For a classic RE, it would of course capture whole group as a string, but for tyre the question is whether it would not be more natural to ignore the literal parts and bind x to an (int * int) * float or int list * float tuple.

Drup commented 6 years ago

Don't worry, in the future means "in the coming weeks" ;)

As for your example: Ignoring the literal parts is easy and so it'll definitely do one of those two types. The current version doesn't have bounded repetitions, precisely because it's not clear which version you want! int list is easier to implement, but I suppose int * int is more useful. I was considering providing a tuple module for these kind of things. Unfortunately you can't really implement tuple : int -> 'a t -> _ t.

Drup commented 6 years ago

By the way, this is buggy in the current version of ppx_regexp

let test3 s =
  (match%pcre s with
   | {|(?<a>[0-9]*)|(?<b>[a-z]*)|} -> assert (a ^ b = s)
   | s' -> assert (s = s'))

Tyre handles that by returning something of type [> `Left of int | `Right of string]. The fact that you can give different names to the groups in the various branches is slightly bothersome.

paurkedal commented 6 years ago

Yep, I realized that while commenting on your earlier post. It's one reason I need a proper parser. For %pcre it should just return an option, though. This semantic is not as nice as tyre, but I think it's a fairly faithful view of the built-in capabilities of re.

On a minor note, I think it would be good with a real either (or sum) type if we could guess what the stdlib might include in a future version.

Another trick question about the tyre variant: How should x be bound in a nested aggregation like (((?<x>[^;\n]+);)*\n)*? Should this be allowed at all? I think a sound interpretation is that x represents a suitable projection operator mapped though the surrounding aggregations. On the other hand, this is already accessible be explicit mapping if the capture is done at the topmost level.

Drup commented 6 years ago

Another trick question about the tyre variant: How should x be bound in a nested aggregation like (((?<x>[^;\n]+);)*\n)*?

# Tyre.(rep (rep (posix "[^;\n]+") <* str";") <* str "\n") ;;
- : string Tyre.gen Tyre.gen Tyre.t

a <*b matches a, then b, returns the result of the left and drops the result from the right. This is barely tricky, it's what I build Tyre for :D

paurkedal commented 6 years ago

Yes, but the trick part was about how to intepret the variable binding under the two star operators. Note also that there could have been a second variable binding parallel to (?<x>...) or at a different level. If we only allow capturing the full result into a single variable, then the correspondence with tyre route is straight forward, but I can see tree levels of support for binding variables:

In the first case, the variable name [corr.: does not] need to be inside the regexp source.

Drup commented 6 years ago

Initially, I would suggest the following restriction: There can not be nested groups. That should solve these kind of issues, since it means Regexs inside groups are regular PCREs that can be given to Re. Constructs that are not inside a group are re-interpreted as Tyre operators.

Later, we could consider relaxing this restriction with a rather easy-to-add Tyre function with_str : 'a t -> ('a * string) t.

paurkedal commented 6 years ago

I added you as a collaborator, so that we can commit directly. Let's work on a separate branch tyre, as I link to keep master in good shape. I'll commit the non-tyre parts to the same branch, as they will likely be needed. Esp. I think we can share the same parse tree and parser, though I'll have some redundancy to support both the current and new flavours, maybe including positions to allow fine-grained error reporting.

I'm not quite sure what you had in mind yet for the tyre syntax, including the use of with_str. My though after your last comment would be to allow nested groups, which would produce tuples, lists, and options for tyre, but with only top-level named capture. E.g. (?<x>a*(b(c+)d+(?&int)?)*))f would bind x to (string * int option) list because:

There will be some details to resolve if we go with this plan.

In addition to function%tyre and match%tyre, we might want to support constructing new converting regular expressions with something like

let%tyre qname = fun {|(?<x0>\w+)(?<xs>(\.(\w+))+)|} -> x0 :: xs

Notes on this:

Are we on the same page? Or what are your thoughts?

Drup commented 6 years ago

hmmm, your semantics for capturing might work. I'm always highly annoyed by the fact that parens-for-priority and parens-for-groups are overloaded. I was going to take the opposite stance, which is that only named groups are captured, including when nested. Other parens are only used for priority, not capturing.

More concretely, your example would be written a*(b(?<foo>c+)d+(?<bar>(?&int)?))*f and would return an element of type < foo : string ; bar : int option> list. Either that, or some form of hmap. This allows to nest things easily. We could consider having toplevel names bound directly to variables instead of using something structural. I'm not completely sure yet.

As for defining new regular expressions. Just [%tyre {|....|}] will do. With let%tyre foo = {|...|} as a shortcut. foo will be a real identifier of type Tyre.t anyway.

paurkedal commented 6 years ago

I agree! It's better if the short-syntax grouping is non-capturing. I was going to suggest (?<>...) or even (+...) for positional capture, but I like the idea of using objects for naming, as well. I can see at least Tyre.conv should allow support both for objects and tuples by conversion from pairs, so we can support both (eventually).

paurkedal commented 6 years ago

Another thing which bothered me after my suggestion was the fact that capturing grouping behaves different whether it contains nested captures. My suggestion was that, if not, then the whole content extracts the matched string as usual, otherwise it extract the properly typed tuple of subgroups. That was the point of the second last bullet, if not expressed clearly. I think this will be resolved by using the now non-capturing default groups for aggregation.

I think this will be fairly tidy when we impose a few sensible constraints. In particular a group used for aggregation should contain either named-only or unnamed-only immediate subgroups. There are thus a few uses of such groups:

The (+...) syntax is still tentative. I am not aware of any conflicts, but we could use (?X...) for some X to spend less of the available operator namespace.

About [%tyre {|...|}]: Yes, this would suffice for what can be expressed by predefined conversions. I tried to also support Tyre.conv but realise that it would require two callbacks, and thus a different and non-obvious syntax.

paurkedal commented 6 years ago

I think the parse tree might look something like

type t =
  | Re of Re.t
  | Seq of t list
  | Alt of t list
  | Repeat of int * int option * t
  | Capture of t
  | Capture_as of string * t
  | Case_sense of t
  | Case_blind of t

In case we want to distinguish ? from {0,1}, we'll also add an Opt constructor. I started writing a parser; it should be easy to adjust the details later.

Drup commented 6 years ago

I would remove the Re case (we will detect when it's appropriate to lower to Re) and you need Call of string loc

Drup commented 6 years ago

Also, you definitely need an Opt constructor, the semantics of Opt and repetition is slightly different regarding lazyness.

paurkedal commented 6 years ago

The Re case is only used to avoid representing and parsing various non-recursive constructs including character sets including character classes, assertions including a list of escape sequences, and also for string literals. I think it's a convenient simplification to start with at least, we can expand the parse tree if we need to.

I'll add Call and Opt.

paurkedal commented 6 years ago

I pushed a preliminary version of the parser to the tyre branch.

Drup commented 6 years ago

I had a quick look. it looks like a decent start but there are still lot's of things missing and having something mildly spec-compliant is going to be quite some work. I must say I would just have started off from the parser in Re.Perl and tweaked that until it outputs our AST instead. ;)

Drup commented 6 years ago

What do you want to do for alternatives?

If we have one named grouped per alternative, for instance ..(?<foo>..)..|..(?<bar>..)..|..(?<baz>..).., we can return [ `foo of ... | `bar of ... | `baz of ...]. I'm not sure what to do about the unnamed case, or if we have more than one named group.

paurkedal commented 6 years ago

I link the idea of one named group per branch for alternatives. We should allow capitalized names here, maybe even mandate them to distinguish them from objects. In particular there will be an ambiguity for a single name, e.g. (a(?<x>...)b)+. For unnamed alternatives, I propose [`Inj1 of t1 | ... | `InjN of tN].

I noticed I left out alternatives in the parser. Will add that in the evening. I don't intent to implement the full PCRE specs, which Re.Perl does not either, but yes, keeping parts implemented close to the spec would be good to avoid surprising users. A few deviations was intended though, like rejecting some operator characters occurring as literal characters. I think those should be escaped. As long as there is an error, I don't see this as an issue. On the other hand, accepting same syntax with different semantics than PCRE would be bad.

Drup commented 6 years ago

Let's just say that on the subset of feature that we support, I would like the parser to be completely spec-compliant with pcre. Of course, we will not support all features (backreferences ...) and we change the semantics of grouping, but the rest should really be the same. Ideally, we would recognize things we don't handle and explicitely raise an error saying we don't handle it.

Drup commented 6 years ago

I just pushed a bunch of spagetti that turn Regexp.t into Tyre code. It's completely untested, but the code for sequences and the interaction with named and unnamed groups is there.

I'm not sure what to do with the Re.t. I think it's best to either actually parse it ourself and emit Re code or to just get a string and emit Re.Perl.re "..."

By the way, if we want mildly decent error messages and merlin support, we are going to need locations for everything. :p

paurkedal commented 6 years ago

Your approach looks good.

I would think you could use Tyre.regexp cast to unit for the Re constructor, since it only contains strings, chars, assertions, and csets? But in any case I think it was wrong to convert it at that point, since Regexp.t is meant as a source-level tree, so I'll make some adjustments. That may also simplify be helpful when testing the parser.

Yes, we'll need locations.

Drup commented 6 years ago

Casting to unit would be problematic since, as far as I can see, the parser will emit things of the form Capture (Re x). In any case, if a regular expression is not captured, Tyre.prefix/suffix will make Tyre completely ignore the associated results/convs.

I added some code that tries to simplify regexps as much as possible by collasping things into the Re case. This should make things simpler down the line.

paurkedal commented 6 years ago

It must remain string of course. I can see you handled in the last commit, should we still change to something like

diff --git a/regexp.mli b/regexp.mli
index a626a95..c22712f 100644
--- a/regexp.mli
+++ b/regexp.mli
@@ -14,8 +14,18 @@
  * along with this library.  If not, see <http://www.gnu.org/licenses/>.
  *)

+type cset = string
+(** For the time being this is the raw PCRE code including delimiters. *)
+
+type assertion =
+  [ `Bos | `Eos ]
+
 type t =
-  | Re of Re.t
+  | Assert of assertion
+  | Escape of string (* Raw PCRE escape including '\'. *)
+  | Cset of cset
+  | Char of char
+  | Str of string
   | Seq of t list
   | Alt of t list
   | Opt of t

or do you think it's okay as is? Or maybe just change Re of Re.t to Re_perl of string?

About locations, I was initially thinking of using just storing offsets from the start of the regexp, but maybe it's simpler to pass the locations to the parser and compute the final locations there.

paurkedal commented 6 years ago

To add locations, this should do:

type 'a loc = 'a Location.loc

type t = node loc
and node =
  | Re of Re.t
  | Seq of t list
  | Alt of t list
  | Opt of t
  | Repeat of int loc * int option loc * t
  | Capture of t
  | Capture_as of string loc * t
  | Call of Longident.t loc

The location of the Repeat range can be inferred from the locations of the arguments.

Update: Proper locations should to be done in the parser, since we need to handle newlines. ~This will probably clutter the code quite a bit.~ I'm also not sure if Lexing.pos_cnum should count Unicode codepoints or bytes. I guess the latter, since UTF-8 encoding of source code is not officially supported.

Drup commented 6 years ago

Yes, That's exactly what I expected for locations. And yeah, I would do the location computations in the parser. As for utf8 code point ... be consistent with OCaml. :p

For now, I would actually say Re of 'a and the parser returns a string. The transformation will probably turn it into a Parsetree.expression.

Drup commented 6 years ago

So, I added the parameter. Currently, the parser returns a Re.t Regexp.t. I'll let you turn that in to a string loc Regexp.t (or whatever else you consider appropriate).

paurkedal commented 6 years ago

The Regexp.t parameter is now string, and I also made an untested adjustment of the tyre module to return expressions. And there are locations, with minor adjustments from the above.

Drup commented 6 years ago

Had a quick look at your changes. It looks great. Thanks! I implemented alternatives and put the basic boilerplate to start some testing.

First bug report: ab* parses as (ab)* :)

Drup commented 6 years ago

Achievement unlocked:

Fatal error: exception File "typing/typecore.ml", line 2862, characters 6-12: Assertion failed
Drup commented 6 years ago

Well, it's starting to look like something.

# [%tyre "(+ab)|(+[foo])*(?&Tyre.int)"]
- : [ `Alt0 of string | `Alt1 of string Tyre.gen * int ] Tyre.t
# [%tyre "(?<name>ab)[a-z](?<id>(?&Tyre.int))*"]
- : < id : int Tyre.gen; name : string > Tyre.t
paurkedal commented 6 years ago

Nice, you got it hooked up to a real PPX.

Yes, I realise when you point it out that postfix operators will apply to full string items. And I've forgotten .! I'm working on qcheck based testing for Regexp.

Drup commented 6 years ago

I'm working on qcheck based testing for Regexp.

Nice, that would be really great!

Drup commented 6 years ago

I added syntax for routing and I improved the handling of named pattern. Pretty happy with the result.

# let thing = [%tyre {|(?<foo>(?&Tyre.pos_int))|bar(?<bar>[:alpha:]+)|}] ;;
val thing : [ `bar of string | `foo of int ] Tyre.t
# function%tyre "potato(?&thing)" as x -> Ok x | "nope" as s -> Error s;;
- : ([ `bar of string | `foo of int ], string) result Tyre.re
Drup commented 6 years ago

Proposition for the parser: Parse (?&foo:bar) as (?<foo>(?&bar)).

paurkedal commented 6 years ago

Seems the rudimentary parts are in place then! BTW, do you use #directory/#require, #ppx, or have you found a way to get the syntax loaded directly with dune utop?

I second (?&foo:bar), and I'll also relax the identifier restriction so that your example can be typed [`Bar of string | `Foo of int].

Drup commented 6 years ago

I just do the old school way:

jbuilder build tests/main.exe
jbuilder utop . -- -ppx _build/default/tests/main.exe -dsource

the main file will be useful for top level tests at some point as well.

paurkedal commented 6 years ago

I added (?&foo:bar), (?#comment) and made some fixes and adjustments to the RE parser. I think it's fairly feature-complete now.

I also updated %pcre to use the new parser, fixing capture under alternatives. I'll will leave %pcre at that for now.

Should patterns be anchored? (Or left-anchored as in Python?)

Drup commented 6 years ago

Nice.

No don't anchor. First off, you can compose them, so you would need to choose where to anchor carefully. Furthermore, I think it's just not useful to anchor by default. People can anchor manually easily enough.

paurkedal commented 6 years ago

Okay, I think I agree, just wanted your opinion, since I think it would have been a natural choice for string-matching routes if not based on regular expressions.

Drup commented 6 years ago

The first version of Tyre actually had this behavior of anchoring by default. It was not a good idea. :)

paurkedal commented 6 years ago

Before the release we should have a test_ppx_regexp_tyre.ml, similar (and better!) than the test_ppx_regexp_pcre.ml, which gives us an opportunity to hand craft some corner cases. I'll be off to some paid work first.

Drup commented 6 years ago

What do you want to do, packaging wise ? Have ppx_regexp with both pcre and tyre, or two packages, ppx_regexp and ppx_tyre (which would avoid the dependency to tyre for the normal pcre thing) ? Also, do you intend to keep both, or deprecate pcre for tyre ?

I'm not sure having both in the long run really make sense. Ideally, the tyre version would do everything the pcre version do. Of course, we still have to tweak the ergonomics of the tyre version to make sure that's the case. :)

paurkedal commented 6 years ago

Though question about packaging.

Your other question is simpler; I indent to keep the pcre variant. My own main use case for PPX based regular expressions has been to parse log files and similar quick & dirty data mining tasks. Typically the data is loosely structured and a simple RE suffice to extract a few fields. The speed is more important than the fine details. (Though next time there is a list involved, I can use tyre instead of String.split_on_char-post-processing the match. In any case, the tyre dependency is acceptable.)

One might still image wanting to use REs for building simple parsers for packaged software, in which case minimising dependencies is preferable. But here, the tyre variant is probably the more likely choice anyway?

That is, I think both options are okay.

I presume we'll keep one repo in either case, to avoid independent maintenance of regexp.ml and unit tests. Different opam packages would allow separate versioning and release, e.g. in case you want to de releases for ppx_tyre.

That brings us to another question: You probably want to keep the write access to the repo? Should we try to get it into an org like ocaml-ppx?

Drup commented 6 years ago

Your other question is simpler; I indent to keep the pcre variant. [...] The speed is more important than the fine details.

Alright, but since you mention it I have to react: the cost added by tyre is often negligible compared to the cost of regex matching, especially for simple tuples and sums. I also made some experiments.

I presume we'll keep one repo in either case

Absolutely.

Different opam packages would allow separate versioning and release

It would also avoid breaking current compat and perhaps provide more flexibility in the future. It might also be advantageous from a purely advertising point of view. :p

You probably want to keep the write access to the repo? Should we try to get it into an org like ocaml-ppx?

That's up to you. I'm fine with everything staying under your name (as long as I have the commit bit ofc).

paurkedal commented 6 years ago

Let's make two packages then. Ideally, I would like to rename the other one to ppx_pcre for consistency. I'll think about that, since one can't quite delete opam packages.

And I'll keep the repo under my name for now.

On second though, a different release schedule would mean having named version tags and separate archive names, which I think will require some hacking on release scripts. So, I feel like backing out of that right now, though a lightweight alternative would be to do common tar releases and update the either or both opam packages for each tarball release. That means skipping over versions, but saves unnecessary recompilation if only one of the flavours was changed.

paurkedal commented 6 years ago

About the ergonomics, a secondary motivation for %pcre was to have it lift RE compilation out of the function for efficiency. That allows using match%pcre deep within a function without worrying about excessive recompilation. I made a separate ppx_static just before writing this PPX, but didn't publish it since RE compilation was the main use-case.

Drup commented 6 years ago

I believe dune-release should handle desynchronized releases just fine. I would advise you to keep the same package name (and keep compat in general). The functionality for the non-tyre part doesn't change so fundamentally.

About the ergonomics

That's why I don't think I'll provide match%tyre, only function%tyre, so that people think about where they'll place it. In any case, in re, the compilation doesn't do all that much. Determinization is done online lazily. The real important thing is to re-use the same compiled regex, which we do.

paurkedal commented 6 years ago

Yes, I came to the conclusion of keeping the name too based on the question where the PPXes would end up if more Re.* flavours were supported.

Are we okay with releasing ppx_tyre in its present form? Maybe schedule a release for Monday?

If you want to do the ppx_tyre releases, it would be worth investigating the proper way of doing desynchronized releases. (I use topkg-care now, so updating the scripts to use dune-release should be straight forward, but I'm just not sure what the best practise is. E.g. would one use tags like ppx_tyre/v0.4.0 or similar and how would it affect dune-release and GitHub releases.)

Drup commented 6 years ago

I would prefer to add tests (and maybe to try to improve error messages/locations) before release. Give a few more days.

Let's stick to synchronize releases for now, since it'll be simpler. We'll see how it goes later! :)