mirage / ocaml-uri

RFC3986 URI parsing library for OCaml
Other
97 stars 57 forks source link

Rework the URI parser with Angstrom #142

Closed anmonteiro closed 3 years ago

anmonteiro commented 4 years ago

This diff reworks the current parser based on regexes in favor of one that uses Angstrom.

It also removes the dependency on ocaml-re, moving the regular expressions to a new package uri-re.

The current state of this work aready passes the unit tests, but it might need to further work to get it into a mergeable state. I'm opening a PR to get any and all feedback. Thanks in advance!

This happens to fix https://github.com/mirage/ocaml-uri/issues/21.

dinosaure commented 4 years ago

Interesting work. I would like to now that you translated already available regexp to angstorm parser or you took what is described by the RFC as is?

anmonteiro commented 4 years ago

@dinosaure This is mostly based on what the previous implementation did. I was reading the spec along with the work and noticed that we don't follow it too closely in some places.

My rationale for preserving the same assumptions was that it's already a fairly big change, and I didn't want to introduce more noise than necessary.

I think this sets up a good base to iterate on the parts that need to follow the spec more cloesly. What do you think?

dinosaure commented 4 years ago

I think this sets up a good base to iterate on the parts that need to follow the spec more cloesly. What do you think?

I agree with you but ocaml-uri is widely used by many people. So we should have something to test your PR with others packages and see what happens first. Just to avoid any large break :+1: .

avsm commented 4 years ago

Thanks for the excellent work on this PR! I'm just getting back from vacation and can review it more over the next week.

I'm in favour of merging and releasing an Angstrom version that is as close to the behaviour of uri as currently stands, and then improving the specification compat in subsequent releases. How about setting up an AFL fuzz test to check for behavioural differences between uri-re and uri to spot any places where it differs?

avsm commented 4 years ago

I've pushed avsm/ocaml-uri@4bf0535dea13ed98d92d6dd8fb8402d50bbbaae0 over to https://github.com/avsm/ocaml-uri/tree/angstrom-parser that implements the differential fuzzer. So far, it's only found one difference: the handling of \n in inputs. The old implementation would terminate the parse upon hitting a newline, but the new one will urlencode the \n and continue consuming input. We could make it terminate in the new implementation to maintain existing behaviour, and subsequently allow them in a later release.

anmonteiro commented 4 years ago

This is great! I can add you to my fork if you wanna push there. I can look into fixing that difference soon too.

avsm commented 4 years ago

pushed :)

avsm commented 4 years ago

Any further thoughts on that difference Antonio? It would be good to get this into a release

anmonteiro commented 4 years ago

Thanks for the ping -- I had forgotten about this but I'm still interested in it. Give me a couple days and I'll look into this.

avsm commented 4 years ago

Much appreciated @anmonteiro! No rush of course :-)

anmonteiro commented 4 years ago

@avsm Just pushed a commit that fixes the differences in newline handling between the old (regex) and new (Angstrom) parsers. Also added a compatibility test.

hcarty commented 4 years ago

Thank you for working on this @anmonteiro!

dinosaure commented 4 years ago

Let's revive this PR.