haskell / network-uri

URI manipulation facilities
Other
25 stars 33 forks source link

add smart constructors `mkURIAuth` and `mkURI` #35

Closed adnelson closed 6 years ago

adnelson commented 6 years ago

I ran into some weird behavior when constructing URI and URIAuth objects directly, and I realized it was because I wasn't including, for example, the trailing : in my uriScheme, or the preceding / on my path. The end result is a value of URI which is not in fact a valid URI (e.g. calling parseURI . show returns Nothing).

This pull request adds "smart constructors" for URI and URIAuth with the goal of making it easier to construct valid URIs. My personal preference would be to update the URI and URIAuth types to be more typesafe (e.g. make the uriPort field be an Int or even a Word), but that would break tons of extant Haskell code, so providing a smart constructor seems like a decent compromise. There could be better ways to do this; the mkURI function doesn't guarantee that the result is a parsable URI (e.g. the scheme could be an empty string), but for common uses it seems like it would make valid URI construction a little easier.

In addition I added a nullAuth constant, which seems handy.

adnelson commented 6 years ago

Looks like the tests failed unrelated to my changes, some very rare-to-trigger quickcheck case?

  propEscapeUnEscapeLoop: [Failed]
*** Failed! Falsifiable (after 81 tests and 6 shrinks): 
"\65534"
(used seed 4162229684706710984)
ezrakilty commented 6 years ago

I sympathize with the problem here—but I'm reluctant to add these new constructors which are so different in flavor from the existing ones.

I'm thinking of adding constructors and accessors which are exactly parallel to the existing ones, but which don't allow, and respectively don't emit, the nuisance delimiters.

What do you think about that idea?

ezrakilty commented 6 years ago

Thinking out loud: It would be nice if the new items could sit in a different module but use the same names; then one could import the dialect one wants to use and just use it without fussing around with the names. But this isn't easy because we want to produce data that's compatible with the existing data type, and we'd want the record-update notation to work, so we couldn't simply replace the accessors with new functions of the same names.

Given those problems, maybe it's better to give the new constructor/accessors distinct names...

ezrakilty commented 6 years ago

Another idea is to have a rectify : URI -> URI function on URIs. It would, for example, append ":" to the end of the scheme if it doesn't already have one. So you would construct URIs in the usual way, with the URI constructor, and then rectify them afterwards. (I choose the name "rectify" because "URI normalization" already has a meaning.)

I'm leaning toward this option, because when I thought about having an alternative constructor, I realized that you wouldn't be able to name the arguments. I played with one like

nURI :: String -> Maybe (String, String, String) -> String -> String -> String -> URI

But that's a little awkward.

adnelson commented 6 years ago

Hey, sorry I hadn't checked on this in a bit. I can totally see the objections you're raising; and yeah it's a bit tough.

My original idea was more or less the second thing you mentioned; I envisioned a Network.URI.Safe (or something) module which provided its own URI and URIAuth types that were more resistant to malformation, kind of following the "correct by construction" philosophy of Haskell. Although, it might be good to give them different names to reduce confusion. The reason I didn't go this route is because I didn't want to have to reimplement all of the parsing methods. Also, I feel like you'd still want some kind of smart constructors for these, so as to prevent things like empty string schemes etc. (The idea being, if they don't parse they shouldn't be able to be constructed).

The rectify method could work well. It might need to return an Either or Maybe in case the URI can't be made to be correct... I could throw something like that together, if that's desirable.

ezrakilty commented 6 years ago

Check out what I've got over here: https://github.com/haskell/network-uri/tree/rectify or the change https://github.com/haskell/network-uri/commit/3dca2ee5892f551738ec2d5a3a852190425cf3d4

It doesn't return Maybe, because all it does is conditionally add a character here or there. It doesn't guarantee a correct URI, but it does fill in the pesky separator characters. True, it would be nice to have correct-by-construction constructors, but since the set of URI schemes is open-ended, we probably couldn't force absolute correctness across all of them. I think there's a nice power-to-weight ratio of adding something to obviate putting in the separators manually.

One thing that might surprise you in the rectify change linked above is that I didn't have it prepend a slash for the path segment. The spec allows paths that don't start with a slash, and that fact is used by schemes like tag:, so I didn't want to be too heavy-handed with it. I think all the other separators that I'm inserting are effectively required in the URI spec itself (colon terminating the scheme, at-sign terminating the userinfo in the authority, colon initiating the port, question mark for the query and hash for the fragment—all applying only when the part is nonempty).

What do you think, does it wind up being usable?

adnelson commented 6 years ago

That's looking good! I think it's a definite improvement. I'm surprised that the path doesn't have to start with a slash, but on the other hand I don't have a super deep understanding of the spec. Is it still OK not to have the slash if there's a URIAuth section? Because otherwise it seems like it would be ambiguous (an auth of www.foo.com and no path would be the same as an auth of www.foo.co and a path of m)

adnelson commented 6 years ago

Another option is to have a show-type function which will render the URI to a string rectified. That might be preferable to something which manipulates the URI itself, as most likely the reason that a rectified URI is desirable in the first place is so that it can be rendered as a string. Could be something like

-- along with other variants which mirror the current printing functions
renderURI :: URI -> String

-- property, perhaps subject to some caveats
forall uri. parseURI (renderURI uri) == Just uri
ezrakilty commented 6 years ago

So, you asked a question about what happens if you have a uriRegName of say "www.foo.co" and a uriPath of "m", and that would map to a different URI than the one you wanted. There are lots of ways to construct invalid Network.URI objects, or problematic ones like that. I'd say the only proper ones have the property that parseURI . show == id if you know what I mean. And there is really nothing to ensure that you construct a proper one if you go through the URI constructor (as opposed to starting with a string and parseURI).

The URI spec defines a URI as "a sequence of characters" under a certain grammar and defines an interpretation of those sequences. I think the Network.URI package was designed to work in those terms, as evidenced by the fact that the Show instance goes straight back to the string form.

It's actually true that, if you have an authority component, then you have to have an "abempty" path, which means either a leading slash or no path at all. So we could consider having the rectify function take the existence of an authority component into consideration, and prepend a slash if there is an authority. (Note BTW that an authority component per se only exists if the URI has a // in it! Weird, right? And tag: URIs for example have no // and therefore end up being "all path" even though they usually have a domain name in them... I believe that's correct to the spec, and also how Network.URI works.)

To the idea about offering a sort of “showRectified” function… that’s interesting. But I like your original idea of “fixing” the URI objects themselves. Further processing on URI objects would reasonably assume that they are well-formed and the rectify function helps you create well-formed objects. The alternative is that you pass these URI objects around that don’t represent valid URIs, and that seems like a hazard.

I'm going to check in my rectify function but I don't plan to do a release right away. Perhaps we can gain a touch of experience with it before performing a release.

JulianLeviston commented 6 years ago

@ezrakilty Before I begin commenting, let me say this library is excellent. Thankyou for working on it!

I've been having similar issues to the above in trying to use URIs like the one contained in this variable assigned string theURIString = "active:/app/yeah/cool+operator@this+operand@ffcpl:%2f%2fwww.woot".

Firstly, upon cracking open the hood, I was a bit surprised that the separator strings are actually included in the internal data when I dug into this library. What do you think of the possibility of making it so that the data is a) parsed out of the separators entirely when parsing b) stored separately (for example, the paths stored independently of their separators rather than as a string, etc.) c) only later combined as part of the pretty-printing functionality (ie the show instance) which is dependent on any particular schemes in use.

Secondly, this came about primarily because I wanted to parse the above URI, but also I wanted to pull the key/value pairs that the query params encode into an association list or map so that I can have some code that responds based on those values pieces, and I didn't want to have two parsing phases, so I was hoping to extend or adjust the parsers in Network.URI, by grabbing a combinator and adding some extra functionality to the combinators, but I found it difficult to do that (because the necessary parsing functions aren't exposed as a separate library).

That is, this library almost provides what I want, I just wanted to use all the parsing combinators to parse with slightly different separators, and into my own internal data type, and pretty-print properly according to slightly different rules (and hopefully not record the specific separators as it does so).

Thirdly, I would really like if the implementation details were hidden from the outside (ie how things are stored / structured into data types, etc) for the "general use" library, and interface functions used instead... then the internal pieces (ie the parsing combinators) are put into one or two modules that Network.URI itself consumes which does the parsing and the pretty printing, possibly with installable parsers and pretty printers, then maybe the parsers/pretty printers that aren't general get put into a different library, then the interface functions built with that middle library. Tests could then be written against the outer interface rather than having knowledge of the guts of the data types or how it's written, etc. (I think at the moment the tests all use the internal data type directly, which couples the tests fairly tightly to the internal representation of the data).

That way, when someone wanted to take the work and extend it, say, build a similar but slightly different front, all they'd have to do is build "their own" library by importing your main library, then copy the "interface" code into their project and adjust. (ie use the library to build their own interface code).

Ah, this makes me wish Haskell had a module system where modules where first class functions and we could pass them around and extend them more easily by letting be extended through arguments.. oh well.

Below is a more in-depth examination of what I'm trying to do in one specific case.

That is, I'd like, for example, that the query params were separated by a custom character which is dependant on the scheme (and possibly installable upon use)... so, in this case we're using + instead of ? to separate the query params from the path segments, and for the key/values of params to be separated also by + rather than the more traditionally used & character. Also, the @ character is used in place of =.

Perhaps the active URI above isn't a valid URI? I've read through the spec a little, but it's reasonably dense and difficult to keep all in mind and interpret. I'm trying to imlement the schemes from "resource oriented computing" and it struck me that although they're a bit odd and non-standard in terms of the web, they seem like they should be valid.

In this scheme, also, we're using :/ rather than :// to separate the scheme from the rest of the URI.

Is this crazy? I hope I haven't put too much of a spanner in the works. So far in my attempts at adjusting things, I've only done some aesthetic adjustments. I hope this is the right place for this comment.

JulianLeviston commented 6 years ago

Subsequently, I've realised it seems ? is the only query specifier character, so that portion of my comment can be disregarded.

ezrakilty commented 6 years ago

@JulianLeviston thanks for you comments.

You have several quite different points here so it might be best to split them into separate tickets. But let me see if I can answer them.

  1. I think everyone that comes to this library (including myself) is initially surprised that the separators are included in the parsed components. At this point, it's a quirk of history that needs to be maintained. The library takes a conservative stance toward its interface and it wouldn't be appropriate to break it at this point. The cost is that we must conditionally strip off a character when using some fields; it's an annoyance but not devastating, I think.

(One could fashion another module, say Network.URI2 or Network.URI.Stripped, which would offer the data in the delimiterless style that everyone seems to expect. So far, it hasn't seemed urgent enough to pay the cost of maintaining another durable interface.)

  1. The active: URI scheme was never fully standardized, IIRC, but in the terms of RFC3986, the library correctly parses it. On the other hand, it doesn't extract as much information as you want, because the active: URI scheme defines further structure that doesn't apply to all URIs. (This is often true for other schemes; for example, the familiar query string format name1=value1&name2=value2 is defined only by the HTML form-submissions protocol; it's not even defined by the http: URI scheme). So, in any event, parsing active: in a meaningful way will require more code.

  2. In keeping with the conservatism of the library, it won't likely shift to use a typeclass or other data-abstraction technique anytime soon, though there be merits to that approach. On the upside, URIs have a very stable, "transparent," structure over the past few decades, so I don't anticipate needing to modify the internal representation.

You mention extending the work… if you did want to build another interface on top of it, you could of course do that, creating your own Network.URI2, say, which could leverage the basic algorithms in this library but expose a different set of constructors and accessors. That sort of "repackaging" seems to me more useful than adding a layer of dispatch to obtain representation independence, in this case.

Hope that helps—if not, please do open new tickets so that we can get a fresh, and focused, discussion going. Cheers—Ezra

JulianLeviston commented 6 years ago

@ezrakilty all good comments. Thank you very much. I agree with everything you've said. I just now finished the first spike for the adjustments I wanted to make and it's working for me quite well so far. I'm using the scheme to determine how to parse the separators which is probably a bad idea. Ideally the parser/printer could be configured with some state to allow it to know how to parse and pretty print these things. I'll no doubt have more input as time passes. You're right, there are too many issues in this one thing. And, one of them was with Haskell itself (that backpack seems to address — the module system). I'll try to keep things more clearly separated in the future. Thanks again! :)