Add *consistent* wildcard support

HansBrende commented 2 years ago

I realize that there is already a PR to add wildcard support. However, that PR is going to add some pretty big inconsistencies between named and anonymous wildcard groups, in a very confusing way. See: https://github.com/WICG/urlpattern/issues/163

The response over there so far has been "too bad, so sad, who cares."

So I'm taking the initiative to create this PR to at least demonstrate that there is a way forward with wildcard support that actually makes sense.

Why do I care? Because once URLPattern comes out of experimental status, we will be stuck with the choices we make today, FOREVER. If the path-to-regexp library itself chooses the right way forward, then maybe URLPattern will follow suit, since it seems to care a lot about compatibility with this library.

How did I go about making the two behaviors compatible with each other? Simply by not disallowing empty path segments in named wildcard groups, and not requiring a trailing slash in anonymous wildcard groups.

Hopefully my efforts here at least give some people food for thought.

P.S. the pre-commit hook that added 600 lines to this PR is really annoying.

P.P.S. Fixes #214. Also fixes or re-fixes the previously closed issues #228, #212, #196, #194 (for the case where it makes sense), #103, #87, and #37.

Detailed breakdown:

Issue	Pattern	Expectation	This PR	WICG PR	v6.2	v2 - v5	v0.1.7
#214	`/files/:path\.:ext`	/files/my/photo.jpg/gif	✅	❌	❌	✅	💀
#228	`/:everything*`	/	✅	❌	❌	❌	❌
#228	`abc/:everything*`	abc	✅	✅	✅	✅	❌
#212	`#/*`	#/	✅	✅	💀	❌	✅
#196	`*`	/any	✅	✅	💀	❌	✅
#194	`/foo/:bar*`	/foo/test1//test2	✅	❌	❌	❌	✅
#103	`/entity/:id/*`	/entity/foo	✅	❌	💀	❌	❌
#103	`/entity/:id/*`	/entity/foo/	✅	✅	💀	❌	✅
#87	`/test/*`	/test	✅	❌	💀	❌	❌
#37	`/some-path/*`	/some-path/anything	✅	✅	💀	❌	✅

(✅ = match; ❌ = no match; 💀 = parse error; all tests run in strict mode)

blakeembrey commented 2 years ago

Let me fix the codebase, it's my bad for merging a dependabot PR (which I also had to disable due to sheer volume of noise) which bumped prettier without the concurrent re-formatting. I'll fix that now and you can rebase the changes.

Aside from that, I'm less sure about the actual behavior of /files/:path*\.:ext* and /:everything* passing. Especially the first will be a regression for some users who might be using this in a non-path positions. Same with the ones allowing //, the implication there is that we're allowing empty segments. Does that mean /:test should work with just /?

blakeembrey commented 2 years ago

The other thing to keep in mind with * support is backward compatibility with existing scripts. The behavior of /test/* is actually incompatible with old path-to-regexp and incompatible with people migrating from other related syntaxes, such as older specs in web that might just treat * as a simple wildcard. Making it behave as a segment makes it harder for customers to properly migrate. An example in hostname search would be http://*.example.com which clearly does subdomain matching today, but in a world where it acts like a segment the subdomain becomes optional (currently path-to-regexp is only prefix segments, but the example still works if you do foo.*.example.com instead).

Overall the most unfortunate decision in consistent behavior has been the inclusion of magic prefixes which makes all this behavior harder to follow. If it was explicit the confusion would not exist, e.g. foo{.*}.example.com would be an optional domain segment while foo.*.example.com is explicit. The same concept for paths too, e.g. foo/* vs foo{/*}.

blakeembrey commented 2 years ago

Also forgot to mention in my initial review but the summary table you created is really awesome regardless of the actual output of the PR 💯

HansBrende commented 2 years ago

@blakeembrey thanks for the responses and follow-up questions.

Does that mean /:test should work with just /?

No, I would say definitely not. In the absence of a wildcard character, a variable should always have a non-empty value, IMO. The reason it makes sense to me to allow empty path segments when a wildcard is specified is that in this case, the intention is clear: devs want to match for anything. So any valid path should match... regardless of whether or not it contains empty path segments. (This is particularly useful, fyi, in routing libraries. See, e.g., react-router or wouter for examples of how they are using the same syntax. It is quite unexpected behavior to specify a 404 for anything that doesn't match the previous routes, only to have a blank page instead of your 404 page show up because the actual url typed in by the user happened to contain an empty path segment.)

Especially the first [/files/:path.:ext] will be a regression for some users

But in this case my PR is consistent with v2 thru v5, right? So wouldn't that be fixing a regression?

...in a world where it acts like a segment the subdomain becomes optional (currently path-to-regexp is only prefix segments, but the example still works if you do foo.*.example.com instead)

True, however... the behavior of the WICG PR will do a similar thing by allowing foo..example.com to match, which isn't even a valid hostname. If users really want correct behavior in this case they should probably be doing foo.(.+).example.com, or just specify that the only prefix in the config should be "/" and not both "." and "/". Alternatively, I could also simply amend this PR to only do this for forward slashes and not dots, since that's really the use case anyone cares about, and paths & hostnames obviously have different semantics. I agree that abc.* makes slightly less sense to have an optional trailing dot than abc/* with the optional trailing forward slash.

EDIT: and honestly I'm not sure the subdomain issue is much of a problem in the first place, because even though foo.*.example.com "matches" foo.example.com with this PR, the regexp exec result[1] will be undefined rather than empty string. So if that's really important to the developer to have it not match a url missing the subdomain portion (which... I would argue... is probably less likely than the alternative use-case), they could instead check if the result of the match is not undefined; i.e., they could simply switch the logic from pattern.exec(path) != null to pattern.exec(path)?.[1] != null.

EDIT #2: OR simply do foo.{*}.example.com (which basically aligns with existing syntax/behavior anyways).

Thoughts?

blakeembrey commented 2 years ago

The reason it makes sense to me to allow empty path segments when a wildcard is specified is that in this case, the intention is clear: devs want to match for anything.

Isn’t that why you’d want the asterisk or (.*) group to make it clear? I don’t completely disagree, but it does make for some changes in other places. E.g. does path matching that makes arrays today leave in the empty holes or remove them? Does the optional delimiter also allow empty?

blakeembrey commented 2 years ago

Also something worth keeping in mind is whether we actually care about all these edge cases. In theory if you’re allowing repeated delimiters in one place shouldn’t we just allow it everywhere? But that makes for a more complex regex. And I’ve been of the opinion that the simple solution there is to fix it by normalizing paths at the router level before path matching. Arguably even the trailing slash should be at the router level to normalize it first.

HansBrende commented 2 years ago

@blakeembrey

Isn’t that why you’d want the asterisk or (.*) group to make it clear?

This could be done, however, there are a couple reasons why this would pose a major burden:

(1) Most front-end routing libraries (e.g. wouter, react-router) only expose a subset of the path-to-regexp syntax, and that subset does not include custom regexps at all! 99.999% of the routing architecture use-case can be accomplished without custom regexps, so it's a bit awkward to require one solely for the most common use-case in routing. (2) Subrouting/404-page handling is such a common use-case that I'd guess probably 50% of patterns being matched for in the routing use-case want the equivalent of a trailing wildcard. The cost of making one specify a regexp for this use-case every. single. time. is going to add up. (3) The wildcard syntax already exists... so devs WILL use it and prefer it over a manual regexp. And they will do so expecting it to mean one thing ("give me everything else!"). And in 99.999% of cases, it will mean that thing, so they won't change it. Except for the one rare edge case where the user mistypes an extra slash and the entire application breaks. In short: the wildcard syntax is SO handy that it would be a tremendous shame for it to fail on 1 odd-ball edge case in literally THE most common use-case for it.

does path matching that makes arrays today leave in the empty holes or remove them?

This is the part I don't really care about too much since I'm doing my own splitting logic when necessary... and only using the path-to-regexp part of the library (usually splitting is not necessary at all since the remaining path represented by the wildcard is either delegated to a subrouter, or sent to a 404 page). I do believe that the question of "how regexp match results are split" should be an entirely separate & independent question from the standardization of the actual generated regexp. That being said, it might make sense to leave in the holes here. Or take them out, either way. My gut tells me to leave them in, that way the original path can be reconstructed by joining the match with the "/" delimiter, rather than being lossy. Since the dev did not specify an overriding pattern, we're simply making the default pattern be (.*) in this case to unify behavior with the unnamed wildcard. Which allows empty string. So empty strings should be valid in the array. (I would think.)

Does the optional delimiter also allow empty?

No. The use-case for the optional quantifier is pretty much 100% across the board either an optional path segment or an optional affix of a single path segment. Therefore, the optional quantifier should not imply a different default pattern than that of a mandatory path segment variable. The use-cases where a dev would want to match a variable or empty string in the former case are basically non-existent, since nobody writes their routes that way (for a very good reason). And if they did, they could simply use 2 patterns: 1 for the non-empty case (e.g. foo/:var/bar), and one for the empty case (e.g. foo//bar). (Alternatively: foo/:var([^/]*)/bar if they didn't want 2 patterns). That way it's explicit that they are expecting a very weird use-case to be handled. With the wildcard, however, the main use-case is that you really don't care about the internal structure of the match at all, if you haven't specified a pattern. Which means that the default pattern for wildcards should be "anything".

And I’ve been of the opinion that the simple solution there is to fix it by normalizing paths at the router level before path matching.

That's not a robust solution though, because sometimes empty path segments actually mean something. (E.g. https://en.wikipedia.org/wiki///). If you normalize prior to putting the URL through the router, then you are arbitrarily deleting information from the URL that might indeed be used by subrouters. (If only to simply display the correct error message to the user, e.g. "404 path suffix foo////bar was not found in the directory!") This poses a particular problem for generic routing libraries, who cannot make the assumption that the user's routes can be normalized without changing the semantics.

HansBrende commented 2 years ago

@blakeembrey P.S. -- if you get a chance and haven't already, please do check out the original issue I filed in URLPattern: https://github.com/WICG/urlpattern/issues/163.

In that issue, I go a lot more in-depth into all the pros & cons that I see here, as well as listing out tables of possible different behaviors. (The behavior I've implemented in this PR is equivalent to "Proposed Behavior 1" in that issue.)

blakeembrey commented 5 months ago

I have added the tests from this PR description to the library in https://github.com/pillarjs/path-to-regexp/commit/cb27d379bc4d604cf9344ddec4fd5a0f18e4147d to be released in v7. Unfortunately I haven't supported * standalone behavior with an optional / prefix in the match due to difficulty maintaining compatibility with the match and compile parts of the library. I'm probably missing something obvious but I can't get the existing tests passing while also support standalone * and suspect I need to think more about it, it's definitely been giving me a headache.

The approach I took to get as close as possible was to allow a set of characters to repeated indefinitely. In this case just / by default. This maintains existing behavior and enables some failing cases above.

HansBrende commented 5 months ago

@blakeembrey Cool stuff! I made sure all the tests passed whenever I wrote this PR (if I recall correctly). Which test(s) was failing that you wanted to succeed with the standalone *?

blakeembrey commented 5 months ago

Which test(s) was failing that you wanted to succeed with the standalone *?

It's mostly the decoding of repeat parameters, I had to make it a regex match to get each sub-section due to the split now being regex-like, and .* results in empty parameter parts. It's not impossible to fix, but not ideal either.

I'm more in the camp of treating * as a separate behavior (such as .*) after playing around with it for the past few days. It's hard to make these two things reconcile and feels like a losing battle.

On the other hand, I'm trying to recall conversations with @wanderview on this and IIRC the * was for backward compatibility with existing patterns in the web platform. But I'm unsure if the existing platforms expect .* or something else, since I'm kind of guessing it was for patterns like *.example.com which matches glob behavior more than it does .* behavior.

Finally, on that note, I'm also wondering out loud whether trying to make these two things reconcile is a waste and instead I should just support the basics expected of glob matching such as *, **, etc. A lot to unpack here, I don't expect any answers, but certainly appreciate perspective 😄

HansBrende commented 5 months ago

@blakeembrey gotcha! Am I correct in stating then, that the current behavior you've implemented should nearly* match my "Proposed Behavior 2"?

If that's the case, obviously that works for me since I proposed it as the minimal necessary change to ensure basic consistency.

My perspective would be, it makes perfect sense to go with "Proposed Behavior 2" if you are planning on adding glob-matching-consistent support for *. Otherwise, "Proposed Behavior 1" (implemented in this PR) seems more ideal.

(In either direction you go, I believe .* is supported only inside parentheses, so there is no possible overlap in semantics with that particular syntax.)

* Proposed Behavior 2 with one tweak	/foo	/foo/	/foo/bar/	/foo//bar
`/foo/:rest*`	✅	✅	✅	✅
`/foo/*`	❌	✅	✅	✅

blakeembrey commented 5 months ago

that the current behavior you've implemented should nearly match my https://github.com/whatwg/urlpattern/issues/163?

So currently all I've done is add an option to make / become /+ which resolves most of the concerns around (.*) (since that's similar to [^/]+(/+[^/]+)*. The one gotcha is that standalone *.

So I believe it currently matches "Proposed Behavior 1" as a result of that.

I do want to make sure encoding/decoding works with whatever I do, so having * behave like [^/]*(/+[^/]*)* is pretty close to :path* behavior, but not the same. But I'd be in favor of making that change and just having * behave differently after this exercise.

But if they are going to behave differently, I am wondering if it should just match glob semantics that people already know, allowing for it to be used as *.example.com or **.example.com, and /*/file.png or /**/file.png.

HansBrende commented 5 months ago

@blakeembrey ahhh, after running through all my tests with the new behavior, I see what you are saying.

First of all, amazing work!!! Except for that one test case (* and "/any"), it seems that all the weirdness is fixed with the new behavior! sigh of relief

Second thing: it's worth noting that, for that one failing test case, the results in the current behavior are consistent between named and anonymous wildcard groups.

I.e.: * "/any": NO MATCH :rest* "/any": NO MATCH

However, I think it would be easy enough to fix this in both cases (should you want to) simply by prefixing that component of the regex by something like the following:

(?:^\/+)?

i.e., throwing away any preceding delimiters if they occur at the beginning of the string (in which case they are not handled by the previous path segment's regexp simply because there is no previous path segment regexp to strip them away). Thoughts?

EDIT: Or, on further inspection:

I took a look at the generated output for */*, which is as follows (after replacing (?:[^\/]+?)(?:\/+(?:[^\/]+?))* with "RELPATH_REGEX":

^(?:(RELPATH_REGEX))?(?:\/+(RELPATH_REGEX))?(?:\/+)?$

This is very similar to the regex for * alone:

^(?:(RELPATH_REGEX))?(?:\/+)?$

So, it seems it would suffice to simply change the first capturing group in the list to be just like the others, except with \/* instead of \/+, i.e:

^(?:\/*(RELPATH_REGEX))?(?:\/+)?$,

or, even just preceding the whole expression with an optional slash, similar to how the expression is suffixed with an optional slash, i.e:

^(?:\/+)?(?:(RELPATH_REGEX))?(?:\/+)?$

which can be simplified, by the way, to:

^(?:\/*)(?:(RELPATH_REGEX))?(?:\/*)$

blakeembrey commented 5 months ago

throwing away any preceding delimiters if they occur at the beginning of the string (in which case they are not handled by the previous path segment's regexp simply because there is no previous path segment regexp to strip them away).

I considered this but it's a little odd to me. No strong opinions, but it does tend to feel like (.*) is simpler and still split on the / anyway when decoding 🤷 Although it's consistent, I suspect people might be surprised when using * to find it's not just (.*). Which sucks to say, but I'm leaning toward URLPattern having made a reasonable decision on this one. Trying not to be biased by the fact they've already made the change, but it is tricky for existing user expectations. Especially because I'd be adding this back for Express.js v5 which has * mapping to (.*) already.

To you, is there any difference at the point of the above between .* and /+?[^/]+...? I'd split it similarly and the behavior should add up the same way.

HansBrende commented 5 months ago

@blakeembrey I edited my above comment to add a few more suggestions.

I nearly agree with you: * should match /any, on that we are agreed. And it doesn't matter to me whether it matches (.*) or /+?(RELPATH). The former I think is even preferable. The only question in my head is:

Shouldn't :rest* also match /any and give consistent results with *?

My strong opinion is that if * matches, then :rest* matches, and vice versa. Any reason for this not to be the case?

My weaker opinion is that they should both give consistent matching results. But if in this one case, :rest* strips off the leading slash and * does not, I probably wouldn't mind.

(Note: I think this choice would also apply to all paths of the form */rest/of/path and :rest*/rest/of/path).

blakeembrey commented 5 months ago

Any reason for this not to be the case?

No good reason. It just feels kind of janky for that to behave differently to /:path? and /:path+. Shouldn't they all allow that optional / at the beginning? So far no one has made a real issue about :path* not matching / at the beginning, so I believe it's been acceptable and understandable, while I do believe the * by itself will cause a bunch of gotchas.

blakeembrey commented 5 months ago

I think I'm seeing * in those cases more as a modifier to make repeatable patterns in matching, vs trying to make a named wildcard. I'm seeing * by itself as just a wildcard. And I believe the /+ fix makes each more reasonable now and fixes the caveats I've seen mentioned.

HansBrende commented 5 months ago

No good reason. It just feels kind of janky for that to behave differently to /:path? and /:path+. Shouldn't they all allow that optional / at the beginning?

@blakeembrey Those two, unlike :path*, are prefixed with /. To compare apples to apples, we'd need to consider :path? and :path+.

Should * match /any? Both agree yes ✅
Should :rest* match /any? (I would argue yes to bring consistency with * to 100%... just seems a shame to stop at 99.9% consistent!)
Should :rest+ match /any? (I would argue yes for consistency with :rest*)
Should :item and :item? match /any? (I would again say yes, given that HTTP URL paths are always guaranteed to begin with a forward slash, so the only possible semantics the user could have intended by putting :item/... at the beginning of the path is if it was meant to match the first path component, regardless if the root-level forward slash were still present or stripped off). This would be a convenient way to match the first path component, regardless if it were relative or absolute.

Having said all that, I do not have a strong opinion on this point after all, since I realized that this distinction will never come up for me anyways in real life... given that all URL paths always start with a forward slash (and therefore I can always pre-normalize path patterns to include a preceding forward-slash if they are missing one prior to generating the regexp).

Making the corresponding root-level forward slash optional in the path pattern would be convenient since I could then omit all the first forward slashes from patterns and they'd still match the right paths, and it would boost that consistency between * and :rest* to 100% which would be neat.

(Actually, probably the biggest tangible bonus I can think of to doing that would be to simplify the documentation: you could just say that * is exactly the same as :named_wild* without using the word "except" 😄 .)

But I think either way would work as both solve all of the actual, tangible use-cases here.

HansBrende commented 5 months ago

@blakeembrey to illustrate what I mean above:

const url = new URL('https://example.com');
url.pathname = 'any'; // No forward slash!
console.log(url.pathname);

prints:

'/any'

To summarize: since a pathname without a leading forward slash is illegal, one behavior that would make sense (and bring us to 100% consistency) would be that any path pattern without a leading forward slash would assume that it should be lenient towards the leading forward slash that will (necessarily) be present, since that's the only possible semantics it could have other than "match no path at all".

blakeembrey commented 5 months ago

since a pathname without a leading forward slash is illegal, one behavior that would make sense (and bring us to 100% consistency) would be that any path pattern without a leading forward slash would assume that it should be lenient towards the leading forward slash that will (necessarily) be present

Good call, I'll think about it today. There are use-cases outside of paths but it could be a config option.

On a final note, if we did have full consistency with that change, how do you feel about /+ and /?? I'm kind of ambivalent about them existing, which is another reason I haven't felt it was required too much to maintain consistency between :path* and *.

HansBrende commented 5 months ago

@blakeembrey I'm not sure what /+ and /? are... I have never used syntax like that or known of its existence.

Or are you talking about as they occur inside the generated regexp? What is the context in which those would be used?

blakeembrey commented 5 months ago

Oh, the simple change in my PR was that all modifiers "just worked" as is. So it's the same behavior as if it's :path+, :path?, :path*. It was actually less code to enable it that way when I was playing around with it 🤷

blakeembrey commented 5 months ago

PR example: https://github.com/pillarjs/path-to-regexp/commit/246ec9f6aa2cc2b90d249438ac053f57caecdf1b

I've since reverted trying to think over how * should behave, whether they should all exist, etc. As you can see I've been playing with a few versions of the behavior (the original having ?, +, * act like regex instead of parameters).

HansBrende commented 5 months ago

@blakeembrey I think I would need a specific example of a path with one of those syntaxes in it to say for sure, but if you're talking something like:

/my/+long/+path/+which/+can/+have/+multiple/+slashes, I think that would be an eyesore.

(Not to mention, then /* would be overloaded in a very bad way.)

blakeembrey commented 5 months ago

Maybe it's not clear, it's just the same as * is acting as :0*, + would act as :0+. So you'd use it when you want to match one or more parameters, like today, instead of * with zero or more. So /foo/+ would match /foo/bar, /foo/bar/baz, but not /foo. However, * (under the parameterized version, not the .* version) would match all of them.

HansBrende commented 5 months ago

Ahhhhh, I see what you mean. I thought you were going for /+ matching /, //, ///, etc. 😆

That makes perfect sense to me. For example, currently I would accomplish the same thing via:

/foo/:first/* (and then ignore "first").

There are probably not as many use-cases for these versions though... but if it's simpler to implement, that is certainly compelling... simplicity is often the best strategy.

One thing I would mention is that /? might be slightly confusing, since one could quite understandably mistake that as an optional trailing slash. (Whereas nobody would mistake /* as repeated trailing slashes due to the ubiquitousness of glob matching). I don't think that would be a blocker though, if you wanted to go that route.

HansBrende commented 5 months ago

On further thought, if you did go that route, I would advise only doing it for + and *, but not ?.

That is in part due to the potential confusion I mentioned above, but also in part due to a different inconsistency: we'd have unnamed wildcards, unnamed optional segments, but no unnamed required segments. So you'd almost have to add a standalone : for consistency with the standalone ?.

In general, I think the only one this is actually important for is * (by virtue of being the only one with overloaded semantics in the first place, and the one that covers 99.999% of use-cases), but + could make sense as well depending on how you slice the cake (if you did want to add ?, I would also add : though, otherwise there is a new consistency gap.)

blakeembrey commented 5 months ago

Thanks, I agree with your assessment, it’s part of why I’ve been thinking the asterisk should just follow the URLPattern route, although everything can pass your tests. It’s also why I was thinking about other existing matching schemes, like globs.

HansBrende commented 5 months ago

@blakeembrey on further further thought (sorry for my many iterations here), I would recommend not doing it for anything except *. Reasons being:

* is the only one currently with overloaded semantics, and the fewer operators with overloaded semantics, the better, in my opinion.
* is widely recognized as a globbing operator or wildcard, which isn't true for the others, so there's bound to be more confusion, since the others are only widely recognized as modifiers.
All of the others can be rewritten using *, leading to arguably more readable and explicit code. For example:
- /foo/+ can be rewritten as /foo/:_/*
- /foo/? should really be written as /foo/:_? for explicitness
- /foo/: can obviously be written as /foo/:_
Most importantly, * has all the use-cases behind it; the others do not.

That being said, still think it's important that * (only because it already exists as a standalone operator) at least have consistency with :rest*! That is already pretty much achieved in your current implementation, which I am very happy with 😃

And if you decide to make that one further tweak above concerning leniency to the root leading slash (bringing the other operators into alignment with * when not prefixed with a leading slash), that consistency will be 100% as if by magic.

blakeembrey commented 3 months ago

You probably saw https://github.com/pillarjs/path-to-regexp/pull/269#issuecomment-2227138973, but I went more extreme after much contemplation and decided the implicit prefix stuff isn't worth the hassle and repeating doesn't match much sense without implicit prefixes. So now ?, *, and + are only valid after {} params.

This should help me write better bug detection on user patterns. I'm waiting for feedback publicly if anyone hates having to use {} but I'm hoping it makes it easier for a new user to quickly see and understand what's going on. No need to wonder why /:bar* and #:bar* act entirely differently (i.e. latter is broken).

I also want to apologize in advance if it feels like I wasted any of your time too, as we didn't use this PR or solution, because I want to be super clear that I really appreciated this issue and your input directly.

pillarjs / path-to-regexp

Add consistent wildcard support #270

pillarjs / path-to-regexp

Add *consistent* wildcard support #270

Add consistent wildcard support #270