Open plambert opened 12 years ago
What do you think about using something like the canonicalisation format in https://github.com/mnot/squid-director ?
That'd work, but it would need a way to specify regexp substitutions with backreferences for query parameter values (and keys), whatever ;.* parameters are called, the URL path, etc.
Also, supporting lowercasing of backreferences would be important. It should be easy to rewrite http://FoO.bAr:80/BaZ/qUx.HTML to http://foo.bar/baz/qux.html for the purpose of aggregation.
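As a rough illustration of the rewrite described above, here is a minimal Python sketch using only the standard library. The function name `canonicalize`, the `lowercase_path` flag, and the `DEFAULT_PORTS` table are hypothetical names made up for this example, not part of squid-director or any existing tool. Lowercasing the path is only safe when the origin server treats paths case-insensitively, so it is left as an option.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only http and https default ports are handled here.
DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url, lowercase_path=True):
    """Lowercase scheme and host, drop the default port, optionally lowercase the path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # .hostname is already lowercased by urlsplit; userinfo (user:pass@) is
    # dropped in this sketch, and IPv6 literals are not handled.
    host = parts.hostname or ""
    port = parts.port
    netloc = host
    if port is not None and DEFAULT_PORTS.get(scheme) != port:
        netloc = f"{host}:{port}"
    path = parts.path.lower() if lowercase_path else parts.path
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))

print(canonicalize("http://FoO.bAr:80/BaZ/qUx.HTML"))
```

With `lowercase_path=False` the same call keeps `/BaZ/qUx.HTML` as-is and only normalises the scheme, host, and port.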
And that'd probably also be useful in squid_director.
Hmm.
What about a separate URL canonicalization library/tool that both can share?
Well, there are many levels to URL canonicalisation. E.g.,
The first two are pretty easy to do; the last really needs to be hinted by the site, like in the map that director uses.
IIRC director already does many generic and scheme-specific canonicalisations; I think we could easily do that here too.
Ideally, a config file with regular expression substitution-based rewrites would be awesome.
For example, a regex could replace UUIDs in URL paths with a placeholder like "[[UUID]]" so REST requests are aggregated together.
Another often-convenient one is simply s/\d+/[[DIGIT]]/g so that any numbers in the URLs are collapsed to a single token.
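To make the two rewrites above concrete, here is a small Python sketch of an ordered rule list. The names `RULES` and `apply_rules` are hypothetical, and the rules are hard-coded rather than loaded from a config file; the UUID pattern assumes the usual 8-4-4-4-12 hex layout.

```python
import re

# Ordered (pattern, replacement) pairs. Order matters: the UUID rule must run
# before the digit rule, otherwise the digits inside a UUID get collapsed first.
RULES = [
    (re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"), "[[UUID]]"),
    (re.compile(r"\d+"), "[[DIGIT]]"),
]

def apply_rules(url, rules=RULES):
    """Apply each substitution in turn, replacing every match (like s///g)."""
    for pattern, replacement in rules:
        url = pattern.sub(replacement, url)
    return url

print(apply_rules("/api/users/123e4567-e89b-12d3-a456-426614174000/orders/42"))
```

Since `re.sub` also accepts a function as the replacement, a rule that needs a lowercased backreference could pass something like `lambda m: m.group(1).lower()` instead of a string.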
In addition, double-bracketed slugs could be converted to spans in the HTML output with CSS for a pretty text "slug." So a UUID replaced with [[UUID]] would stand out in the HTML output as not being part of the original URL.
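One possible way to do the HTML side, sketched in Python. The function name `render_url`, the `slug` CSS class, and the assumption that slug names are uppercase letters only are all made up for this example.

```python
import html
import re

# Assumption: slugs look like [[UUID]] or [[DIGIT]], uppercase letters only.
SLUG = re.compile(r"\[\[([A-Z]+)\]\]")

def render_url(canonical_url):
    """Escape the URL for HTML, then wrap each [[SLUG]] in a styled span."""
    # Escape first so that &, <, > and quotes in the URL cannot inject markup;
    # html.escape leaves square brackets alone, so the slugs survive.
    escaped = html.escape(canonical_url)
    return SLUG.sub(r'<span class="slug">\1</span>', escaped)

print(render_url("/users/[[UUID]]?a=1&b=2"))
```

A stylesheet rule on `.slug` (a background colour or border, say) would then make the placeholder stand out from the literal parts of the URL.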