mozilla / chronicle

find everything you've ever found
http://mozillachronicle.tumblr.com/
Mozilla Public License 2.0

normalize / canonicalize URLs to detect duplicates #228

Open jaredhirsch opened 9 years ago

pdehaan commented 9 years ago

I've briefly played with normalize-url and it seemed cool (although the logic seems simple enough for us to do ourselves).

We'd still need to run it through some filter that removes all those utm_* parameters (and any other parameters we may not want). I haven't found a node module which does that yet -- but it'd be nice to build and maintain one that also covers other unwanted tracking codes. @nchapman, I think you mentioned a Ruby module that does this? If so, we could look at what it filters against.

nchapman commented 9 years ago

https://github.com/postrank-labs/postrank-uri

pdehaan commented 9 years ago

Wow, that seems way better than my garbage module: https://github.com/pdehaan/strip-utm

Basically all strip-utm does is parse a URL string, remove any utm_ query params, call normalize-url to sort the remaining query params, and return the URL as a string.
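For reference, the parse / strip / sort flow described in this comment can be sketched with nothing but the WHATWG `URL` class that ships with Node. This is only a rough illustration and not the actual strip-utm or normalize-url code: the function name `stripTrackingParams` and the `prefixes` argument are made up for the example, and the built-in `searchParams.sort()` stands in for the sorting that normalize-url does.

```javascript
// Hypothetical helper for illustration; not the real strip-utm implementation.
// `prefixes` is an assumed parameter listing query-param prefixes to drop.
function stripTrackingParams(input, prefixes = ["utm_"]) {
  const url = new URL(input); // throws a TypeError on an invalid URL string

  // Snapshot the keys first, since deleting while iterating can skip entries.
  const keys = [...new Set(url.searchParams.keys())];
  for (const key of keys) {
    if (prefixes.some((prefix) => key.toLowerCase().startsWith(prefix))) {
      url.searchParams.delete(key);
    }
  }

  // Sort the remaining params by name so that parameter order does not
  // make two otherwise identical URLs look different.
  url.searchParams.sort();

  return url.toString();
}

// Two links to the same page that differ only in tracking params and order.
const a = stripTrackingParams("http://example.com/a?utm_source=x&b=2&a=1");
const b = stripTrackingParams("http://example.com/a?a=1&utm_medium=y&b=2");
console.log(a === b);
```

Keeping the prefix list as an argument would make it easy to extend the filter to other tracking codes later, which is the part that seems worth maintaining separately.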