Open wumpus opened 6 years ago
Hi @wumpus and thanks for the issue and excellent supporting data! Some of these look for sure like no-brainers to add to the core set of trackers to block, while others look a little more dangerous.
I'm in the midst of working on a system to allow users to add/remove their own trackers, in which case I'd be far more willing to put many of these into the defaults. If I get stalled out on that update, I'll probably just add them to a minor update when I get an hour or so to play with and test them.
If you don't see any motion on this in a week or so, please prod me. Thanks again!
Just noticed this one. A little googling says it's been around for a while, and that it's common enough that some reddit subs have banned using it:
It's not just a token to strip, though. Normally only Amazon designs urls this poorly!
Hi. I'm a search engine guy, and I'm very interested in a well-tested list of strippable CGI args to reduce the work my crawler has to do. I tried to build a list algorithmically: I took the top 1000 websites from an old Alexa list, plus a few hosts I care about, sampled their URLs as crawled by CommonCrawl, and counted which CGI args appeared on many of the hosts.
The biggest was &utm_source, appearing on 474 of the 1,000 hosts. I dropped everything that appeared on fewer than 5 hosts. So, in theory, this is a somewhat representative sample of the most popular ones... although CommonCrawl isn't totally representative of the web, of course.
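For anyone who wants to reproduce the counting step, it could look roughly like the sketch below. This is not the actual script used for the numbers above; the function names `count_hosts_per_param` and `frequent_params` are hypothetical, and it assumes the sampled URLs are already available as plain strings. Only the Python standard library is used.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def count_hosts_per_param(urls):
    """Map each query parameter name to the number of distinct hosts using it."""
    hosts_by_param = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        host = parts.hostname
        if not host:
            continue  # skip relative or malformed URLs
        # keep_blank_values=True so bare args like "?ref=" are still counted
        for name, _value in parse_qsl(parts.query, keep_blank_values=True):
            hosts_by_param[name].add(host)
    return {name: len(hosts) for name, hosts in hosts_by_param.items()}


def frequent_params(urls, min_hosts=5):
    """Return (param, host_count) pairs seen on at least min_hosts hosts,
    most widespread first (ties broken alphabetically)."""
    counts = count_hosts_per_param(urls)
    kept = [(name, n) for name, n in counts.items() if n >= min_hosts]
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```

Counting distinct hosts rather than raw URL hits keeps one very large site from dominating the list, which is why the cutoff (`min_hosts=5`, mirroring the threshold above) is expressed in hosts.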
Here is a list with examples of the ones that aren't currently in your configuration: