newhouse / url-tracking-stripper

An open-source Chrome Extension that removes tracking parameters from URLs to keep them shorter and cleaner for sharing, bookmarking, etc. It also skips any known redirects and takes you straight to the target URL instead of passing you through an intermediate URL.
MIT License
192 stars, 26 forks

More tokens to strip #48

Open wumpus opened 6 years ago

wumpus commented 6 years ago

Hi. I'm a search engine guy, and I'm very interested in a well-tested list of strippable CGI args to reduce the work my crawler has to do. I tried to build a list algorithmically: I took the top 1,000 websites from an old Alexa list plus a few hosts I care about, sampled their URLs as crawled by CommonCrawl, and counted which CGI args appeared on many of the hosts.

The biggest was &utm_source, appearing on 474 of the 1,000 hosts. I dropped everything that appeared on fewer than 5 hosts. So, in theory, this is a somewhat representative sample of the most popular ones, although CommonCrawl isn't totally representative of the web, of course.
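For anyone who wants to reproduce the counting step, here is a minimal sketch of the idea, not the actual script used for the numbers in this issue. The function name `count_hosts_per_param`, the `min_hosts` parameter, and the `example` hosts are all made up for illustration; it assumes the sampled URLs are already available as a plain list of strings.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def count_hosts_per_param(urls, min_hosts=5):
    """Return {param_name: distinct_host_count} for params seen on >= min_hosts hosts."""
    hosts_by_param = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        host = parts.hostname
        if not host:
            continue  # skip anything without a usable host
        # keep_blank_values=True so that "?foo=" still counts as an occurrence of foo
        for name, _value in parse_qsl(parts.query, keep_blank_values=True):
            hosts_by_param[name].add(host)
    counts = {name: len(hosts) for name, hosts in hosts_by_param.items()}
    return {name: n for name, n in counts.items() if n >= min_hosts}


# Tiny made-up sample; a real run would feed in the sampled crawl URLs.
sample = [
    "https://a.example/?utm_source=x&id=1",
    "https://a.example/page?utm_source=y",
    "https://b.example/?utm_source=z&page=2",
]
print(count_hosts_per_param(sample, min_hosts=2))  # {'utm_source': 2}
```

Counting distinct hosts rather than raw occurrences keeps one very large site from dominating the list.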

Here is a list with examples of the ones that aren't currently in your configuration:

# more utm_ -- I think people use utm_ as a prefix for their own purposes and/or Google doesn't document all of them

# https://www.mozilla.org/en-US/firefox/new/?f=30&ref=producthunt&utm_expid=71153379-28.SNKFJ4VqRziIW1TLqjhpAw.1&utm_referrer=https%3A%2F%2Fwww.google.com%2F

utm_expid (15 hosts)
utm_referrer (12 hosts)

# https://www.etsy.com/?utm_source=google&utm_medium=cpc&utm_term=etsy&utm_campaign=search_fr_fr-fr-src-pure-brand-exact-st_exact_etsy&gclid=EAIaIQobChMIk6Duvp6n1QIVjantCh1f-whGEAAYASAAEgLsx_D_BwE&gclsrc=aw.ds

gclsrc 22 hosts

# https://www.google.fr/chrome/browser/features.html?brand=CHBD&gclid=CN6B2tjusdECFVAQ0wodfmcISw&dclid=CM6vjtnusdECFcSjUQodyg4B2Q

dclid 21 hosts {similar to gclid?}

# normally cookies

# Adobe ColdFusion
# https://techcrunch.com/?CFID=8494701&CFTOKEN=56974155

&CFID= 25 hosts, 70 total instances
&CFTOKEN= 25 hosts, 70 total instances

# PHP
# http://instagram.com/p/BUPpEcIDFjT/?PHPSESSID=dbj4v5fl2c6sd8f8986aprqpf3

&PHPSESSID= 5 hosts, 89 total instances

and here are the popular ones that you don't have at all:

# Web Trends

# http://www.nature.com/collections/dtfkmdgglg?WT.mc_id=SFB_NA_1017_FattyLiverGraphic
# https://www.microsoft.com/en-us/store/b/accessories?tid=vpOCJmmq&cid=5250&pcrid=3050714533&pkw=makerbot%20replicator%202%20desktop%203d%20printer&pmt=e&WT.srch=1&WT.mc_id=pointitsem_Microsoft+US_bing_5+-+Accessories&WT.term=make
# https://www.chase.com/ccp/index.jsp?pg_name=ccpmapp/shared/assets/page/repayment_examples&WT.ac=st_ctr_student&jp_aid=st_ctr_student&WT.mc_id=st_ctr_student_repayment&jp_mep=st_ctr_student_repayment&WT.pn_sku=repayment_plans&memberid=studentcenter
# https://www.intuit.com/company/press-room/press-releases/2013/QuickenPullsBacktheCoversonLoveandMoney/?WT.qs_osrc=TST-164886110

&WT.mc_id= 24 hosts, 2530 total instances
&WT.srch= 14 hosts, 422 total instances
&WT.ac= 8 hosts, 4094 total instances
&WT.qs_osrc= 5 hosts, 20 total instances
&WT.pn_sku=

# Oracle Eloqua

# http://www.cray.com/company/policies-and-practices/privacy-policy?elqTrackId=2e97d2d4f56e41eb9498379bab9753db&elqaid=584&elqat=2
# http://www.blackboard.com/Platforms/Collaborate/Resources/Webinars-and-Demos.aspx?elq=a318adfc3e7e40de83e0883a1d6760ba&elqCampaignId=329

&elqTrackId= 12 hosts, 191 total instances
&elqaid= 12 hosts, 189 total instances
&elqat= 12 hosts, 189 total instances
&elqCampaignId= 7 hosts, 138 total instances
&elq= 7 hosts, 111 total instances

# comScore Digital Analytix:

# http://www.dailymail.co.uk/sport/rugbyunion/article-5082539/France-23-28-New-Zealand-Blacks-French.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490
# http://www.hotstar.com/tv/cineplay/13080?ns_mchannel=Article&ns_source=Scroll&ns_campaign=Cineplay&ns_linkname=CineplayShowPage&ns_fee=0

&ns_campaign= 6 hosts, 97 total instances
&ns_mchannel= 5 hosts, 92 total instances
&ns_source=
&ns_linkname=
&ns_fee=

# suspicious but probably too generic

# https://www.cray.com/?leadsource=website&srcdes=seagate&campaign=7010b0000018kLW
&campaign= 15 hosts, 9072 total instances

# https://wordpress.com/create/?utm_source=bing&utm_campaign=WordPress-Generic-Exact-US-GP&utm_medium=cpc&keyword=wordpress&creative=9925335912&campaignid=128065278&adgroupid=3099786316&matchtype=e&device=c&network=o
&campaignid= 6 hosts, 74 total instances
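
To make the list concrete, here is a rough sketch of stripping the names above from a URL. This is not how the extension itself does it; `STRIP_PARAMS` and `strip_tracking` are names invented for this example. It assumes exact, case-sensitive matching on the parameter name, and it deliberately leaves out `campaign` and `campaignid`, since those are flagged above as probably too generic.

```python
from urllib.parse import urlsplit, urlunsplit

# Names taken from the list in this issue; exact, case-sensitive match (an assumption).
STRIP_PARAMS = {
    "utm_expid", "utm_referrer", "gclsrc", "dclid",
    "CFID", "CFTOKEN", "PHPSESSID",
    "WT.mc_id", "WT.srch", "WT.ac", "WT.qs_osrc", "WT.pn_sku",
    "elqTrackId", "elqaid", "elqat", "elqCampaignId", "elq",
    "ns_campaign", "ns_mchannel", "ns_source", "ns_linkname", "ns_fee",
}


def strip_tracking(url):
    """Drop query parameters whose name is in STRIP_PARAMS; leave the rest untouched."""
    parts = urlsplit(url)
    # Filter the raw "name=value" pairs so the surviving ones keep their original encoding.
    kept = [
        pair for pair in parts.query.split("&")
        if pair and pair.split("=", 1)[0] not in STRIP_PARAMS
    ]
    return urlunsplit(parts._replace(query="&".join(kept)))


print(strip_tracking("https://techcrunch.com/?CFID=8494701&CFTOKEN=56974155"))
# https://techcrunch.com/
```

Filtering the raw pairs instead of decoding and re-encoding the query means untouched parameters come out byte-for-byte as they went in.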

newhouse commented 6 years ago

Hi @wumpus and thanks for the issue and excellent supporting data! Some of these look for sure like no-brainers to add to the core set of trackers to block, while others look a little more dangerous.

I'm in the midst of working on a system to allow users to add/remove their own trackers, in which case I'd be far more willing to put many of these into the defaults. If I get stalled out on that update, I'll probably just add them to a minor update when I get an hour or so to play with and test them.

If you don't see any motion on this in a week or so, please prod me. Thanks again!

wumpus commented 6 years ago

Just noticed this one. A little googling says it's been around for a while, and that it's common enough that some Reddit subs have banned using it:

https://www.youtube.com/attribution_link?a=dRBqlLWtf5U&u=%2Fwatch%3Fv%3Dpogq2tZFKKo%26feature%3Dshare

It's not just a token to strip, though: the real target is percent-encoded inside the u parameter. Normally only Amazon designs URLs this poorly!
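
A possible way to handle it, sketched under assumptions: the function name `unwrap_attribution_link` is made up, and the only thing it relies on is what is visible in the example URL, namely that `u` holds a percent-encoded, site-relative path. It leaves `feature=share` in place; whether that should also go is a separate question.

```python
from urllib.parse import urlsplit, parse_qs, urljoin


def unwrap_attribution_link(url):
    """Return the target of a YouTube attribution_link URL, or the URL unchanged."""
    parts = urlsplit(url)
    if parts.hostname not in ("www.youtube.com", "youtube.com"):
        return url
    if parts.path != "/attribution_link":
        return url
    targets = parse_qs(parts.query).get("u")  # parse_qs percent-decodes the value
    if not targets:
        return url
    target = targets[0]
    # Only follow site-relative paths; "//host/..." would switch to another host.
    if not target.startswith("/") or target.startswith("//"):
        return url
    return urljoin(url, target)


print(unwrap_attribution_link(
    "https://www.youtube.com/attribution_link?a=dRBqlLWtf5U&u=%2Fwatch%3Fv%3Dpogq2tZFKKo%26feature%3Dshare"
))
# https://www.youtube.com/watch?v=pogq2tZFKKo&feature=share
```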