hanover-computing / canonicize-url

Get a stable, canonical version of any URL, with DNS and HTTPS checks, redirects, tracker stripping, and canonical link extraction!
GNU Lesser General Public License v3.0

Wait for clearURLs rewrite/how to address clearURLs' shortcoming? #1

Open JaneJeon opened 3 years ago

JaneJeon commented 3 years ago

Some of the rule semantics might change... or we might have to abandon tracker stripping entirely

JaneJeon commented 3 years ago

https://github.com/ClearURLs/Addon/issues/144

JaneJeon commented 3 years ago

An alternative I'm considering is to use clearURLs only for websites that won't automatically redirect you to the destination (e.g. YouTube links, Skimlinks, etc.) - i.e. rawRules.
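A rough sketch of what "skipping the intermediary page" could look like: match the wrapper URL against a small rule table and pull the destination out of a query parameter. The `REDIRECT_RULES` table, the `unwrap_redirect` name, and the two example patterns are hypothetical illustrations, not entries copied from the clearURLs data file or code from this repo.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical rule table: URL regex -> query parameter holding the destination.
# These entries are illustrative assumptions, not taken from the clearURLs rules.
REDIRECT_RULES = [
    (re.compile(r"^https?://(www\.)?youtube\.com/redirect\?"), "q"),
    (re.compile(r"^https?://go\.skimresources\.com/?\?"), "url"),
]


def unwrap_redirect(url: str) -> str:
    """Return the embedded destination if a rule matches, else the URL unchanged."""
    for pattern, param in REDIRECT_RULES:
        if pattern.search(url):
            values = parse_qs(urlsplit(url).query).get(param)
            if values:
                # parse_qs has already percent-decoded the value
                return values[0]
    return url
```

Anything that matches no rule falls through untouched, so it could still go through the normal redirect-following and stripping steps.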

For "stripping" everything else, I think relying on existing UBO rulesets like https://github.com/uBlockOrigin/uAssets and https://github.com/AdguardTeam/FiltersRegistry/blob/master/filters/filter_17_TrackParam/filter.txt might be better?

Obviously UBO blocklists 1. are fucking huge and 2. do a LOT more than just "strip trackers from URLs", but preprocessing these lists to drop the DOM-manipulating filters (or basically anything that doesn't touch the URL) and focusing on the lists that are usually named "privacy" something should help.
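As a minimal sketch of that preprocessing, assuming the only URL-touching filters we care about are the ones carrying a `removeparam` option: drop comments, list headers and cosmetic filters, and keep the rest only if they mention `removeparam`. The function name `keep_url_filters` is made up for illustration, and the cosmetic-filter check is a simplification rather than a full parser for the filter syntax.

```python
import re

# Cosmetic/scriptlet filters use a "#...#" separator such as ##, #@#, #?# or #$#.
# This is a simplified check, not a complete grammar for the filter syntax.
COSMETIC_SEPARATOR = re.compile(r"#[@?$%]*#")

# A removeparam option appears either right after "$" or after a "," in the option list.
REMOVEPARAM_OPTION = re.compile(r"[$,]removeparam(=|,|$)")


def keep_url_filters(lines):
    """Keep only the filter lines that carry a removeparam option."""
    kept = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!") or line.startswith("["):
            continue  # blank line, comment, or list header
        if COSMETIC_SEPARATOR.search(line):
            continue  # DOM-manipulating filter, never touches the URL
        if REMOVEPARAM_OPTION.search(line):
            kept.append(line)
    return kept
```

Everything else (plain blocking rules, scriptlets, cosmetic filters) is discarded, which should shrink the lists a lot before any real parsing happens.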

In that case, I would need:

  1. Auto-updating list of "privacy" filters to strip trackers off of the URLs - AdGuard privacy list, UBO default & easyprivacy lists
  2. Implement anti-breakage shit (UBO list)
  3. Script to deduplicate filters
  4. Implement UBO parser only for URL transformations
  5. Run UBO blocklist-based tracker stripping alongside clearURLs
  6. Remove clearURLs implementation except for rawRules matching (which is what allows us to skip the intermediary "redirect" pages)
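
Steps 3 to 5 could start out as something like the sketch below. It only understands the simplest generic form, `$removeparam=name` with a literal name, and skips host-specific filters, regex values and negations, so it is nowhere near a full UBO parser. `literal_param_names` and `strip_params` are hypothetical helper names, not existing functions in this repo.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

PREFIX = "$removeparam="


def literal_param_names(filters):
    """Collect de-duplicated parameter names from generic '$removeparam=name' filters."""
    names = set()  # a set takes care of duplicates across lists
    for line in filters:
        if not line.startswith(PREFIX):
            continue  # host-specific filter: out of scope for this sketch
        value = line[len(PREFIX):]
        if value.startswith("/") or value.startswith("~") or "," in value:
            continue  # regex value, negation, or extra options: skipped here
        names.add(value)
    return names


def strip_params(url, names):
    """Drop every query parameter whose name is in `names`."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

One caveat: `urlencode` re-encodes the surviving parameters, so their escaping may come out slightly different from the input (e.g. a space becomes `+`). If the original encoding has to be preserved byte for byte, the query string would need to be filtered without decoding it.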

The motivation for skipping clearURLs for the actual tracker stripping is that the existing clearURLs-based approach falls flat on obscure/Chinese sites that don't provide any sort of "hints":

JaneJeon commented 3 years ago

Resources for directly integrating UBO to strip bullshit from URLs:

Also I have been made aware that even the annoyances rulesets contain tracker stripping magic? https://github.com/uBlockOrigin/uAssets/blob/02d16a221c276fe58bdd72cc947b26eaf9d1318e/filters/annoyances.txt#L4560-L4561