lightswitch05 / hosts

Hostfile blocklist for ads and tracking, updated regularly
https://www.github.developerdan.com/hosts/
Apache License 2.0

Remove "duplicates" by removing all subdomains when a domain.tld is already given #393

Closed thomasmerz closed 1 year ago

thomasmerz commented 1 year ago

My follow-up PR will remove all subdomains of any domain.tld that is already listed: once domain.tld is blocked, every subdomain is blocked as well, so there is no need (except for "marketing") to blow up this adlist.
This reduces the size of the blocklist, improves performance, and makes the list easier to maintain 😃

"Explanation":

$ host pixel.ad
pixel.ad has address 0.0.0.0
pixel.ad has IPv6 address ::

$ host $(sha512sum README.md | awk '{print $1}' | cut -b12-40).pixel.ad
c88553a937ca37301f26773d67c42.pixel.ad has address 0.0.0.0
c88553a937ca37301f26773d67c42.pixel.ad has IPv6 address ::
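
The lookups above only behave this way if the blocker treats each entry as a wildcard, i.e. as a suffix match on the queried name. A plain /etc/hosts file matches exact names only, so this is an assumption about the resolver in use. A minimal sketch of that suffix rule, where `covered_by` is a hypothetical helper name and not part of this repository:

```shell
#!/bin/bash
# Hypothetical helper: print the blocked entry that covers a hostname under
# wildcard (suffix) matching, or return 1 if no entry covers it.
# usage: covered_by HOSTNAME BLOCKED_DOMAIN...
covered_by() {
    local name=$1; shift
    local entry
    for entry in "$@"; do
        # covered if the name equals the entry or ends in ".entry"
        if [[ $name == "$entry" || $name == *".$entry" ]]; then
            printf '%s\n' "$entry"
            return 0
        fi
    done
    return 1
}

covered_by c88553a937ca37301f26773d67c42.pixel.ad pixel.ad roq.ad   # prints pixel.ad
covered_by notpixel.ad pixel.ad roq.ad || echo "not covered"       # prints not covered
```

The leading dot in the suffix test matters: without it, an unrelated name such as notpixel.ad would count as a subdomain of pixel.ad.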

I'm using this little script to check/remove subdomains for any given domain.tld:

#!/bin/bash
# ---
# remove subdomains when domain.tld is already given:
# ---
list=docs/lists/ads-and-tracking-extended.txt
# candidates: the last two labels of every matching subdomain entry
for domain in $(grep -P "[^.]+\.[a-zA-Z]{3}$|^.[^.]+\.[a-zA-Z]{2}\.[a-zA-Z]{2}$" \
  "$list" | rev | cut -d"." -f1-2 | rev | sort -u | grep -vE "^0 "); do
    echo "$domain"
    # do not change anything unless domain.tld itself is listed:
    grep -qxF "0.0.0.0 $domain" "$list" || continue
    # change: delete every line ending in ".domain.tld" (dots escaped for sed)
    sed -i "/\.${domain//./\\.}\$/d" "$list"
done

This is a very long-running task that is currently still running. As an example, it removes the following subdomains:

 0.0.0.0 pixel.ad
-0.0.0.0 centro.pixel.ad
-0.0.0.0 clickserv.pixel.ad
-0.0.0.0 preview.pixel.ad
-0.0.0.0 up.pixel.ad
-0.0.0.0 www.pixel.ad
 0.0.0.0 roq.ad
-0.0.0.0 partner.roq.ad
-0.0.0.0 test.roq.ad
-0.0.0.0 www.roq.ad
…
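
The loop above rescans the whole file once per domain, which is why it runs so long. The same kind of reduction can be sketched as two passes of a single awk program. This is only a sketch under stated assumptions: `dedupe_subdomains` is a hypothetical name, entries are assumed to have the form "0.0.0.0 name", and an entry is dropped when any listed parent covers it, not only a two-label domain.tld.

```shell
#!/bin/bash
# Sketch: drop every hosts entry whose parent domain is also listed.
# Reads a hosts-format file ("0.0.0.0 name") and writes the reduced list to stdout.
dedupe_subdomains() {
    awk '
        # pass 1: remember every listed hostname
        NR == FNR { if ($1 == "0.0.0.0") listed[$2] = 1; next }
        # pass 2: keep non-entry lines (comments, blanks) untouched
        $1 != "0.0.0.0" { print; next }
        {
            parent = $2
            # strip the leftmost label until a listed parent is found or none is left
            while ((i = index(parent, ".")) > 0) {
                parent = substr(parent, i + 1)
                if (parent in listed) next   # covered by a listed parent: drop the line
            }
            print
        }
    ' "$1" "$1"
}

# small demonstration on a temporary file
tmp=$(mktemp)
printf '%s\n' '0.0.0.0 pixel.ad' '0.0.0.0 www.pixel.ad' \
              '0.0.0.0 up.pixel.ad' '0.0.0.0 partner.roq.ad' > "$tmp"
dedupe_subdomains "$tmp"
rm -f "$tmp"
```

In the demonstration, partner.roq.ad is kept because roq.ad itself is not in the sample input. To apply the function to the real list, write the output to a new file first and move it into place afterwards, for example `dedupe_subdomains docs/lists/ads-and-tracking-extended.txt > list.new && mv list.new docs/lists/ads-and-tracking-extended.txt`.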

thomasmerz commented 1 year ago

@dnmTX why don't you like a small and performant list that doesn't lose anything? Can you explain, please?