hagezi / dns-blocklists

DNS-Blocklists: For a better internet - keep the internet clean!
GNU General Public License v3.0

Request to Carefully Look Through The Domains #362

Closed sr093906 closed 1 year ago

sr093906 commented 1 year ago

https://github.com/zakird/crux-top-lists

The dataset adheres as closely as possible to user-initiated pageloads (e.g., it excludes traffic from iframes).

So, please treat them as domains visited by real humans. Based on that assumption, quite a few can/should be treated as FPs.

The list is generated by downloading the repo's latest CSV file and stripping the http:// and https:// prefixes.

After that, entries seen in Fake, Threat Intelligence Feeds, DoH/VPN/TOR/Proxy Bypass (complete edition), Safesearch not supported, Dynamic DNS, Badware Hoster and Personal are removed.

Finally, common entries between the processed file and the raw domain version of ultimate blacklist are listed.
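The three steps above can be sketched roughly as follows. This is only an illustration, not the script actually used: the function names are hypothetical, the CSV is assumed to have an `origin` column (as the crux-top-lists files appear to), and all lists are assumed to be plain sets of hostnames compared by exact match.

```python
import csv
import io
import re

# Matches a leading http:// or https:// so it can be stripped from an origin.
_SCHEME = re.compile(r"^https?://", re.IGNORECASE)


def crux_hosts(csv_text):
    """Return the hostnames from a CrUX-style CSV (assumed 'origin' column)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        _SCHEME.sub("", row["origin"]).strip().lower()
        for row in reader
        if row.get("origin")
    }


def candidate_false_positives(hosts, exclusion_lists, blocklist):
    """Drop hosts found in any exclusion list, then keep those also blocked.

    exclusion_lists: iterable of sets (e.g. Fake, TIF, Dynamic DNS ...).
    blocklist: set of raw domains (e.g. the ultimate list).
    """
    excluded = set().union(*exclusion_lists)
    return sorted((hosts - excluded) & blocklist)
```

With real data, `csv_text` would be the decompressed CSV from the repo and each set would be loaded from the corresponding raw domain file. Subdomain or wildcard matching is deliberately left out, so the result is only a list of candidates to review by hand.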

There are some betting and porn sites there, of course. As for the others, some are clearly FPs, like those starting with blog., login, and others.

5.txt

hagezi commented 1 year ago

Skimming the list, I found some trackers, crappy hosts, ad domains, popup porn ads, subdomains of main domains blocked for good reason and co. Analyzing this list would take days to weeks. At the moment I don't have the time ...

durablenapkin commented 1 year ago

Interesting dataset - might be handy for TLD discovery but other than that it's almost impossible to analyze/test

sr093906 commented 1 year ago

Given scant interest and negative feedback, I will look through the list myself to find 'a few' FPs based on my mood. Stay tuned.

hagezi commented 1 year ago

Thanks for the help.

sr093906 commented 1 year ago

Please re-open it. I haven't finished it.

sr093906 commented 1 year ago

I haven't even finished those beginning with 'a'.

hagezi commented 1 year ago

Ok, sorry.

You can also post the domains in this issue, no need to open a new one for each. But, as you like ...

martijk commented 1 year ago

Some numbers for those interested:

| List | Domains in top 1M |
|---|---|
| oisd small  | 597 |
| oisd big  | 2481 |
| HaGeZi multi light  | 2200 |
| HaGeZi multi normal  | 3034 |
| HaGeZi multi pro | 3542 |
| HaGeZi multi pro++ | 4871 |
| HaGeZi multi ultimate | 6511 |

Of course being part of the top 1 million most visited websites doesn't mean that it's a legit domain, so be careful with jumping to conclusions.

I applaud your efforts, by the way. Maybe a small subset of this list can be used to check whether a list is fit for inclusion, same as what oisd does. For example, some legit looking webshops are loaded from the NoTracking list, which in turn got them from here, which with all due respect looks like a pretty obscure and not frequently updated list. That raises the question whether NoTracking has a strict enough inclusion policy (and in return HaGeZi as well).

hagezi commented 1 year ago

> I applaud your efforts, by the way. Maybe a small subset of this list can be used to check whether a list is fit for inclusion, same as what oisd does. For example, some legit looking webshops are loaded from the NoTracking list, which in turn got them from here, which with all due respect looks like a pretty obscure and not frequently updated list. That raises the question whether NoTracking has a strict enough inclusion policy (and in return HaGeZi as well).

Thanks for the advice, I'll see how I can get a handle on this.

@Notracking: How do you see that? I would think about removing the source mentioned.

hagezi commented 1 year ago

@sr093906 First of all, thanks for all your effort. Before you start the next wave, please wait, I'm just straightening a few things. When the build of the new lists is online I'll let you know. Then please continue testing against the new Ultimate. It will be ready in a few hours. Thanks ...

sr093906 commented 1 year ago

I hope the adjustments will save me much time.

hagezi commented 1 year ago

@sr093906 The "cleaned" Ultimate is online. You should find fewer now ...

sr093906 commented 1 year ago

@hagezi Thanks for the notification. I will continue the check later.

https://github.com/MISP/misp-warninglists/

Whitelist resources. Maybe some lists will be helpful.

hagezi commented 1 year ago

@sr093906 I've done more cleanup, the build is running now and will be through in a few hours. I'll let you know ...

hagezi commented 1 year ago

@sr093906 Update is live, cleaned pro to ultimate.

FYI:

Toplists: https://github.com/hagezi/dns-data-collection/tree/main/top

toplist.txt - Umbrella
toplist.tranco.txt - Tranco
toplist.chrome.txt - Chrome

sr093906 commented 1 year ago

Thanks for letting me know. I will check.

hagezi commented 1 year ago

@sr093906 STOP posting potential phishing domains for whitelisting. Check the phishing sources and report them there. If they are removed from the phishing lists, they disappear from my lists too!

Thanks, Gerd

hagezi commented 1 year ago

@sr093906 Please spare me these crap sites from the lower ranks of the Chrome toplist. For my TIF I use the Umbrella/Tranco toplists as a whitelist, so if a host you reported is blocked by my TIF, it is not on either toplist. Report them upstream if you think they are false positives.

Thanks, Gerd
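The whitelist logic described above can be illustrated with a small sketch. It assumes exact-match comparison on plain hostnames; the thread does not say how the real build matches entries, and the function names here are hypothetical.

```python
def build_tif(feed_domains, umbrella, tranco):
    """Drop every feed entry that appears on either toplist (exact match)."""
    whitelist = set(umbrella) | set(tranco)
    return {d for d in feed_domains if d not in whitelist}


def still_blocked(reported, tif):
    """Return the reported hosts that the filtered feed still blocks."""
    return sorted(set(reported) & tif)
```

Under these assumptions, any host returned by `still_blocked` cannot be on the Umbrella or Tranco toplist, which is why a rank on the Chrome toplist alone does not get it removed.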

hagezi commented 1 year ago

I have now spent hours on these issues. I cleaned up the lists using the Chrome Toplist. Everything that was safe to remove was removed.

Done.

notracking commented 1 year ago

> > I applaud your efforts, by the way. Maybe a small subset of this list can be used to check whether a list is fit for inclusion, same as what oisd does. For example, some legit looking webshops are loaded from the NoTracking list, which in turn got them from here, which with all due respect looks like a pretty obscure and not frequently updated list. That raises the question whether NoTracking has a strict enough inclusion policy (and in return HaGeZi as well).
>
> Thanks for the advice, I'll see how I can get a handle on this.
>
> @notracking: How do you see that? I would think about removing the source mentioned.

Well, Stonecrushers list is basically a scraped version of:

https://www.watchlist-internet.at/unserioese-webseiten/
https://www.watchlist-internet.at/about-us/
https://www.oiat.at/

I will remove/disable it, though, because it should have (at least) excluded their "Problematische Online-Shops" ("problematic online shops") list, which mostly contains shops with bad service (based on user reports).

hagezi commented 1 year ago

Thanks!