Closed quenhus closed 2 years ago
Here come big tables!
(It seems I made a mistake with the DDG and Startpage search URLs)
github_copycats.txt
npm_copycats.txt
| domain | has_ip | Google site: | DDG site: | Startpage site: |
|---|---|---|---|---|
| ://npmmirror.com/ | | Search | Search | Search |
| ://cnpmjs.org/ | | Search | Search | Search |
| ://npm.io/ | | Search | Search | Search |
stackoverflow_copycats.txt
wikipedia_copycats.txt
Good idea! I'm truly surprised so many of them are already down. A sign that this business is not viable and that these will taper off? :crossed_fingers: I first thought about checking the domain expiration date, but we can probably expect a lot of them to fall into the hands of domain-parking scum after they expire.
Checking a few of them, they are still listed in Google, so filtering them is even more useful: these are even lower-value results now that the links are dead, but they might stay in the index for several months. Some articles say broken links can stay in the Google search index for up to three months, and other engines might have an even longer retention period.
Ideally, we would monitor the number of hits for these searches and remove the filter once the count drops below a certain threshold (10 results, to account for parking?). The Google search API has a free tier of 100 searches/day, so we could get one data point a week for now, fewer as the corpus grows.
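A minimal sketch of that monitoring loop, assuming the Google Custom Search JSON API (the `key` and `cx` parameters are a hypothetical API key and search-engine id you'd have to create yourself); the 10-result threshold is the parking allowance suggested above:

```python
# Sketch only: query Google's Custom Search JSON API for `site:domain` hit
# counts and decide when a filter entry can be dropped. API key / cx values
# are assumptions; the threshold accounts for lingering parking pages.
import json
import urllib.parse
import urllib.request

PARKING_THRESHOLD = 10  # at or below this, treat remaining hits as parking noise

def hit_count(domain: str, api_key: str, cx: str) -> int:
    """Return the estimated number of Google results for `site:domain`."""
    params = urllib.parse.urlencode(
        {"key": api_key, "cx": cx, "q": f"site:{domain}"}
    )
    url = f"https://www.googleapis.com/customsearch/v1?{params}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return int(data["searchInformation"]["totalResults"])

def can_remove_filter(total_results: int, threshold: int = PARKING_THRESHOLD) -> bool:
    """A domain's filter can be dropped once its index footprint shrinks enough."""
    return total_results <= threshold
```

With the free tier's 100 searches/day, a weekly sweep of the current corpus should stay well within quota.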
I've seen some people using https://github.com/funilrys/PyFunceble to filter out NXDOMAIN entries from blacklists.
However it would be great to use a Google "site:DOMAIN linux" to determine whether each domain is still used as a mirror.
I've noticed that many of the clones or their mirrors have similar structure, at least visually. It's not improbable that they have some code in common, at least between clones from the same author. One idea would be to use the search engine to query for mirrors, extract the links and pass these to a program that fetches the page and tests it against some pattern (maybe with xpaths). The program could disguise itself as one of the common crawlers to prevent it from being detected and blocked early on.
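The fetch-and-fingerprint idea above could look something like this sketch; the Googlebot User-Agent string and the fingerprint patterns are illustrative assumptions (real fingerprints would be XPaths or markup snippets actually shared by clones from the same author):

```python
# Sketch only: fetch a candidate page while presenting a common-crawler
# User-Agent, then test the HTML against fingerprints of known clone code.
import re
import urllib.request

# A plausible Googlebot UA string (assumption; clones rarely block crawlers).
CRAWLER_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

# Hypothetical fingerprints: regexes matching markup common to known clones.
MIRROR_PATTERNS = [
    re.compile(r"<meta[^>]+generator[^>]+mirror", re.I),
    re.compile(r'class="so-answer-clone"', re.I),  # made-up class name
]

def fetch(url: str) -> str:
    """Fetch a page, disguised as a common crawler to avoid early blocking."""
    req = urllib.request.Request(url, headers={"User-Agent": CRAWLER_UA})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def looks_like_mirror(html: str) -> bool:
    """True if the page matches any known clone fingerprint."""
    return any(p.search(html) for p in MIRROR_PATTERNS)
```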
Another idea could be to perform some mathemagics/statistics on the search results, in a similar way to email antispam filters. If at least two different hosts seem to "have very similar content"/correlate in some way, generate a report that can be reviewed later on by a maintainer. (Fully automated filters are too drastic considering that false positives will exist.)
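One simple way to score that "very similar content" correlation is word-shingle Jaccard similarity; a sketch, where the 0.6 review threshold is an arbitrary starting point rather than a tuned value:

```python
# Sketch only: compare two fetched pages via k-word shingles and Jaccard
# similarity, flagging suspicious pairs for manual review by a maintainer.
import re

def shingles(text: str, k: int = 4) -> set:
    """Set of k-word shingles built from the page's words, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i : i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard index of two shingle sets (0.0 for two empty sets)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def flag_for_review(page1: str, page2: str, threshold: float = 0.6) -> bool:
    """True when two hosts look similar enough to warrant a report."""
    return jaccard(shingles(page1), shingles(page2)) >= threshold
```

Keeping the output as a report rather than an automatic block matches the concern about false positives.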
To help me remove dead websites and review big lists of domains, I made a Tampermonkey UserScript helper.
Here is the documentation: https://github.com/quenhus/uBlock-Origin-dev-filter/wiki/Helper-to-Review-a-List-of-Domains
Example of the UI:
I don't want to keep useless block rules in the filter, so I created a tool to help detect domains that are down (
/src/clean_data/main.py
). If anyone wants to help me with that :D
I think we can remove domains without A/AAAA DNS response. However it would be great to use a Google
"site:DOMAIN linux"
to determine whether each domain is still used as a mirror. Don't forget to disable uBlock-Origin-dev-filter while doing so, otherwise you will only get empty responses.
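The DNS half of that cleanup can be sketched with the standard library alone (no PyFunceble needed for the simple case): `socket.getaddrinfo` consults the system resolver, so a domain with no A/AAAA answer surfaces as `socket.gaierror`.

```python
# Sketch only: keep just the domains that still resolve to an A/AAAA record;
# the rest become candidates for removal from the filter lists.
import socket

def has_address(domain: str) -> bool:
    """True if the domain resolves to at least one A or AAAA record."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False

def prune_dead(domains: list) -> list:
    """Drop domains with no A/AAAA response."""
    return [d for d in domains if has_address(d)]
```

The follow-up `site:DOMAIN linux` check would still be manual (or via the search API), since a resolving domain may be parked rather than mirroring.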