hagezi / dns-blocklists

DNS-Blocklists: For a better internet - keep the internet clean!
GNU General Public License v3.0

Fake Stores #926

Closed jarelllama closed 1 year ago

jarelllama commented 1 year ago

Which domain(s) should be blocked?

aldofashion.com amcclothes.com bagsdeuter.com bikehotsale.com binggrondahlshop.com bogsboot.com campingsurfshop.com cebesale.com clearance-bike.com clearanceusmen.com coffeeteaware.com cooeedesignshop.com cycle100percent.com
dbkdsale.com
dcsnowboard.com
discounthoneywell.com
discountskirts.com dolomiteoutlet.com dreamgreenshoe.com dtswissbike.com everythingkitchens.com fashionadid.com fashionshawaii.com femalecozy.com forksbike.com home-arabia.com homeclassiccollection.com kaemingkchristmas.com kaemingkdecor.com keltysale.com kleankanteenbottle.com kohlerofficial.com limitalfresco.com lovegoldsale.com mamapapaclothing.com modernusfemale.com nbdiscount.com newlawngarden.com newoutdoorsale.com ocycling.com officialskis.com officialyeti.com onlinekohler.com outdoorscarpa.com outdoorwintersports.com outlet-alfresco.com perlatoshoe.com pictureoutdoor.com popmenaccessories.com promoalfresco.com rossignol-ski.com salealfresco.com saleberghaus.com salejimshore.com salejomercer.com saleplaymobil.com saleprotest.com salesnowgum.com saleussports.com scarpaoutlet.com serengetidiscount.com shopjeanswest.com shopraidlight.com shopyourturn.com showapparels.com skileki.com smithskigear.com snowroxy.com soreldiscount.com sportsbrooks.com store-junior.com store-swatch.com storeadid.com storebyon.com storeskiwear.com telemarktalk.com thecycleshoes.com themountainus.com themountainwarehouse.com theoutdoorsgear.com thesignaturehardware.com thewesternshoes.com tnfbackpacks.com toolmartin.com turnnetwor.com ukoutdoordeal.com usakidsclothes.com usburtonsnowboard.com usglacierbay.com usnewoutdoor.com usnewoutdoors.com usplussports.com ussportabout.com ussportpioneer.com volcomofficial.com waresusmiss.com westernhats.net

Why should the domain(s) be blocked?

These are fake stores I gathered using a ChatGPT script I've been working on. The script takes a search term entered by the user and searches Google for sites that contain that exact phrase.

Search term used: committed to the progress of sustainability within our business practices. With consistent evaluation and attention, we are proud to have brought on changes that have an emphasis on renewable energy, recycling, and beyond.

The script also removes dead domains, any sites with 'scam' in the name (like scamwatcher.com), and any google.com or reddit.com domains.

I also manually went through the Google Search page to remove any false positives (surprisingly there weren't any).

@durablenapkin

jarelllama commented 1 year ago

This is the current script if anyone wants to have a look:

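#!/bin/bash

# Ask for the phrase to search for.
read -r -p "Enter a search query: " og_query

# Wrap the phrase in quotes for an exact-match search and replace spaces with '+'.
query="\"$og_query\""
query=$(echo "$query" | sed 's/ /+/g')
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
num_results=100

search_url="https://www.google.com/search?q=${query}&num=${num_results}&filter=0"

# Take the hostname of every link, strip a leading 'www.', de-duplicate,
# then drop 'scam' sites and Google/Reddit domains.
search_results=$(curl -s -A "$user_agent" "$search_url" | grep -o '<a href="[^"]*"' | sed 's/^<a href="//' | sed 's/"$//' | awk -F/ '{print $3}' | sed 's/^www\.//' | sort -u | grep -v -i 'scam' | grep -v -E '^google\.com$|^.*\.google\.com$' | grep -v -E '^reddit\.com$|^.*\.reddit\.com$')

# Print only domains that still resolve. 'dig +short' prints nothing for a dead
# domain, so test for empty output (grep '^$' never matches empty input).
for domain in $search_results; do
    if [[ -z "$(dig +short "$domain")" ]]; then
        continue
    fi
    echo "$domain"
done

echo "Search term used: $og_query"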
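
The extraction and filtering stages of the script can be tried offline, without querying Google. The following is a minimal sketch under that assumption: `extract_domains` is a hypothetical helper name, the sample HTML is hand-written, and the `.test` domains are made up. It only looks at absolute `http(s)` links, which is a small simplification of the original pipeline.

```shell
#!/bin/bash

# Hypothetical helper: read HTML on stdin and print unique candidate domains.
extract_domains() {
    grep -o '<a href="https\{0,1\}://[^"]*"' |        # absolute links only
        sed -e 's/^<a href="//' -e 's/"$//' |         # keep just the URL
        awk -F/ '{print $3}' |                        # third '/'-separated field is the host
        sed 's/^www\.//' |                            # normalise www.example.com to example.com
        grep -v -i 'scam' |                           # drop scam-report sites
        grep -v -E '(^|\.)(google|reddit)\.com$' |    # drop Google and Reddit hosts
        sort -u
}

# Offline check with made-up sample markup instead of a live results page.
extract_domains <<'EOF'
<a href="https://www.example-shop.test/item"> <a href="https://support.google.com/x">
<a href="https://www.reddit.com/r/Scams"> <a href="https://scamwatcher.test/report">
<a href="/search?q=next">
EOF
```

Only `example-shop.test` survives the filters in this sample. Running `sort -u` last means that `www.` and non-`www.` variants of the same host collapse into one entry.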

jarelllama commented 1 year ago

@hagezi if you have any feedback I'd love to know

hagezi commented 1 year ago

Thank you. I think that's a great idea! First of all, if I were you, I would create a repository and make the lists available there, so I could add them as a source and others can use them too.

When dead-checking via dig, I would check for NXDOMAIN and query an unfiltered DNS resolver:

if dig @1.1.1.1 "$domain" | grep -q 'NXDOMAIN'; then
    # DEAD
    ...
else
    # OK
    ...
fi
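
A slightly stricter variant of the check above reads the `status:` field from dig's header line instead of grepping the whole output. This is only a sketch: `dig_status` and `is_dead` are hypothetical helper names, and it assumes dig's usual header format (`status: NXDOMAIN,`). The point is that a domain with no A record (status NOERROR, empty answer) or a resolver failure (SERVFAIL) is not treated as dead.

```shell
#!/bin/bash

# Hypothetical helper: print the status field (NOERROR, NXDOMAIN, SERVFAIL, ...)
# from full dig output supplied on stdin.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Hypothetical wrapper: succeed only when the unfiltered resolver reports NXDOMAIN.
is_dead() {
    [ "$(dig @1.1.1.1 "$1" | dig_status)" = "NXDOMAIN" ]
}
```

In the loop from the original script this could be used as `is_dead "$domain" && continue`.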

Deciding whether it is a real fake shop or just a shop with bad reviews is difficult to automate. I don't know what ChatGPT considers a fake shop and what not.

You could check against the Umbrella toplist. If the shop is on it, the probability that it is a real fake shop is lower, but that is no guarantee. I update the Umbrella toplist daily; you can find it here: https://raw.githubusercontent.com/hagezi/dns-data-collection/main/top/toplist.txt

hagezi commented 1 year ago

Checked your list against the Umbrella toplist:

grep -F -w -f /media/nas/git/dns-data-collection/top/toplist.txt /media/nas/tmp/fake.txt

Found these shops on the toplist:

everythingkitchens.com
outlet-alfresco.com
store-swatch.com

Not fake: https://www.trustpilot.com/review/www.everythingkitchens.com

Not clear:

outlet-alfresco.com
store-swatch.com
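
One thing to keep in mind with the grep call above: `-w` matches on word boundaries, and a hyphen counts as a boundary, so a toplist entry such as swatch.com would also match store-swatch.com. Whether that happened here is not known. The sketch below uses made-up sample files and a hypothetical helper name, `on_toplist`, to show the difference between `-w` and the whole-line option `-x`.

```shell
#!/bin/bash

# Hypothetical helper: print candidates that appear verbatim on the toplist.
# $1 = toplist file, $2 = candidates file (one domain per line each)
on_toplist() {
    grep -F -x -f "$1" "$2"
}

# Made-up sample data standing in for the real toplist and candidate list.
tmp=$(mktemp -d)
printf '%s\n' swatch.com everythingkitchens.com > "$tmp/toplist.txt"
printf '%s\n' store-swatch.com everythingkitchens.com bikehotsale.com > "$tmp/fake.txt"

echo "word match (-w):"
grep -F -w -f "$tmp/toplist.txt" "$tmp/fake.txt"

echo "whole-line match (-x):"
on_toplist "$tmp/toplist.txt" "$tmp/fake.txt"

rm -r "$tmp"
```

With this sample data, the `-w` run lists both store-swatch.com and everythingkitchens.com, while the `-x` run lists only everythingkitchens.com.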

jarelllama commented 1 year ago

Thanks for the insight. I might consider making a blocklist but I probably won't update it daily since I have to do the updates and false positive checking manually in my free time.

The problem with automatic scripts like this is that some sites like scam-detector and Reddit may contain the same search terms (from reported scams or user posts). So legitimate domains may get picked up by the script. Even after using the script I intend to manually search the search term on Google to compare Google's search page with the list generated by my script. Still takes quite a bit of effort on my part to scroll through 40+ search results to spot false positives.

@durablenapkin might have better luck implementing a better approach at searching for sites with typical scam templates. I'm a novice when it comes to Bash.
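
One way to reduce the manual work for sites like Reddit or scam-report pages is to keep an allowlist file and drop any candidate that equals, or is a subdomain of, an entry in it. This is a sketch of that idea, not something the script in this thread does: `drop_allowlisted` and both file arguments are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: drop candidates equal to, or subdomains of, allowlist entries.
# $1 = allowlist file, $2 = candidates file (one domain per line each)
drop_allowlisted() {
    awk '
        FILENAME == ARGV[1] { allow[$0] = 1; next }   # first file: load the allowlist
        {
            host = $0
            keep = 1
            # Walk up the labels: a.b.example.com -> b.example.com -> example.com
            while (host != "") {
                if (host in allow) { keep = 0; break }
                if (index(host, ".") == 0) break
                host = substr(host, index(host, ".") + 1)
            }
            if (keep) print $0
        }
    ' "$1" "$2"
}
```

For example, with reddit.com on the allowlist, old.reddit.com is dropped while notreddit.com is kept, because matching happens on whole labels and not on substrings.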

durablenapkin commented 1 year ago

Thanks, queued - I'll see if I can look into some sort of auto-discovery of scam sites when I have more time!

jarelllama commented 1 year ago

@durablenapkin I'll be updating my script in my repo: https://github.com/jarelllama/Scam-Blocklist. I'm still working on it, so it's not quite production-ready yet.

hagezi commented 1 year ago

Nice, let me know if I can add the source.

jarelllama commented 1 year ago

The list is up: ~https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/domains~

EDIT: New domains list: https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/domains.txt

ABP list: https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/adblock.txt

See https://github.com/jarelllama/Scam-Blocklist/issues/147 for more info.

END OF EDIT

My current process is:

All code is on my repo: https://github.com/jarelllama/Scam-Blocklist

Thanks for the help @hagezi

hagezi commented 1 year ago

Perfect, I will add the source, it will be included in every list version. Thanks for your work.

hagezi commented 1 year ago

Certainly also useful for other maintainers: @sjhgvr @notracking @badmojr @stevenblack @bongochong @alex-302 (AdGuard) @bigdargon

GitHub: https://github.com/jarelllama/Scam-Blocklist
List: https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/domains

@jarelllama Thank you for your tireless efforts here.

hagezi commented 1 year ago

Included in all lists.

bongochong commented 1 year ago

I'll definitely consider this. Thank you @hagezi.

jarelllama commented 1 year ago

@hagezi just wanted to say thanks for the inspiration for creating this project. Currently there are 1000+ scam sites found (dead ones yet to be filtered). Amazing how many new scam sites come out each day using the same templates.

Do you have any further recommendations besides comparing against the toplist?

hagezi commented 1 year ago

@jarelllama Very good work! I have no more recommendations at the moment.

bongochong commented 1 year ago

After taking some time to dig through this, the list looks to be very useful in its own right, and would make a fine additional source for any compiled / aggregate list maintainers who seek to mitigate scams too. Thank you @jarelllama for the great work. I hope you don't mind if I start to integrate this into my compiled lists in a future update (it will of course be credited in my ever expanding readme files). Thank you again @hagezi for notifying other list maintainers of this project as well.

jarelllama commented 1 year ago

Thanks for the kind words @bongochong ! I will continue to update the code for better extraction of domains and filtering. I'm making changes to the code daily at this point. Hopefully soon the project will be as perfect as I want it to be.

notracking commented 1 year ago

Thanks @jarelllama I like this concept a lot, notracking is subscribed!

Good luck maintaining it ;)