jarelllama closed this 1 year ago
This is the current script if anyone wants to have a look:
#!/bin/bash
# -r keeps backslashes in the query literal
read -r -p "Enter a search query: " og_query
query="\"$og_query\""
query=$(echo "$query" | sed 's/ /+/g')
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
num_results=100
search_url="https://www.google.com/search?q=${query}&num=${num_results}&filter=0"
search_results=$(curl -s -A "$user_agent" "$search_url" | grep -o '<a href="[^"]*"' | sed 's/^<a href="//' | sed 's/"$//' | awk -F/ '{print $3}' | sort -u | sed 's/^www\.//' | grep -v -i 'scam' | grep -v -E '^google\.com$|^.*\.google\.com$' | grep -v -E '^reddit\.com$|^.*\.reddit\.com$')
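for domain in $search_results; do
    # Skip dead domains. dig +short prints nothing when there are no records,
    # so test for empty output (grep -q '^$' never matches empty input).
    if [ -z "$(dig +short "$domain")" ]; then
        continue
    fi
    echo "$domain"
done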
echo "Search term used: $og_query"
@hagezi if you have any feedback I'd love to know
Thank you. I think that's a great idea! First of all, if I were you, I would create a repository and make the lists available there, so I could add them as a source and others can use them too.
When dead-checking via dig, I would check for NXDOMAIN and query an unfiltered DNS resolver:
if dig @1.1.1.1 "$domain" | grep -q 'NXDOMAIN'; then
    # DEAD
    ...
else
    # OK
    ...
fi
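A minimal sketch of how the NXDOMAIN check could be wrapped into reusable functions, with the parsing split out so it can be tried without network access. The function names `is_nxdomain`, `is_dead` and `filter_alive` are hypothetical and not from the script above; the resolver 1.1.1.1 is the one suggested in this comment.

```shell
#!/bin/bash

# Hypothetical helper: succeeds if the dig output read from stdin
# reports "status: NXDOMAIN" in its header line.
is_nxdomain() {
    grep -q 'status: NXDOMAIN'
}

# Hypothetical wrapper: queries an unfiltered resolver (1.1.1.1).
# stdin is redirected so dig cannot consume the caller's input.
is_dead() {
    dig @1.1.1.1 "$1" </dev/null | is_nxdomain
}

# Reads one domain per line on stdin and prints those not reported as NXDOMAIN.
filter_alive() {
    local domain
    while read -r domain; do
        [ -n "$domain" ] || continue
        is_dead "$domain" || echo "$domain"
    done
}
```

Usage would be something like `filter_alive < domains.txt`. Note that with this sketch a timeout or resolver error counts as "alive", since only an explicit NXDOMAIN marks a domain as dead.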
Deciding whether it is a real fake shop or just a shop with bad reviews is difficult to automate. I don't know what ChatGPT considers a fake shop and what not.
You could check against the Umbrella toplist, if the shop is there the probability that it is a real fake shop is lower, but no guarantee. I update the Umbrella toplist daily, you can find it here: https://raw.githubusercontent.com/hagezi/dns-data-collection/main/top/toplist.txt
Checked your list against the Umbrella toplist:
grep -F -w -f /media/nas/git/dns-data-collection/top/toplist.txt /media/nas/tmp/fake.txt
Found these shops on the toplist:
everythingkitchens.com
outlet-alfresco.com
store-swatch.com
Not fake: https://www.trustpilot.com/review/www.everythingkitchens.com
Not clear:
outlet-alfresco.com
store-swatch.com
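A possible variation on the toplist check, as a sketch only. With `grep -w`, a toplist entry can match inside a longer hyphenated name or subdomain (a hypothetical entry `swatch.com` would match `store-swatch.com`, because `-` is not a word character). I don't know whether that is what happened with the two unclear domains above. Using `-x` instead only reports exact whole-line matches. The function names and file arguments are assumptions for illustration.

```shell
#!/bin/bash

# Hypothetical helper: print candidates that appear verbatim in the toplist.
# $1 = toplist file (one domain per line), $2 = candidate file.
on_toplist() {
    grep -F -x -f "$1" "$2"
}

# Hypothetical helper: print candidates that are NOT in the toplist.
not_on_toplist() {
    grep -F -x -v -f "$1" "$2"
}
```

For example, `on_toplist toplist.txt fake.txt` would list the domains worth a manual look before blocking.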
Thanks for the insight. I might consider making a blocklist but I probably won't update it daily since I have to do the updates and false positive checking manually in my free time.
The problem with automatic scripts like this is that some sites like scam-detector and Reddit may contain the same search terms (from reported scams or user posts). So legitimate domains may get picked up by the script. Even after using the script I intend to manually search the search term on Google to compare Google's search page with the list generated by my script. Still takes quite a bit of effort on my part to scroll through 40+ search results to spot false positives.
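One way to cut down on those false positives might be an exclusion file of known reporting or discussion sites, applied before the manual review. This is a rough sketch under my own assumptions: the function name `drop_excluded` and the exclusion file are hypothetical, and the entries you put in it (for example `reddit.com`) are up to you.

```shell
#!/bin/bash

# Hypothetical helper: reads candidate domains on stdin and drops any that
# equal, or are a subdomain of, an entry in the exclusion file given as $1.
drop_excluded() {
    awk -v list="$1" '
        BEGIN {
            # Load the exclusion list (one domain per line, blank lines ignored).
            while ((getline line < list) > 0)
                if (line != "") skip[line] = 1
        }
        {
            d = $0
            keep = 1
            # Walk up the labels: a.b.example.com -> b.example.com -> example.com -> com
            while (d != "") {
                if (d in skip) { keep = 0; break }
                i = index(d, ".")
                if (i == 0) break
                d = substr(d, i + 1)
            }
            if (keep) print
        }'
}
```

Usage would be something like `drop_excluded exclusions.txt < candidates.txt`. It matches on label boundaries, so an entry `reddit.com` drops `old.reddit.com` but keeps `notreddit.com`.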
@durablenapkin might have better luck implementing a better approach at searching for sites with typical scam templates. I'm a novice when it comes to Bash.
Thanks, queued - I'll see if I can look into some sort of auto-discovery of scam sites when I have more time!
@durablenapkin I'll be updating my script in my repo https://github.com/jarelllama/Scam-Blocklist I'm still working on it so it's not quite production ready yet.
Nice, let me know if I can add the source.
The list is up: ~https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/domains~
EDIT: New domains list: https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/domains.txt
ABP list: https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/adblock.txt
See https://github.com/jarelllama/Scam-Blocklist/issues/147 for more info.
END OF EDIT
My current process is:
All code is on my repo: https://github.com/jarelllama/Scam-Blocklist
Thanks for the help @hagezi
Perfect, I will add the source, it will be included in every list version. Thanks for your work.
For other maintainers certainly also useful: @sjhgvr @notracking @badmojr @stevenblack @bongochong @alex-302 (AdGuard) @bigdargon
Github: https://github.com/jarelllama/Scam-Blocklist List: https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/domains
@jarelllama Thank you for your tireless efforts here.
Included in all lists.
I'll definitely consider this. Thank you @hagezi.
@hagezi just wanted to say thanks for the inspiration for creating this project. Currently there are 1000+ scam sites found (dead domains have yet to be filtered out). Amazing how many new scam sites come out each day using the same templates.
Do you have any further recommendations besides comparing against the toplist?
@jarelllama Very good work! I have no more recommendations at the moment.
After taking some time to dig through this, the list looks to be very useful in its own right, and would make a fine additional source for any compiled / aggregate list maintainers who seek to mitigate scams too. Thank you @jarelllama for the great work. I hope you don't mind if I start to integrate this into my compiled lists in a future update (it will of course be credited in my ever expanding readme files). Thank you again @hagezi for notifying other list maintainers of this project as well.
Thanks for the kind words @bongochong ! I will continue to update the code for better extraction of domains and filtering. I'm making changes to the code daily at this point. Hopefully soon the project will be as perfect as I want it to be.
Thanks @jarelllama I like this concept a lot, notracking is subscribed!
Good luck maintaining it ;)
Which domain(s) should be blocked?
aldofashion.com amcclothes.com bagsdeuter.com bikehotsale.com binggrondahlshop.com bogsboot.com campingsurfshop.com cebesale.com clearance-bike.com clearanceusmen.com coffeeteaware.com cooeedesignshop.com cycle100percent.com
dbkdsale.com dcsnowboard.com discounthoneywell.com
discountskirts.com dolomiteoutlet.com dreamgreenshoe.com dtswissbike.com everythingkitchens.com fashionadid.com fashionshawaii.com femalecozy.com forksbike.com home-arabia.com homeclassiccollection.com kaemingkchristmas.com kaemingkdecor.com keltysale.com kleankanteenbottle.com kohlerofficial.com limitalfresco.com lovegoldsale.com mamapapaclothing.com modernusfemale.com nbdiscount.com newlawngarden.com newoutdoorsale.com ocycling.com officialskis.com officialyeti.com onlinekohler.com outdoorscarpa.com outdoorwintersports.com outlet-alfresco.com perlatoshoe.com pictureoutdoor.com popmenaccessories.com promoalfresco.com rossignol-ski.com salealfresco.com saleberghaus.com salejimshore.com salejomercer.com saleplaymobil.com saleprotest.com salesnowgum.com saleussports.com scarpaoutlet.com serengetidiscount.com shopjeanswest.com shopraidlight.com shopyourturn.com showapparels.com skileki.com smithskigear.com snowroxy.com soreldiscount.com sportsbrooks.com store-junior.com store-swatch.com storeadid.com storebyon.com storeskiwear.com telemarktalk.com thecycleshoes.com themountainus.com themountainwarehouse.com theoutdoorsgear.com thesignaturehardware.com thewesternshoes.com tnfbackpacks.com toolmartin.com turnnetwor.com ukoutdoordeal.com usakidsclothes.com usburtonsnowboard.com usglacierbay.com usnewoutdoor.com usnewoutdoors.com usplussports.com ussportabout.com ussportpioneer.com volcomofficial.com waresusmiss.com westernhats.net
Why should the domain(s) be blocked?
Fake stores I gathered using a ChatGPT script I've been working on. The script uses a search term inputted by the user and searches Google for sites with the exact search term.
The script also removes dead domains and any sites with 'scam' in the name like scamwatcher.com or any google.com and reddit.com domains.
I also manually went through the Google Search page to remove any false positives (surprisingly there weren't any).
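For anyone curious, the extraction and filtering steps described above can be sketched as a function that reads HTML on stdin, which makes it possible to try on a saved results page without hitting Google. The name `extract_domains` is hypothetical. Unlike the posted script, this sketch strips `www.` before de-duplicating and skips relative links.

```shell
#!/bin/bash

# Hypothetical helper: reads HTML on stdin, prints unique candidate domains.
extract_domains() {
    grep -o '<a href="[^"]*"' |                 # pull out each link
        sed -e 's/^<a href="//' -e 's/"$//' |   # keep only the URL
        awk -F/ '$3 != "" { print $3 }' |       # host part; skips relative links
        sed 's/^www\.//' |                      # normalise www. prefix
        sort -u |                               # de-duplicate after normalising
        grep -v -i 'scam' |                     # drop scam-reporting sites by name
        grep -v -E '(^|\.)(google|reddit)\.com$'  # drop google.com and reddit.com hosts
}
```

Something like `curl -s -A "$user_agent" "$search_url" | extract_domains` would slot it into the existing script, though Google's markup may differ from what this pattern expects.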
@durablenapkin