searchmysite / searchmysite.net

searchmysite.net is an open source search engine and search as a service
GNU Affero General Public License v3.0

Web: Almost all searches are automated SEO searches, impacting running costs, so trying to block these #55

Closed m-i-l closed 2 years ago

m-i-l commented 2 years ago

The analytics solution shows there are currently around 10 visitors a day performing around 4 searches a day.

However, the logs show the server is getting several search requests every second, totalling 100s of 1000s of searches every day. Almost all of these have no referrer set, so it isn't clear where they're coming from. They're generally from different IP addresses with different user agents, suggesting that they may be real users rather than bots.

My guess is that one or more other web sites are presenting a search box, getting the results from searchmysite.net, and displaying the results on their own site. This isn't necessarily a problem if the site(s) are legitimate and make appropriate credits and the users are actually making use of the results, but right now I don't know if that is the case, so I'm going to try to block this to try to find out more information.

Simple solution is to have the search page check for a referrer and display a message if not present. If that is circumvented, e.g. via a dummy referrer, I'll need to investigate some Cross Site Request Forgery style protection.

Suggested message is "Please visit https://searchmysite.net/ to search searchmysite.net".
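
A minimal sketch of the referrer check described above, written as a plain Python function rather than the project's actual Flask view (which isn't shown in this thread). The function name `check_search_request` and the optional `allowed_hosts` parameter are hypothetical; the source only says the page checks whether a referrer is present.

```python
from urllib.parse import urlsplit

BLOCK_MESSAGE = "Please visit https://searchmysite.net/ to search searchmysite.net"


def check_search_request(referrer, allowed_hosts=None):
    """Return (allowed, message) for a search request.

    `referrer` is the raw Referer header value, or None if it is absent.
    `allowed_hosts` is an optional, stricter variant (an assumption, not
    part of the original proposal): when given, the referrer's hostname
    must be in that set.
    """
    if not referrer:
        # No referrer at all: show the message instead of searching.
        return False, BLOCK_MESSAGE
    if allowed_hosts is not None:
        host = urlsplit(referrer).hostname
        if host not in allowed_hosts:
            return False, BLOCK_MESSAGE
    return True, ""
```

A presence-only check like this can be defeated by sending any dummy referrer, which is why the stricter host comparison is included as an option.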

m-i-l commented 2 years ago

This has now been implemented. I'm now seeing almost all traffic (99.999%?) to searchmysite.net respond with "Please visit https://searchmysite.net/ to search searchmysite.net". The way it is implemented the search isn't being performed when that response is returned so it is taking load off the server, but the extremely high level of queries will still be putting load on the server.

So now need to see if whatever is doing this either (i) stops, (ii) contacts me so it can be handled properly (assuming it is legitimate), or (iii) implements a workaround and continues (in which case I'll reopen and implement a CSRF solution).

m-i-l commented 2 years ago

More than 24 hours after this change and the requests are still coming in thick and fast.

Unfortunately after moving to the cheaper hosting provider I don't get the same sort of monitoring, and the analytics solution (understandably) only reports on real users, so it isn't easy to give nice illustrations of what is going on.

But as a rough idea, there were 5 real searches yesterday, and 77,167 of the problem searches, so over 99.99% of the search traffic is problematic.
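
The percentage can be checked with a couple of lines of Python, assuming the 77,167 problem searches are counted separately from the 5 real ones:

```python
real_searches = 5
problem_searches = 77_167  # assumed not to include the real searches

total = real_searches + problem_searches
problem_share = problem_searches / total

print(f"{problem_share:.4%} of search traffic is problematic")
```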

Looking at the search terms, most/all look like random snippets of text scraped off web pages, largely adult and gambling sites, e.g. "Powered By Tube Ace Tube Script" and "powered by pMachine bk8 Notify me when someone replies to this post", rather than actual queries a real person would have entered into a search box, so I'm starting to wonder if this is some kind of DDoS attack.

Anyway, now blocking at the nginx reverse proxy level to take some more of the pressure off Flask and docker:

        location /search/random/ {
                proxy_pass http://127.0.0.1:8080;
        }
        location /search {
                valid_referers server_names *.searchmysite.net;
                if ($invalid_referer) {
                        return 403;
                }
                proxy_pass http://127.0.0.1:8080;
        }

alcinnz commented 2 years ago

To me those searches look like spammers, probably trying to influence the autocompletion I don't believe you have.

I occasionally get spammers on my own personal sites submitting links through the tastiest form they can find. Doesn't take much to scare mine away, but seems you're having a harder time at it.

m-i-l commented 2 years ago

Thanks for your info. Why would a spammer try to influence autocompletion?

Looking at the logs it seems to me to be searching for strings that have been scraped off other websites, e.g.:
"Here are the most popular 100 articles on "bad toothache pain""
"It is NOT ok to contact this poster with commercial interests."
"This website is proudly using the open source classifieds software OSClass"
"This link directory uses sessions to store information "Add Article""
""I certify that i am at least 18 years old!" "Enter the code from the above image!"
""Enter the code from the above image!" "I certify that i am at least 18 years old!""
""Use the articles in our directory on your website to provide your visitors" vostro"
""About this Qhub" "Recent badges""

Some random IP lookups suggest very geographically diverse sources, e.g. Russia, India, US, UK, The Netherlands, and slightly diverse browsers (mostly Safari, Opera, QQBrowser and Vivaldi).

So I'm wondering if it is some kind of bot farm which is farming original content to place on SEO link farms, or something like that.

Anyway, the nginx config seems to be stopping them from getting anything useful for the time being, and keeping the traffic away from docker + Flask so the CPU utilisation is a little smoother now. But I am still very curious as to what is actually going on.

m-i-l commented 2 years ago

Okay, I think I have found out what is going on. I got the first clue after searching the internet for one of the search terms I'm seeing: "Designed by Mitre Design and SWOOP".

So, there are a number of paid-for black hat SEO tools like ScrapeBox, GSA SEO and SEnuke. SEO spammers enter "scraping footprints" on these, combined with their search terms, to search for URLs to target. A simple scraping footprint example is "Powered by Wordpress", to search for pages that are probably blog pages generated by Wordpress, and an example search term would be "best turntables", so the tool fires off search requests like ""Powered by Wordpress" best turntables" to the search engines. The tool can then use the list of results to do all sorts of further activities, e.g. to target with automated backlink generators, copy content to link farms, scrape for email addresses to spam, etc.
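
To make the footprint pattern concrete, here is a rough sketch of how such queries could be recognised from the defending side. This is not something implemented in this issue. The prefix list and the function name `looks_like_footprint_query` are hypothetical, and a real list of footprints would be much longer.

```python
import re

# Hypothetical footprint prefixes, taken from examples mentioned in this thread.
FOOTPRINT_PREFIXES = ("powered by", "designed by")

# Matches the contents of each double-quoted phrase in a query.
QUOTED_PHRASE = re.compile(r'"([^"]+)"')


def looks_like_footprint_query(query):
    """Return True if any double-quoted phrase starts with a known footprint prefix."""
    for phrase in QUOTED_PHRASE.findall(query):
        if phrase.strip().lower().startswith(FOOTPRINT_PREFIXES):
            return True
    return False
```

A heuristic like this only catches quoted footprints with a known prefix. Unquoted or unfamiliar footprints pass straight through, so it would be a supplement to other defences at best.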

Presumably searchmysite.net has been added as a search engine to one or more of these tools (although I don't know which at the moment), which is why I'm seeing vast numbers of these searches.

Now I'm not sure how many, or even if any, of the results on searchmysite.net will be vulnerable to things like automated backlink generators. But with spammers it is a numbers game - if they feed in millions of links, they only need 0.01% to work and they've still got themselves 100s of URLs that will serve their needs.
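
As a quick illustration of the numbers game, using the 0.01% figure above (the link counts are just example inputs):

```python
from fractions import Fraction

# 0.01% expressed exactly, to avoid floating-point rounding.
SUCCESS_RATE = Fraction(1, 10_000)


def expected_usable_urls(links_fed_in):
    """Number of URLs expected to 'work' at a 0.01% success rate."""
    return int(links_fed_in * SUCCESS_RATE)


for links in (1_000_000, 5_000_000):
    print(f"{links:,} links -> {expected_usable_urls(links)} usable URLs")
```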

Given how lucrative SEO spam appears to be and how well funded these spam enabling operations are, I think it is only a matter of time before they break through the simple defences I've put up, so I'm probably not going to be able to win this one long term on my own in my spare time with no funding, hence my reopening this issue.

One possible solution is to use Cloudflare Scrape Shield to protect searchmysite.net, which is kind-of ironic since Cloudflare is currently blocking the searchmysite.net spider as per https://github.com/searchmysite/searchmysite.net/issues/46. Another solution is simply to switch off the public search and focus on the search as a service as per https://github.com/searchmysite/searchmysite.net/discussions/57.

m-i-l commented 2 years ago

I'm now on over 160,000 spam bot searches per day. The nginx config is still holding up reasonably well, but I don't know for how long.

I've asked for ideas on https://www.indiehackers.com/post/how-do-you-block-the-seo-spam-bots-208f6eb503 .

So far, if the nginx config fails, alternatives include:

alcinnz commented 2 years ago

An idea that just occurred to me for you: switch from free-form text entry to a tagcloud. That could reduce the usefulness of your service for SEO spammers whilst preserving most of it for (the few) actual users.

m-i-l commented 2 years ago

@alcinnz Thanks for your suggestion, and I like the idea. The challenge is that it would be difficult to get a tag cloud sufficiently comprehensive to replace the free text box. You can see from the Browse page that not many sites have tags, and the Filter shows that the tags that there are aren't that useful (top tags are blog, developer, software, programming). I know some blog search engines have manually tagged blog home pages, but (i) that could be a lot of work, and (ii) many of the interesting blogs don't just cover one narrow set of topics but have posts about all sorts of things. I also know there are auto-tagging algorithms, but the results can be a bit hit and miss. It might be something worth exploring as a better interface to the Browse section though.

P.S. There's a bit of a discussion on Hacker News about potential solutions at https://news.ycombinator.com/item?id=31395231 . So far it seems that trying to block them as early as possible via Cloudflare is the best option.

m-i-l commented 2 years ago

I've set up on Cloudflare, and tried a few settings like a firewall rule to block known bots and enabling Bot Fight Mode, but none have had the desired effect. I've asked for suggestions on the specific config required (assuming there is some) on the Cloudflare community at https://community.cloudflare.com/t/how-do-i-block-automated-seo-searches-for-scraping-footprints/384414 .

See also the Firefox search integration and direct-link-to-search-results issues that the reverse proxy config and code change have introduced: https://github.com/searchmysite/searchmysite.net/discussions/59 .

eagle-dogtooth commented 2 years ago

Well this explains why Search My Site searches made via my personal Searx instance started getting blocked earlier this month :)

I'm wondering if there's a solution that will block the automated SEO searches without blocking small-time users using e.g. Searx?

m-i-l commented 2 years ago

Here's the latest stats:

| Day | Real searches | Total searches |
|---|---|---|
| Mon 16 May 2022 | 1,722 | 143,796 |
| Tue 17 May 2022 | 405 | 166,742 |
| Wed 18 May 2022 | 141 | 71,522 |

It's a bit trickier to work out which are the automated SEO search requests now that there are more real searches mixed in with them (a good problem to have :-), and the stats from 18 May are missing around 10 hours because the log files filled up all the disk space (fortunately the only service that was impacted was the analytics).

But looking at the logs, I reckon quite a few real users with real searches have been blocked by the "block requests with no referrer" rule put in to try to block the automated SEO search, which is really unfortunate. I also really don't like that direct links to search results don't work as a result of this rule, plus there is the Firefox search bar issue too, so I've had a look at alternative solutions.

The best I've come up with for now is to "block requests with no referrer where the query string is longer than X characters". The idea is that this can block the longer ""Powered by <system>" <search term>" type of requests while still allowing shorter direct links without referrers, e.g. /search/?q=domain:michael-lewis.com. The nginx config for this is below (noting that nginx if statements can't combine conditions with an "and", so there's a very strange-looking workaround):

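        # Browse, new and random pages are passed straight through with no referrer check.
        location ~ ^/search/(browse|new|random)/ {
                proxy_pass http://127.0.0.1:8080;
        }
        location /search {
                valid_referers server_names *.searchmysite.net;
                # nginx "if" has no "and", so build up a flag string instead:
                # "1" for an invalid referrer, with another "1" appended for a long query string.
                set $showerror 0;
                if ($invalid_referer) {
                        set $showerror 1;
                }
                # Query string of 36 or more characters.
                if ($query_string ~* "^.{36,}$") {
                        set $showerror "${showerror}1";
                }
                # "11" means both conditions matched.
                if ($showerror = 11) {
                        return 403;
                }
                proxy_pass http://127.0.0.1:8080;
        }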

The number 36 is a bit arbitrary. Unfortunately this is now letting through some shorter automated queries like 'Powered by Qhub.com ', but does look like it is letting some real searches through that would otherwise have been blocked, so with a bit of tweaking of the number it might be possible to get a reasonable balance. It is still a long way off what I'd call a good solution though, so I'd still think of it as temporary.
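
To reason about which requests the rule above lets through, here is a rough Python emulation of it. This is a sketch for experimenting with the threshold, not a replacement for the nginx config. It assumes the `server_name` is searchmysite.net, that the query string is measured in its raw URL-encoded form (as nginx's `$query_string` is), and the function names are hypothetical.

```python
from urllib.parse import urlsplit

MIN_BLOCKED_QUERY_LENGTH = 36  # mirrors the {36,} in the nginx regex


def referer_is_valid(referer):
    """Rough emulation of `valid_referers server_names *.searchmysite.net`.

    Assumption: server_name is searchmysite.net. A missing Referer counts
    as invalid because the directive lists neither `none` nor `blocked`.
    """
    if not referer:
        return False
    parts = urlsplit(referer)
    if parts.scheme not in ("http", "https") or parts.hostname is None:
        return False
    host = parts.hostname
    return host == "searchmysite.net" or host.endswith(".searchmysite.net")


def would_return_403(raw_query_string, referer=None):
    """True only when both conditions hold: invalid referer AND long query string."""
    too_long = len(raw_query_string) >= MIN_BLOCKED_QUERY_LENGTH
    return (not referer_is_valid(referer)) and too_long


# A short direct link without a referrer is allowed.
print(would_return_403("q=domain:michael-lewis.com"))
# A long footprint-style query without a referrer is blocked.
print(would_return_403("q=%22Powered+by+Wordpress%22+best+turntables"))
```

Because the length is measured on the encoded query string (including the `q=` prefix), characters such as quotes count as three each, which makes quoted footprint queries hit the limit sooner than their visible length suggests.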

I've set up Cloudflare so the site is now going through that as well. I've spent a bit of time trying out various options, including 2 which I'd have thought would have worked, but didn't. I still want to try a few more things out with Cloudflare before coming to any conclusions.

m-i-l commented 2 years ago

@eagle-dogtooth - I think the searx searches come without a referrer, so yes they'd have been blocked from 9 May. The change I've made today should have unblocked them, but I don't know if adding Cloudflare will have caused an issue, nor if any future changes will affect it. At some point I'd like to take a deeper look at searx though, because it would be nice to make sure it works.

eagle-dogtooth commented 2 years ago

This fixes both Firefox integration and searx, at least for search terms within the length limit. Thanks! I have more comments about searx and a couple of questions - should I start a fresh thread for searx?

m-i-l commented 2 years ago

@eagle-dogtooth Yes, best start a new Discussion.

m-i-l commented 2 years ago

Latest stats are:

| Day | Real searches | Total searches |
|---|---|---|
| Thu 19 May 2022 | 94 | 39,255 |
| Fri 20 May 2022 | 53 | 21,535 |
| Sat 21 May 2022 | 58 | 14,130 |
| Sun 22 May 2022 | 38 | 15,931 |
| Mon 23 May 2022 | 27 | 16,303 |
| Tue 24 May 2022 | 19 | 11,115 |
| Wed 25 May 2022 | 9 | 6,541 |
| Thu 26 May 2022 | 10 | 5,858 |

The configuration on Cloudflare which has been running (unchanged) since Wed 18 May is:

FWIW, previous configurations I tried were to:

As an aside, there have been a couple of issues with Cloudflare:

So I think I'll leave Cloudflare on for now, but leave this issue open for the time being.

m-i-l commented 2 years ago

Latest stats:

| Day | Real searches | Total searches |
|---|---|---|
| Fri 27 May 2022 | 12 | 2,282 |
| Sat 28 May 2022 | 7 | 3,281 |
| Sun 29 May 2022 | 9 | 4,609 |
| Mon 30 May 2022 | 8 | 2,566 |
| Tue 31 May 2022 | 12 | 3,145 |
| Wed 01 Jun 2022 | 6 | 3,887 |
| Thu 02 Jun 2022 | 9 | 7,170 |
| Fri 03 Jun 2022 | 5 | 9,119 |
| Sat 04 Jun 2022 | 3 | 4,259 |
| Sun 05 Jun 2022 | 4 | 3,283 |
| Mon 06 Jun 2022 | 3 | 3,959 |
| Tue 07 Jun 2022 | 3 | 3,074 |
| Wed 08 Jun 2022 | 1 | 3,219 |

So the number of automated SEO searches has been a manageable number for the past couple of weeks.
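
For a rough sense of scale, the daily totals from the stats above (27 May to 8 Jun) can be compared with the 17 May peak reported earlier in this thread. This is only an illustrative calculation on the published figures:

```python
from statistics import mean

# Total searches per day, 27 May to 8 Jun 2022, copied from the stats above.
daily_totals = [2282, 3281, 4609, 2566, 3145, 3887, 7170,
                9119, 4259, 3283, 3959, 3074, 3219]

PEAK_TOTAL = 166_742  # Tue 17 May 2022, from the earlier stats

average = mean(daily_totals)
print(f"average: {average:,.0f} searches/day")
print(f"busiest day in this period: {max(daily_totals):,}")
print(f"average as a share of the 17 May peak: {average / PEAK_TOTAL:.1%}")
```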

This doesn't appear to have happened as a result of any specific actions I've taken, unless it was the reverse proxy configuration which has returned a 403 for most of these for the past month.

For the record, current changes are:

m-i-l commented 2 years ago

Another thought: although I am blocking most of these "automated SEO searches" at the reverse proxy, even if they were to get through, I don't know how useful the results would be to SEO practitioners, given the highly specific terms they are trying to "search engine optimise". To take 3 random ones I saw in the logs, "liquid silicone molding", "White Cherry Gelato" and "pest control sachse" (apparently sachse is an area in Dallas) don't return especially useful results from searchmysite.net (especially without double-quotes to make phrase searches).

Anyway, I've published a blog post at https://blog.searchmysite.net/posts/an-update-on-the-automated-seo-searches-issue/ with a full update on this issue, and am going to close it for now.