owasp-modsecurity / ModSecurity

ModSecurity is an open source, cross platform web application firewall (WAF) engine for Apache, IIS and Nginx. It has a robust event-based programming language which provides protection from a range of attacks against web applications and allows for HTTP traffic monitoring, logging and real-time analysis.
https://www.modsecurity.org
Apache License 2.0

Website crawl issues with status code 406 #2475

Closed reverieinc123 closed 3 years ago

reverieinc123 commented 3 years ago

Describe the bug: We are unable to audit our website with the SEMRush tool. To do so, we need to whitelist the IPs listed below.

Domain: www.reverieinc.com

Logs and dumps

User-agent: Mozilla/5.0 (compatible; SemrushBot-SA/0.97; +http://www.semrush.com/bot.html)

IP addresses: 46.229.173.66, 46.229.173.67, 46.229.173.68

zimmerle commented 3 years ago

Hi @reverieinc123,

ModSecurity is just the engine. The inspection is performed by the rules that you have loaded. Have you loaded custom rules or a public rule set?

reverieinc123 commented 3 years ago

We used the SEMRush tool to audit our website, www.reverieinc.com, but it reports an error saying that the SEMRush bot crawler is being blocked. We contacted the SEMRush team, and they replied: "The software can block our bots from crawling a website and thus prevent users from fully benefitting from our platform's functionality.

You may contact Mod Security in order to whitelist our bot."

I would like to understand why this is happening.

zimmerle commented 3 years ago

@reverieinc123,

ModSecurity processes a ruleset against each request to assess whether the request looks malicious. It is the ruleset that classifies a request as malicious. Somewhere in your ModSecurity configuration you have a rule that is blocking the requests from this crawler.

Usually, when this happens, the fix is to put the bot on some sort of allow list. It is impossible to tell what to do without the details of your ruleset. Can you provide the error message and information about the rule set?
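As a rough illustration of what such an allow list can look like, the Python sketch below renders a ModSecurity exclusion directive for a list of source IPs. It assumes the common pattern of matching `REMOTE_ADDR` with `@ipMatch` and switching the rule engine off for the matching transaction via `ctl:ruleEngine=Off`. The helper name `render_allow_rule` and the rule ID `1000001` are hypothetical; pick an ID that is unused in your own configuration, and load the directive before the rules that do the blocking. Note that this disables all inspection for those addresses, which may be broader than you want once the blocking rule is known.

```python
import ipaddress


def render_allow_rule(ips, rule_id):
    """Return a ModSecurity directive that skips rule processing for the given IPs."""
    # Validate each address here so a typo fails early, not at web server reload.
    checked = [str(ipaddress.ip_address(ip.strip())) for ip in ips]
    ip_list = ",".join(checked)
    return (
        f'SecRule REMOTE_ADDR "@ipMatch {ip_list}" '
        f'"id:{rule_id},phase:1,pass,nolog,ctl:ruleEngine=Off"'
    )


if __name__ == "__main__":
    # IP addresses reported for the Semrush Site Audit bot in this issue.
    semrush_ips = ["46.229.173.66", "46.229.173.67", "46.229.173.68"]
    # 1000001 is a placeholder rule ID (assumption).
    print(render_allow_rule(semrush_ips, rule_id=1000001))
```

The printed line is meant to be pasted into the ModSecurity configuration by hand; the script itself does not touch the server.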

reverieinc123 commented 3 years ago

Dear Team,

FYI, attached log report.

Regards, Sudhakar


IP addresses of Semrush Bot ( Site Audit ): 46.229.173.66, 46.229.173.67, 46.229.173.68

REQUEST: curl -i -sS -L --proto-redir -all,http,https --max-time 5 -A 'Mozilla/5.0 (compatible; SemrushBot-SA/0.97; +http://www.semrush.com/bot.html)' http://reverieinc.com/

RESPONSE:
HTTP/1.1 406 Not Acceptable
Date: Fri, 11 Dec 2020 05:57:58 GMT
Server: Apache
Content-Length: 226
Content-Type: text/html; charset=iso-8859-1

Not Acceptable!

Not Acceptable!

An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.

zimmerle commented 3 years ago

Hi @reverieinc123

Can you double-check the log link or upload? It is not part of this issue.

zimmerle commented 3 years ago

@reverieinc123 ping.

reverieinc123 commented 3 years ago

Hi Team,

Could you please help me with this issue?

When we run Site Audit on SEMRush tools this is what I'm facing.

For this, I have downloaded the log and attached it; please take a look.

[image: image.png]

Regards, Sudhakar


IP addresses of Semrush Bot ( Site Audit ): 46.229.173.66, 46.229.173.67, 46.229.173.68

REQUEST: curl -i -sS -L --proto-redir -all,http,https --max-time 5 -A 'Mozilla/5.0 (compatible; SemrushBot-SA/0.97; +http://www.semrush.com/bot.html)' http://reverieinc.com/

RESPONSE:
HTTP/1.1 406 Not Acceptable
Date: Wed, 16 Dec 2020 13:31:34 GMT
Server: Apache
Content-Length: 226
Content-Type: text/html; charset=iso-8859-1

Not Acceptable!

Not Acceptable!

An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.

zimmerle commented 3 years ago

@reverieinc123 I am afraid you have to go to the GitHub website and paste your log there; attachments sent by email do not show up in the issue.

reverieinc123 commented 3 years ago

IP addresses of Semrush Bot ( Site Audit ): 46.229.173.66, 46.229.173.67, 46.229.173.68

REQUEST: curl -i -sS -L --proto-redir -all,http,https --max-time 5 -A 'Mozilla/5.0 (compatible; SemrushBot-SA/0.97; +http://www.semrush.com/bot.html)' http://reverieinc.com/

RESPONSE:
HTTP/1.1 406 Not Acceptable
Date: Wed, 16 Dec 2020 13:31:34 GMT
Server: Apache
Content-Length: 226
Content-Type: text/html; charset=iso-8859-1

Not Acceptable!

Not Acceptable!

An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.
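One way to narrow this down is to send the same request with and without the bot's User-Agent and compare the status codes. The Python sketch below is a rough equivalent of the curl command above, using only the standard library. The second User-Agent string `example-client/1.0` is a made-up placeholder, and the helper names are hypothetical. A 406 for the bot string and a 200 for the other would suggest a rule that matches on the User-Agent header; running it from your own machine says nothing about rules that match on source IP.

```python
import sys
import urllib.error
import urllib.request

# User-Agent taken from the request posted in this issue.
BOT_UA = "Mozilla/5.0 (compatible; SemrushBot-SA/0.97; +http://www.semrush.com/bot.html)"
# Placeholder User-Agent for comparison (assumption, not a real client).
PLAIN_UA = "example-client/1.0"


def build_request(url, user_agent):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url, user_agent, timeout=5):
    """Return the HTTP status code the server answers with for this User-Agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (such as 406) are raised as HTTPError.
        return err.code


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python compare_ua.py http://reverieinc.com/
    for label, ua in [("bot", BOT_UA), ("plain", PLAIN_UA)]:
        print(label, status_for(sys.argv[1], ua))
```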

reverieinc123 commented 3 years ago

Posted on the website; please take a look.

Regards, Sudhakar


zimmerle commented 3 years ago

@reverieinc123 The error log is something that you can grab from your web server. It contains information on why the request was blocked, for instance the rule ID and the rule message.

Look at your web server's error log. There you will find the entries for the rule that actually caused this request to be blocked.
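A minimal Python sketch for pulling those two fields out of an error log follows. It assumes that ModSecurity entries contain the word `ModSecurity` and carry bracketed fields of the form `[id "..."]` and `[msg "..."]`; the function name `blocking_rules` and the script name in the usage comment are hypothetical, and the log path depends on your server setup.

```python
import re
import sys

# Assumed field layout: bracketed key/value pairs such as [id "..."] and [msg "..."].
ID_RE = re.compile(r'\[id "(\d+)"\]')
MSG_RE = re.compile(r'\[msg "([^"]*)"\]')


def blocking_rules(lines, client_ip=None):
    """Yield (rule_id, message) for ModSecurity entries, optionally for one client IP."""
    for line in lines:
        if "ModSecurity" not in line:
            continue
        # Plain substring filter; good enough to narrow the log to one client.
        if client_ip is not None and client_ip not in line:
            continue
        rule_id = ID_RE.search(line)
        if rule_id is None:
            continue
        msg = MSG_RE.search(line)
        yield rule_id.group(1), (msg.group(1) if msg else "")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_rule.py /path/to/error.log [client-ip]
    ip = sys.argv[2] if len(sys.argv) > 2 else None
    with open(sys.argv[1], errors="replace") as log:
        for rule_id, message in blocking_rules(log, ip):
            print(rule_id, message)
```

Passing one of the Semrush IPs as the second argument limits the output to entries that mention that address. The rule ID it prints is what is needed to decide between an allow list and a targeted rule exclusion.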

zimmerle commented 3 years ago

I am assuming that the user has sorted out the issue.

pirajki commented 2 years ago

I am getting this same error on https://www.yogiraj.co.in

All pages are showing 406 error.

robots.txt is ok

user-agent: *
disallow: /feed
disallow: /wp-admin/
sitemap: https://www.yogiraj.co.in/sitemap.xml