scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Receiving 403 while using proxy server and a valid user agent #6313

Closed devfox-se closed 2 weeks ago

devfox-se commented 2 weeks ago

Hi, I am facing a very strange problem.

I have set up a private Squid proxy server that is accessible only from my IP, and it works: I am able to browse the site I am trying to scrape through Firefox with this proxy enabled.

I have only these anonymity settings enabled in my squid.conf file:

via off
forwarded_for delete

But when I use the same server in Scrapy through the request's proxy meta key, the site just returns 403 Access Denied. To my great surprise, the requests started to work only after I disabled the USER_AGENT setting in my Scrapy settings.

This is the user agent I am using; it is static and not intended to change or rotate:

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"

When I disable this setting, Scrapy still sends its default user agent, but for some reason I do not get the 403 Access Denied error with it:

[b'Scrapy/2.11.1 (+https://scrapy.org)']

It is very confusing. Can someone please help me understand why it fails with a valid user agent header?

Gallaecio commented 2 weeks ago

This is not a Scrapy issue. It is about a target website that, surprisingly, accepts requests from Scrapy but not from a hard-coded web browser user agent string. You'll have to either ask the website owners or try to get help from the community. Please do not open Scrapy issues to ask for help.
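One way to check that the behaviour comes from the website rather than from Scrapy is to send the same two user agent strings through the same proxy with only the Python standard library. The sketch below is an illustration under assumptions: the proxy address and target URL are hypothetical, and `make_request` and `fetch_status` are helper names invented for this example.

```python
import urllib.error
import urllib.request

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
)
SCRAPY_UA = "Scrapy/2.11.1 (+https://scrapy.org)"


def make_request(url, user_agent):
    """Build a GET request carrying only the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, proxy_url, timeout=15):
    """Return the HTTP status code the site sends back through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(make_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return exc.code


if __name__ == "__main__":
    proxy = "http://203.0.113.10:3128"  # hypothetical proxy address
    target = "https://example.com/"     # hypothetical target URL
    for ua in (BROWSER_UA, SCRAPY_UA):
        print(fetch_status(target, ua, proxy), ua)
```

If the status codes differ here in the same way as in Scrapy, the difference is decided by the site based on the request it receives, which is consistent with the point above that this is not a Scrapy bug.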