scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Identical requests sent by Scrapy vs Requests module returning different status codes #4951

Open pmdbt opened 3 years ago

pmdbt commented 3 years ago

Description

Recently, a spider I built to crawl Craigslist for rental listings broke. When I checked the logs, it turned out that all of my requests were hitting HTTP 403 errors. I assumed the issue was not setting proper headers and not using proxies, so I added automatic user-agent and header rotation as well as proxy servers. None of this helped. In a last-ditch effort, I wrote a simple GET request using the requests module. Somehow this default request worked on the same URLs and returned 200 status codes, even though it came from the same IP address without any proxy servers or user agents configured.

I don't understand exactly how the request is sent out by Scrapy vs. the requests module, but even when I configured both to share the exact same request headers, one returns a 403 error while the other returns 200. Based on this StackOverflow post, I'm also not the only one to see this strange result.

Steps to Reproduce

  1. Set up a default Scrapy spider with only default settings active.

  2. Install the latest version of requests and make a default GET request to any site using requests.get("any site"). Get the headers used by this default request. For me it was:

    GET / HTTP/1.1
    Host: localhost:8080
    User-Agent: python-requests/2.25.1
    Accept-Encoding: gzip, deflate
    Accept: */*
    Connection: keep-alive
  3. Configure the headers of the Scrapy spider request call to have the exact same headers from step 2.

                scrapy.Request(
                    url="any website",
                    callback=self.parse,
                    headers={
                        "User-Agent": "python-requests/2.25.1",
                        "Accept-Encoding": "gzip, deflate",
                        "Accept": "*/*",
                        "Connection": "keep-alive"
                        })
  4. Start a Netcat server locally to make sure Scrapy and requests will send the same request object. I started mine on port 8080 with the command nc -l 8080. Now change the request URLs for both Scrapy and requests to "http://localhost:8080". Run both and examine the results.
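
For reference, the requests side of that comparison can be a sketch as small as the one below (the headers mirror steps 2-3; nc never replies, so the call eventually times out after the raw request shows up in the Netcat terminal):

import requests

# Same headers as the Scrapy request in step 3; the URL points at the local Netcat listener.
headers = {
    "User-Agent": "python-requests/2.25.1",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "*/*",
    "Connection": "keep-alive",
}

try:
    requests.get("http://localhost:8080", headers=headers, timeout=5)
except requests.exceptions.Timeout:
    pass  # expected: Netcat prints the raw request but never sends a response back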

For me, I see the following from Netcat for the request sent with requests module:

GET / HTTP/1.1
Host: localhost:8080
User-Agent: python-requests/2.25.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

And I see the following from the Scrapy Spider's request:

GET / HTTP/1.1
User-Agent: python-requests/2.25.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Accept-Language: en
Host: localhost:8080

So they should be sending the same request object with the same info.

  5. Change the URL from "http://localhost:8080" to "https://lasvegas.craigslist.org/d/apts-housing-for-rent/search/apa". You should now see that, for some reason, the requests module returns status code 200, while the spider's request returns a 403 Forbidden error. If you check the response body for the 403, you should see something along the lines of:
    
    # these are custom formatted log outputs from me

2021-01-08 20:26:02 [root] INFO: Http error code 403 with response:

        response headers: {b'Set-Cookie': [b'cl_b=4|5643cfbca785a2e77246555fdf34d45a3a666145|1610166362kVl2U;path=/;domain=.craigslist.org;expires=Fri, 01-Jan-2038 00:00:00 GMT'], b'Strict-Transport-Security': [b'max-age=63072000']}
        ----------------------------------
        original request headers: {b'User-Agent': [b'python-requests/2.25.1'], b'Accept-Encoding': [b'gzip, deflate'], b'Accept': [b'*/*'], b'Connection': [b'keep-alive'], b'Accept-Language': [b'en']}
        ----------------------------------
        body of response: This IP has been automatically blocked.

If you have questions, please email: blocks-b1607628794570390@craigslist.org

        ----------------------------------

**Expected behavior:** When sending seemingly identical requests to the same URL from the same IP address with Scrapy and with the requests module, I expected both to return the same result with the same HTTP status code.

**Actual behavior:** The Scrapy request returns 403 Forbidden while the requests module returns 200 OK.

**Reproduces how often:** 100% for me and another colleague in a different city and state.

### Versions

Scrapy       : 2.1.0
lxml         : 4.6.1.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.6 (default, Oct 10 2020, 07:54:55) - [GCC 5.4.0 20160609]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020)
cryptography : 3.2.1
Platform     : Linux-5.8.0-7630-generic-x86_64-with-glibc2.2.5



### Additional context

I tried this with other sites and it works as intended. No difference between the two requests. However, for some reason, Craigslist is able to tell the two requests apart and identify one as coming from Scrapy, which automatically gets blocked.

Gallaecio commented 3 years ago

From your findings, could the header order be the issue? I believe some antibot software takes that into account.

wRAR commented 3 years ago

This is one of the websites where setting DOWNLOADER_CLIENT_TLS_METHOD=TLSv1.2 helps. Still not sure what is happening.
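
For reference, that is a standard Scrapy setting and goes into the project's settings.py (or a spider's custom_settings):

# settings.py
DOWNLOADER_CLIENT_TLS_METHOD = "TLSv1.2"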

pmdbt commented 3 years ago

@Gallaecio I'll try forcing the order of the headers as well tonight.

@wRAR I'll change that setting and test again tonight.

pmdbt commented 3 years ago

@Gallaecio I tried setting the headers to be in the same order as the one made from the requests library, but it did not resolve the issue.

@wRAR I added DOWNLOADER_CLIENT_TLS_METHOD = "TLSv1.2" in my settings.py file and the problem is now resolved. Thank you so much for the tip!

However, it would still be interesting to know why adding that setting changes things, and if it helps resolve problems on other sites too, maybe the Scrapy team should enable that setting by default?

Gallaecio commented 3 years ago

I don’t think we should enable it by default. But maybe we should document this as one thing to try when getting unexpected responses.

pmdbt commented 3 years ago

@Gallaecio Yeah, that makes sense. Maybe if logging is set to "DEBUG" and 403 errors are encountered, the log could suggest trying this as a potential solution.

Either way, the problem is resolved for me, so if you guys want to close the issue, feel free to do so.

p475453633 commented 3 years ago

This is one of the websites where setting DOWNLOADER_CLIENT_TLS_METHOD=TLSv1.2 helps. Still not sure what is happening.

This setting doesn't work when a proxy is set in the process_request method of a Scrapy middleware, e.g. request.meta["proxy"] = "https://xxxx:xxxx". The website I tried to fetch is Amazon. But when I used requests.get with the same proxy and headers, it gave the correct HTML. So I'm curious about this question...

minhhuu291 commented 3 years ago

You could try setting USER_AGENT in settings.py. For some reason setting it in scrapy.Request doesn't work, but it works fine in settings.py.

imadmoussa1 commented 2 years ago

I have the same problem: while using Scrapy, the response status returned is 403. With a Python script I developed using requests with headers, I got a 200 response. I tried adding the user agent and the client TLS method to settings.py, but it still returns 403:

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
DOWNLOADER_CLIENT_TLS_METHOD = "TLSv1.2"

Version : scrapy==2.4.1

wRAR commented 2 years ago

Then that's a different problem and it's impossible to guess what happened with the info you provided.

imadmoussa1 commented 2 years ago

Then that's a different problem and it's impossible to guess what happened with the info you provided.

I can send you all the info you need; can you tell me which details would help to identify the problem?

wRAR commented 2 years ago

The URL is probably enough.

imadmoussa1 commented 2 years ago

This is the URL I am trying to crawl: https://www.group.tv/news/category/44/%D8%A3%D8%AE%D8%A8%D8%A7%D8%B1-%D8%B3%D9%8A%D8%A7%D8%B3%D9%8A%D8%A9/ar

I used this script and got status 200:

import requests

session = requests.Session()
url = 'https://www.group.tv/news/category/44/%D8%A3%D8%AE%D8%A8%D8%A7%D8%B1-%D8%B3%D9%8A%D8%A7%D8%B3%D9%8A%D8%A9/ar'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
response = session.get(url, headers=headers)
print(response.status_code)

But from Scrapy I get:

(403) <GET https://www.group.tv/news/category/44/%D8%A3%D8%AE%D8%A8%D8%A7%D8%B1-%D8%B3%D9%8A%D8%A7%D8%B3%D9%8A%D8%A9/ar> (referer: None)

wRAR commented 2 years ago

That website is protected by the Cloudflare antibot tool, so it's technically not a Scrapy problem that it's being detected. As plain requests works, you can try comparing the exact requests and reproducing them with Scrapy.

orzel commented 2 years ago

I have a similar problem on a lot of different sites. All are protected by Cloudflare. In all cases, requests.get() just works, and Scrapy fails with 403. It works with Chrome/Firefox as well, without even a captcha or anything "visibly" related to Cloudflare. It fails with curl (the command-line tool, not the PHP stuff).

I tried all of this together:

And all of that fails. The last one I really wonder about, but hey...

My (wild) guess is the following: there are only two remaining points on which Cloudflare can base its accept/deny policy:

I tried, but failed, to create a downloader middleware (via DOWNLOADER_MIDDLEWARES) that would use requests.get() to fetch the pages. Has anyone ever done this?

imadmoussa1 commented 2 years ago

I tried to plug requests.get() into Scrapy at a different level, but I wasn't able to override the default function. I hope the Scrapy team has a solution; it seems a Cloudflare update is able to block the Scrapy request, and all the websites running this Cloudflare version block it. My Scrapy engine is now useless: about 80% of the websites I try to scrape are able to block the request.

Gallaecio commented 2 years ago

I tried, but failed, to create a downloader middleware (via DOWNLOADER_MIDDLEWARES) that would use requests.get() to fetch the pages. Has anyone ever done this?

Sounds interesting as a proof of concept. Since requests seems to be thread-safe, using deferToThread to make requests calls may be feasible.

That said, ideally we should figure out what the key difference is, and provide a way to reproduce the requests library's requests with Scrapy.
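
For what it's worth, a minimal (and deliberately naive) sketch of that idea is a downloader middleware that fetches every page with requests and hands the result back to Scrapy as a response. The module and class names below are hypothetical, and the blocking call would ideally be wrapped in deferToThread as suggested above; this is a proof of concept, not a recommended setup:

# myproject/middlewares.py -- hypothetical proof-of-concept middleware
import requests
from scrapy.http import HtmlResponse


class RequestsDownloaderMiddleware:
    """Fetch pages with the requests library instead of Scrapy's own downloader."""

    def process_request(self, request, spider):
        # NOTE: this blocks the Twisted reactor for the duration of the call;
        # good enough to compare responses, not for production crawling.
        resp = requests.get(
            request.url,
            headers={
                key.decode(): b", ".join(values).decode()
                for key, values in request.headers.items()
            },
            timeout=30,
        )
        # Returning a Response from process_request makes Scrapy skip its
        # built-in download handler for this request.
        return HtmlResponse(
            url=resp.url,
            status=resp.status_code,
            headers=dict(resp.headers),
            body=resp.content,
            request=request,
            encoding=resp.encoding or "utf-8",
        )

Enabling it would then be a matter of adding something like DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RequestsDownloaderMiddleware": 543} to settings.py.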

orzel commented 2 years ago

That said, ideally we should figure out what the key difference is, and provide a way to reproduce the requests library's requests with Scrapy.

I agree 100%. Is there a place where we could talk more 'instantaneously'? IRC or similar? As said, I'm pretty sure the difference is in order and/or case of headers. But I'm not familiar enough with the Scrapy code to go further.

wRAR commented 2 years ago

@orzel we have IRC (#scrapy at Libera.Chat) and Discord.

Gallaecio commented 2 years ago

I'm pretty sure the difference is in order and/or case of headers.

For those we have https://github.com/scrapy/scrapy/issues/2711 and https://github.com/scrapy/scrapy/issues/2803, so if that’s the case we could probably close this issue in favor of those.

yance-dev commented 2 years ago

The same headers work well with requests, but fail with Scrapy for sites protected by Cloudflare.

wRAR commented 2 years ago

@hyyc554 that's what the previous messages here say, yeah.

msenior85 commented 2 years ago

I had the same problem, and after I created a custom downloader middleware (below) that deletes the Accept-Language header value, the request was successful. I think this header might be causing the issue; it is also the only difference between the headers sent by requests and the headers sent by Scrapy in @pmdbt's code.

middlewares.py

class CustomDownloadMiddleware(object):
    def process_request(self, request, spider):
        del request.headers['Accept-Language']

settings.py

DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.CustomDownloadMiddleware': 705,}

megalancast commented 2 years ago

and scrapy fails with 403. It works with chrome/firefox as well, not even a captcha or anything "visibly" related to cloudflare. It fails with curl (command line, not the php stuff).

I am experiencing the same issue and I think this has something to do with TLS fingerprint detection... https://pixeljets.com/blog/scrape-ninja-bypassing-cloudflare-403-code-1020-errors/ Passing requests through this solution helped me for some websites (though it doesn't fix the issue for all of them).

ryanshrott commented 2 years ago

@msenior85 @megalancast I'm still getting the issue. Any long term solutions?

ryanshrott commented 2 years ago

@Gallaecio I have an example of the issue here: https://github.com/ryanshrott/scraping/tree/master/demo_airbnb

This works perfectly: https://github.com/ryanshrott/scraping/blob/master/demo_airbnb/realtorapi.py

But this throws a 403; my goal is to replicate the above in Scrapy: https://github.com/ryanshrott/scraping/blob/master/demo_airbnb/demo_airbnb/spiders/realtor.py

msenior85 commented 2 years ago

@ryanshrott I have run your spider and it executes fine. See the logs here

dragospopa420 commented 2 years ago

Solved this issue: add the following to your spider class: headers = { 'User-Agent': 'some real user agent', 'Accept': '*/*', 'Accept-Encoding': 'gzip,deflate,br', 'Connection': 'keep-alive' }

And also on your class: custom_settings = { 'DEFAULT_REQUEST_HEADERS': headers }

This is the starting point to match the headers from the requests library. If this still fails, try enabling AutoThrottle: add a start delay between 1 and 3 seconds and lower the maximum delay to something realistic like 0.5-1.5 seconds (the default of 60 is enormous).
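
A minimal sketch of those two class attributes on a spider (the spider name and user-agent string are placeholders, and the throttle values are only examples of the suggestion above):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # placeholder spider name

    headers = {
        "User-Agent": "some real user agent",  # replace with a real browser UA string
        "Accept": "*/*",
        "Accept-Encoding": "gzip,deflate,br",
        "Connection": "keep-alive",
    }

    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": headers,
        # Optional throttling, example values only:
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 1,
        "AUTOTHROTTLE_MAX_DELAY": 1.5,
    }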

@Gallaecio I was also thinking that it would be better to change the Scrapy default headers to something more general, like the ones I posted above (without the user agent; each Scrapy user should add their own).

wRAR commented 2 years ago

"The website works if you pass some additional headers" is a trivial case and is not what this issue is about.

As for adding these to the default settings, I don't think replacing the Accept value with */* is a good idea and the Accept-Encoding value is handled by HttpCompressionMiddleware (and will be gzip,deflate,br if a brotli module is installed).

dragospopa420 commented 2 years ago

{ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en', }

Those are the default ones...they have a million reasons to go wrong. I was thinking that a more general approach would fit better.

wRAR commented 2 years ago

Yes, the default ones are not always right, just like the ones you proposed. We have an easy way to customize them.

wolfassi123 commented 1 year ago

I've been having the same issue for a while now as well. Some crawlers work while others return: Ignoring response <403 url>: HTTP status code is not handled or not allowed. I've tried adding all of the headers and tried some solutions mentioned here (like adding custom_settings = {'DEFAULT_REQUEST_HEADERS': headers}), but so far nothing has worked. I tried both removing the cookie from the headers dict and keeping it. So far I still haven't found a fix for the issue. Did anyone manage to find a way to crawl such pages?

wRAR commented 1 year ago

As previous comments say, it's not always possible with plain Scrapy at all.

wolfassi123 commented 1 year ago

As previous comments say, it's not always possible with plain Scrapy at all.

What are some of the procedures that could allow me to solve the issue then?

wRAR commented 1 year ago

The only answer I can give with the info you provided is "It depends".

But assuming requests works for you (as you are commenting in this issue), you can use it instead, or together with Scrapy.

milicamilivojevic commented 7 months ago

The only answer I can give with the info you provided is "It depends".

But assuming requests works for you (as you are commenting in this issue), you can use it instead, or together with Scrapy.

Is there a chance that Scrapy will fix this issue? This should not be closed.

Why can't we reproduce the same request as with requests.post() in Scrapy and get exactly the same response? This is a huge issue.

wRAR commented 7 months ago

Is there a chance that Scrapy will fix this issue?

There are several issues here, and at least some of them are caused by things that are not under Scrapy's control.

This should not be closed.

It can be closed if all actual problems have their own issues filed.

Why can't we reproduce the same request as with requests.post() in Scrapy and get exactly the same response?

Usually because of #2711 and #2803, already mentioned in the earlier comments, but also because even identical HTTP requests can differ in e.g. HTTPS parameters which usually can't be made identical.

milicamilivojevic commented 7 months ago

Thank you for your response. I tested my code with 5 different HTTP clients (curl, wget, Go, Python requests, axios); they all work. Only the Scrapy request doesn't. There are more and more websites where I have the same issue. Since my infrastructure is built entirely on Scrapy, I really want to keep it. For me, Scrapy is very good because you can easily adjust concurrent requests, etc. In combination with requests I couldn't find a way to make concurrent requests easily. Please let me know if you have an example of how to do it.

wRAR commented 7 months ago

Please let me know if you have an example of how to do it.

To do what, sorry?

milicamilivojevic commented 7 months ago

How to use requests.post() in Scrapy. Would it work if I made a custom downloader middleware or something like that?

wRAR commented 7 months ago

A middleware or a download handler, I guess. I don't have examples, though I seem to remember some of the related issues had a proof of concept.

lucaberardi0 commented 7 months ago

I encountered the same problem while scraping a site, and after a while I was able to solve it. In your Python code with requests, after obtaining the response, print the request headers in the console (response.request.headers) and use them for the request with Scrapy. I think the problem is due to the fact that by default requests adds headers that get past antibot checks.
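
A small sketch of that check (the URL is a placeholder):

import requests
import scrapy

resp = requests.get("https://example.com/page")  # placeholder URL
print(resp.request.headers)  # the headers requests actually sent on the wire

# Reuse exactly those headers for the Scrapy request:
request = scrapy.Request(
    "https://example.com/page",
    headers=dict(resp.request.headers),
)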

wRAR commented 7 months ago

Again, as I commented earlier, "The website works if you pass some additional headers" is a trivial case and is not what this issue is about.

lucaberardi0 commented 7 months ago

so what's your solution?

wRAR commented 7 months ago

@lucaberardi0 I don't think this question makes sense? You already have a solution for your trivial case.

ArtemSerdechnyi commented 5 months ago

Hello everyone, I've also encountered this issue. I've written a middleware to address this problem; it sends requests using the aiohttp library. Here's the link to the library: https://github.com/ArtemSerdechnyi/scrapy-aiohttp