scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

SSL issue when scraping website #1429

Closed gmeans closed 7 years ago

gmeans commented 8 years ago

I have a spider that's throwing the following error when trying to crawl this URL.

>>> fetch('https://vconnections.org/resources')
2015-08-12 10:07:28 [scrapy] INFO: Spider opened
2015-08-12 10:07:28 [scrapy] DEBUG: Retrying <GET https://vconnections.org/resources> (failed 1 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 10:07:33 [scrapy] DEBUG: Gave up retrying <GET https://vconnections.org/resources> (failed 2 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/gmeans/.virtualenvs/backlink/lib/python2.7/site-packages/scrapy/shell.py", line 87, in fetch
    reactor, self._schedule, request, spider)
  File "/Users/gmeans/.virtualenvs/backlink/lib/python2.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
ResponseNeverReceived: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]

Other SSL URLs work fine, and I tried implementing the solution from this previous issue:

https://github.com/scrapy/scrapy/issues/981

from OpenSSL import SSL
from twisted.internet.ssl import ClientContextFactory
from twisted.internet._sslverify import ClientTLSOptions
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class CustomContextFactory(ScrapyClientContextFactory):
    def getContext(self, hostname=None, port=None):
        ctx = ClientContextFactory.getContext(self)
        # Enable all workarounds to SSL bugs as documented by
        # http://www.openssl.org/docs/ssl/SSL_CTX_set_options.html
        ctx.set_options(SSL.OP_ALL)
        if hostname:
            # Set the SNI servername for this connection
            ClientTLSOptions(hostname, ctx)
        return ctx

Scrapy==1.0.3 Twisted==15.3.0 pyOpenSSL==0.15.1

OpenSSL 1.0.1k 8 Jan 2015

Any ideas on what else I could try? Thanks!

dangra commented 8 years ago

I can't reproduce it, but there are reports from time to time about errors downloading HTTPS URLs. Maybe the Failure instance and the wrapped exception have more details to spot why it doesn't work.

Another useful trick for debugging SSL issues is trying to reproduce the problem with the openssl s_client command.

$ scrapy version -v 
2015-08-12 13:41:08 [scrapy] INFO: Scrapy 1.0.3 started (bot: testspiders)
2015-08-12 13:41:08 [scrapy] INFO: Optional features available: ssl, http11
2015-08-12 13:41:08 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'testspiders.spiders', 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['testspiders.spiders'], 'BOT_NAME': 'testspiders', 'CLOSESPIDER_TIMEOUT': 3600, 'COOKIES_ENABLED': False, 'CLOSESPIDER_PAGECOUNT': 1000}
Scrapy  : 1.0.3
lxml    : 3.4.4.0
libxml2 : 2.9.2
Twisted : 15.3.0
Python  : 2.7.10 (default, May 26 2015, 04:16:29) - [GCC 5.1.0]
Platform: Linux-4.1.4-1-ARCH-x86_64-with-glibc2.2.5

$ scrapy shell https://vconnections.org/resources
2015-08-12 13:41:32 [scrapy] INFO: Scrapy 1.0.3 started (bot: testspiders)
2015-08-12 13:41:32 [scrapy] INFO: Optional features available: ssl, http11
2015-08-12 13:41:32 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'testspiders.spiders', 'LOGSTATS_INTERVAL': 0, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['testspiders.spiders'], 'BOT_NAME': 'testspiders', 'CLOSESPIDER_TIMEOUT': 3600, 'COOKIES_ENABLED': False, 'CLOSESPIDER_PAGECOUNT': 1000}
2015-08-12 13:41:32 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2015-08-12 13:41:32 [scrapy] INFO: Enabled downloader middlewares: RandomUserAgent, ErrorMonkeyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-12 13:41:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-12 13:41:32 [scrapy] INFO: Enabled item pipelines: 
2015-08-12 13:41:32 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-12 13:41:32 [scrapy] INFO: Spider opened
2015-08-12 13:41:34 [scrapy] DEBUG: Crawled (200) <GET https://vconnections.org/resources> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa98057f050>
[s]   item       {}
[s]   request    <GET https://vconnections.org/resources>
[s]   response   <200 https://vconnections.org/resources>
[s]   settings   <scrapy.settings.Settings object at 0x7fa977e34b90>
[s]   spider     <DefaultSpider 'default' at 0x7fa977550ed0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
2015-08-12 13:41:34 [root] DEBUG: Using default logger
2015-08-12 13:41:34 [root] DEBUG: Using default logger

>>> import OpenSSL.version
>>> OpenSSL.version.__version__
'0.15.1'

$ yaourt -Q openssl
core/openssl 1.0.2.d-1
gmeans commented 8 years ago

I'm not really sure how to read the output of s_client, but I don't think this is correct:

▶ openssl s_client -showcerts -connect vconnections.org:443
CONNECTED(00000003)
140735093723984:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:s23_clnt.c:769:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 7 bytes and written 318 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
---

I get the same result from the server where it's running as well (a Docker container):

root@03517b63d725:/home/app/spider# openssl s_client -showcerts -connect vconnections.org:443
CONNECTED(00000003)
140189365540512:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:s23_clnt.c:770:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 7 bytes and written 305 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
---

I also found this bug on Ubuntu and I wonder if that's what I'm running into:

https://bugs.launchpad.net/ubuntu/+source/openssl/+bug/861137

I see you are running Arch Linux; any chance someone could try this on Ubuntu or OS X?

scrapy shell https://vconnections.org/resources
kmike commented 8 years ago

For me (OS X) it also fails with [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]. Package versions:

Scrapy    : 1.1.0dev1
lxml      : 3.4.4.0
libxml2   : 2.9.0
Twisted   : 15.3.0
Python    : 2.7.6 (default, Nov 25 2013, 05:33:13) - [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)]
pyOpenSSL : 0.15.1 (OpenSSL 0.9.8zf 19 Mar 2015)
Platform  : Darwin-14.4.0-x86_64-i386-64bit

openssl output:

openssl s_client -showcerts -connect vconnections.org:443
CONNECTED(00000003)
91269:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:/SourceCache/OpenSSL098/OpenSSL098-52.30.1/src/ssl/s23_clnt.c:593:
teddb commented 8 years ago

I am having this same issue. I can access HTTP pages through a proxy but not HTTPS pages.

gmeans commented 8 years ago

So after some more googling I found this regarding CloudFlare and SNI:

https://enc.com.au/2015/06/08/checking-cloudflare-ssl/

Using the command mentioned there, I had a successful result.

openssl s_client -connect vconnections.org:443 -servername vconnections.org

Is there a way to use the ContextFactory and set that servername parameter per connection? (Sorry, I'm not familiar with Twisted; I just saw the ContextFactory used in the issue I linked initially.)

I think https://github.com/scrapy/scrapy/issues/981 has a PR related to TLS and SNI as well?

https://github.com/scrapy/scrapy/blob/1.0.3/scrapy/core/downloader/contextfactory.py#L21

Is there an option or anything that needs to be set for the getContext method to get the hostname?

gmeans commented 8 years ago

Using fetch with the PDB option:

scrapy fetch --pdb https://vconnections.org

I can get the actual error:

2015-08-14 10:00:07 [scrapy] DEBUG: Retrying <GET https://vconnections.org/> (failed 2 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
Jumping into debugger for post-mortem of exception '[('SSL routines', 'SSL3_READ_BYTES', 'sslv3 alert handshake failure'), ('SSL routines', 'SSL3_WRITE_BYTES', 'ssl handshake failure')]':

So what I think is happening is that OpenSSL is attempting an SSLv3 handshake even though the TLSv1 method is selected. The issue with this website is that it's using CloudFlare's SSL, and SSLv3 is completely disabled there.

I tried adding the following option:

ctx.set_options(SSL.OP_NO_SSLv3)

But it still looks like an SSLv3 handshake is attempted.

gmeans commented 8 years ago

OK, so after a lengthy pow-wow over on the PyOpenSSL dev channel we got this figured out.

First, the reason for the error is that CloudFlare only supports TLS 1.2, not TLS 1.0. How this worked on @dangra's machine I'm not sure. It was necessary for me to change the method in the context factory to SSLv23_METHOD. Per the PyOpenSSL developers, this is a poorly named option that actually enables protocol negotiation.

You have to be sure the insecure protocols are disabled, however. Looking at Twisted's code these seem to be covered, and honestly I'm not sure how critical that is in the domain of a crawler.
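
For concreteness, here is a minimal sketch of that combination in pyOpenSSL terms (negotiate, but refuse the insecure protocols):

from OpenSSL import SSL

# Despite its name, SSLv23_METHOD means "negotiate the highest version
# both sides support", not "use SSLv2/SSLv3".
ctx = SSL.Context(SSL.SSLv23_METHOD)

# Explicitly rule the insecure protocols out of the negotiation.
ctx.set_options(SSL.OP_NO_SSLv2 | SSL.OP_NO_SSLv3)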

I ran into another interesting issue that may explain the sporadic HTTPS issues @dangra initially mentioned. On OS X, PyOpenSSL actually ends up bound to the original system OpenSSL (0.9.8z). That version of OpenSSL doesn't support TLS 1.1 or TLS 1.2, so even after switching the protocol method I still wasn't able to connect.

To fix that and bind PyOpenSSL to my Homebrew-installed OpenSSL, I had to do the following:

rm -rf ~/Library/Caches/pip
pip uninstall cryptography
ARCHFLAGS="-arch x86_64" LDFLAGS="-L$(brew --prefix openssl)/lib" CFLAGS="-I$(brew --prefix openssl)/include" pip install cryptography
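
A quick sanity check (a sketch) that pyOpenSSL is now bound to the brewed OpenSSL rather than Apple's 0.9.8z:

from OpenSSL import SSL

# Prints the OpenSSL version string pyOpenSSL is linked against,
# e.g. 'OpenSSL 1.0.2d 9 Jul 2015' instead of the old 0.9.8z.
print(SSL.SSLeay_version(SSL.SSLEAY_VERSION))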

Hope this can help someone else dealing with the same issues.

dangra commented 8 years ago

Aside from referencing your findings in a FAQ entry, the only long-term solution I can think of is along the lines of #1435.

dangra commented 8 years ago

Nice job debugging the issue, and thanks for sharing your findings!

wilsoncusack commented 8 years ago

@gmeans could you paste here what your final context factory file ended up looking like? Also, it sounds like your solution necessitates installing OpenSSL through Homebrew? Any side effects of uninstalling cryptography with pip?

gmeans commented 8 years ago

Sure, @wilsoncusack.

Context Factory, really simple in the end:

from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class CustomContextFactory(ScrapyClientContextFactory):
    """
    Custom context factory that allows SSL negotiation.
    """

    def __init__(self):
        # Use SSLv23_METHOD so we can use protocol negotiation
        self.method = SSL.SSLv23_METHOD

Then make sure you update the settings.py:

DOWNLOADER_CLIENTCONTEXTFACTORY = 'spider.contexts.CustomContextFactory'
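
(For reference: the dotted path above assumes the class was saved in a contexts.py module inside a project package named spider, next to settings.py; adjust the path to match your own project layout.)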

Yes I had to update OpenSSL via Homebrew for this to work. That's because Apple has stopped using OpenSSL and switched to their own libraries.

No side effects I've seen, but I did this in a virtualenv.

wilsoncusack commented 8 years ago

@gmeans thanks for the help. Unfortunately for me, if the fix is contingent on that Homebrew piece, I'm not sure this will fix it for running the commands on a Heroku dyno, which is what I'm trying to do.

gmeans commented 8 years ago

@wilsoncusack that Homebrew step is OS X only and is due to Apple's installed OpenSSL version not being the latest and greatest.

I'd imagine Heroku's dynos would have a more up-to-date version of OpenSSL installed. I found this on Stack Overflow in case you are having an OpenSSL version issue:

http://stackoverflow.com/questions/26644020/upgrade-openssl-on-heroku

wilsoncusack commented 8 years ago

Thanks again.

Just posting this here in case anyone else stumbles across it. Running things on a Heroku dyno, Scrapy 1.0.3's pyOpenSSL requirement, 0.15.1, fails to install because the dyno lacks libffi-dev. I am not sure how to remedy this on the one-off Heroku dynos; probably a buildpack. I did get things to work again, though, using the following requirements. No other changes were necessary.

Scrapy==1.0.3
pyopenssl==0.13
MalikRumi commented 8 years ago

I am having this issue as well. I decided to try gmeans's Aug 14 suggestion of using fetch with --pdb in order to see if I would get a specific SSL version error message. I targeted a single page (out of many) that had come back with the <twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>> error, and the whole page downloaded! I was definitely not expecting that result. Any insights?

FWIW, this site opens fine in a browser, but I suppose you all already knew that.

Could this be a difference between my scripted spider and running fetch instead of crawl? How so? Why did gmeans just get an error message while I got the whole page? Is there a clue here that could help resolve this issue?

Then I decided to re-run the spider, but it looks like all I got were duplicate messages.

Then I commented out the pipeline and sent the output to a JSON file. The file was created, but the only thing in it is an opening square bracket: '['.

This is starting to look more like a support issue that should be posted to the Scrapy group and/or Stack Overflow instead of a contribution to tracking a bug, but here it is anyway. I don't understand the intermittent nature of the SSL error. Actually, I guess I don't understand anything about what went wrong here.

redapple commented 7 years ago

Now that #1794 is merged (and available in Scrapy 1.1.0), I'm closing the original issue related to SNI.

The OpenSSL issue with Homebrew is covered by a new FAQ entry in https://github.com/scrapy/scrapy/pull/1909

ghost commented 6 years ago

@gmeans's code gives a warning:

/usr/local/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py:51: builtins.UserWarning: 'broadd.context.CustomContextFactory' does not accept method argument (type OpenSSL.SSL method, e.g. OpenSSL.SSL.SSLv23_METHOD). Please upgrade your context factory class to handle it or ignore it.

redapple commented 6 years ago

@pythoncontrol, you don't need the fix from this issue anymore. Scrapy allows passing the SSL/TLS method to force it (by default it tells Twisted to negotiate the "best", i.e. most secure, option). See https://docs.scrapy.org/en/latest/topics/settings.html#downloader-client-tls-method
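
For example, pinning it in settings.py looks like this (a sketch; see the linked docs for the full list of accepted values):

# settings.py
# 'TLS' (the default) lets Twisted negotiate; 'TLSv1.2' forces that version.
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'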

ghost commented 6 years ago

@redapple I had these errors with all packages up to date: <twisted.python.failure.Failure OpenSSL.SSL.Error: ('SSL routines', 'SSL3_READ_BYTES', 'sslv3 alert handshake failure'), ('SSL routines', 'SSL3_WRITE_BYTES', 'ssl handshake failure')>

redapple commented 6 years ago

What version of OpenSSL are you using? (Share the output of scrapy version -v.) I know there's an issue with OpenSSL 1.1 (see https://twistedmatrix.com/pipermail/twisted-web/2017-April/005293.html). If you send me your target URL, I can have a closer look. You may also open a new issue.

ghost commented 6 years ago

Scrapy    : 1.4.0
lxml      : 3.8.0.0
libxml2   : 2.9.4
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.17.0
Twisted   : 17.5.0
Python    : 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)]
pyOpenSSL : 17.1.0 (OpenSSL 1.1.0f 25 May 2017)
Platform  : Windows-10-10.0.15063-SP0

redapple commented 6 years ago

OK, so I believe you're seeing the same as #2717, which needs a fix in Twisted.

ghost commented 6 years ago

@redapple Thank you

ghost commented 6 years ago

I had this error:

2018-03-12 12:01:01 [scrapy.core.engine] INFO: Spider opened
2018-03-12 12:01:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-12 12:01:01 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-12 12:01:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.zomato.com/ncr/dine-out-in-sector-18?page=1> (failed 1 times): []
2018-03-12 12:01:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.zomato.com/ncr/dine-out-in-sector-18?page=1> (failed 2 times): []
2018-03-12 12:02:00 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.zomato.com/ncr/dine-out-in-sector-18?page=1> (failed 3 times): []
2018-03-12 12:02:00 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.zomato.com/ncr/dine-out-in-sector-18?page=1>: []
2018-03-12 12:02:00 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-12 12:02:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
 'downloader/request_bytes': 735,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 12, 6, 32, 0, 952000),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'retry/count': 2,
 'retry/max_reached': 1,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 2,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2018, 3, 12, 6, 31, 1, 144000)}
2018-03-12 12:02:00 [scrapy.core.engine] INFO: Spider closed (finished)

nikhilredij-gep commented 5 years ago

Sure @wilsoncusack .

Context Factory, really simple in the end:

from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class CustomContextFactory(ScrapyClientContextFactory):
    """
    Custom context factory that allows SSL negotiation.
    """

    def __init__(self):
        # Use SSLv23_METHOD so we can use protocol negotiation
        self.method = SSL.SSLv23_METHOD

Then make sure you update the settings.py:

DOWNLOADER_CLIENTCONTEXTFACTORY = 'spider.contexts.CustomContextFactory'

Yes I had to update OpenSSL via Homebrew for this to work. That's because Apple has stopped using OpenSSL and switched to their own libraries.

No side effect I've seen, but I did this in a virtualenv.

Hi, I am new to Scrapy. Where have you stored this file, and with what name? Also, is 'spider' your bot name?

bpanatta commented 5 years ago

Sure @wilsoncusack . Context Factory, really simple in the end:

from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class CustomContextFactory(ScrapyClientContextFactory):
    """
    Custom context factory that allows SSL negotiation.
    """

    def __init__(self):
        # Use SSLv23_METHOD so we can use protocol negotiation
        self.method = SSL.SSLv23_METHOD

Then make sure you update the settings.py:

DOWNLOADER_CLIENTCONTEXTFACTORY = 'spider.contexts.CustomContextFactory'

Yes I had to update OpenSSL via Homebrew for this to work. That's because Apple has stopped using OpenSSL and switched to their own libraries. No side effect I've seen, but I did this in a virtualenv.

Hi I am new to Scrapy. Where have you stored this file? and with what name? Also is spider your bot-name?

Just set the DOWNLOADER_CLIENT_TLS_METHOD setting to 'TLSv1.2' in your project's settings.py. There is no longer any need to use the custom context factory to solve this problem.

SardarDelha commented 1 year ago

Sure @wilsoncusack .

Context Factory, really simple in the end:

from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class CustomContextFactory(ScrapyClientContextFactory):
    """
    Custom context factory that allows SSL negotiation.
    """

    def __init__(self):
        # Use SSLv23_METHOD so we can use protocol negotiation
        self.method = SSL.SSLv23_METHOD

Then make sure you update the settings.py:

DOWNLOADER_CLIENTCONTEXTFACTORY = 'spider.contexts.CustomContextFactory'

Yes I had to update OpenSSL via Homebrew for this to work. That's because Apple has stopped using OpenSSL and switched to their own libraries.

No side effect I've seen, but I did this in a virtualenv.

Friends, note that this method is obsolete.

hungnguyen259 commented 1 year ago

I am having the same problem; did you solve it?

Gallaecio commented 1 year ago

If you find a URL that you can access with a modern web browser but you cannot access with Scrapy due to an issue like this, please raise a separate issue about it. Comments on closed issues get lost, like tears in rain.