gmeans closed this issue 7 years ago
I can't reproduce it, but there are reports from time to time about errors downloading https URLs. Maybe the Failure instance and the wrapped Exception have more details to spot why it doesn't work.
Another useful trick to debug SSL issues is trying to reproduce the problem with the openssl s_client command.
$ scrapy version -v
2015-08-12 13:41:08 [scrapy] INFO: Scrapy 1.0.3 started (bot: testspiders)
2015-08-12 13:41:08 [scrapy] INFO: Optional features available: ssl, http11
2015-08-12 13:41:08 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'testspiders.spiders', 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['testspiders.spiders'], 'BOT_NAME': 'testspiders', 'CLOSESPIDER_TIMEOUT': 3600, 'COOKIES_ENABLED': False, 'CLOSESPIDER_PAGECOUNT': 1000}
Scrapy : 1.0.3
lxml : 3.4.4.0
libxml2 : 2.9.2
Twisted : 15.3.0
Python : 2.7.10 (default, May 26 2015, 04:16:29) - [GCC 5.1.0]
Platform: Linux-4.1.4-1-ARCH-x86_64-with-glibc2.2.5
$ scrapy shell https://vconnections.org/resources
2015-08-12 13:41:32 [scrapy] INFO: Scrapy 1.0.3 started (bot: testspiders)
2015-08-12 13:41:32 [scrapy] INFO: Optional features available: ssl, http11
2015-08-12 13:41:32 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'testspiders.spiders', 'LOGSTATS_INTERVAL': 0, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['testspiders.spiders'], 'BOT_NAME': 'testspiders', 'CLOSESPIDER_TIMEOUT': 3600, 'COOKIES_ENABLED': False, 'CLOSESPIDER_PAGECOUNT': 1000}
2015-08-12 13:41:32 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2015-08-12 13:41:32 [scrapy] INFO: Enabled downloader middlewares: RandomUserAgent, ErrorMonkeyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-12 13:41:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-12 13:41:32 [scrapy] INFO: Enabled item pipelines:
2015-08-12 13:41:32 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-12 13:41:32 [scrapy] INFO: Spider opened
2015-08-12 13:41:34 [scrapy] DEBUG: Crawled (200) <GET https://vconnections.org/resources> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fa98057f050>
[s] item {}
[s] request <GET https://vconnections.org/resources>
[s] response <200 https://vconnections.org/resources>
[s] settings <scrapy.settings.Settings object at 0x7fa977e34b90>
[s] spider <DefaultSpider 'default' at 0x7fa977550ed0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
2015-08-12 13:41:34 [root] DEBUG: Using default logger
>>> import OpenSSL.version
>>> OpenSSL.version.__version__
'0.15.1'
$ yaourt -Q openssl
core/openssl 1.0.2.d-1
I'm not really sure how to read the output of s_client, but I don't think this is correct:
▶ openssl s_client -showcerts -connect vconnections.org:443
CONNECTED(00000003)
140735093723984:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:s23_clnt.c:769:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 7 bytes and written 318 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
---
I get the same result from the server where it's running as well (a Docker container):
root@03517b63d725:/home/app/spider# openssl s_client -showcerts -connect vconnections.org:443
CONNECTED(00000003)
140189365540512:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:s23_clnt.c:770:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 7 bytes and written 305 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
---
I also found this bug on Ubuntu and I wonder if that's what I'm running into:
https://bugs.launchpad.net/ubuntu/+source/openssl/+bug/861137
I see you are running Arch Linux. Any chance someone could try this on Ubuntu or OS X?
scrapy shell https://vconnections.org/resources
For me (OS X) it also fails with [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]. Package versions:
Scrapy : 1.1.0dev1
lxml : 3.4.4.0
libxml2 : 2.9.0
Twisted : 15.3.0
Python : 2.7.6 (default, Nov 25 2013, 05:33:13) - [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)]
pyOpenSSL : 0.15.1 (OpenSSL 0.9.8zf 19 Mar 2015)
Platform : Darwin-14.4.0-x86_64-i386-64bit
openssl output:
openssl s_client -showcerts -connect vconnections.org:443
CONNECTED(00000003)
91269:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:/SourceCache/OpenSSL098/OpenSSL098-52.30.1/src/ssl/s23_clnt.c:593:
I am having this same issue. I can access http pages through a proxy but not https pages.
So after some more googling I found this regarding Cloudflare and SNI:
https://enc.com.au/2015/06/08/checking-cloudflare-ssl/
Using the command mentioned there, I had a successful result.
openssl s_client -connect vconnections.org:443 -servername vconnections.org
Is there a way to use the ContextFactory and set that servername parameter per connection? (Sorry, I'm not familiar with Twisted. I just saw the ContextFactory used in the issue I linked initially.)
I think https://github.com/scrapy/scrapy/issues/981 has a PR related to TLS and SNI as well?
https://github.com/scrapy/scrapy/blob/1.0.3/scrapy/core/downloader/contextfactory.py#L21
Is there an option or anything that needs to be set for the getContext method to get the hostname?
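For what it's worth, the mechanism the -servername flag exercises is SNI (Server Name Indication). This is not Scrapy's code path, just a minimal stdlib sketch of what passing the hostname does; the host name used here is the one from this thread:

```python
import socket
import ssl

# Illustration only: when server_hostname is given, the ssl module will
# send an SNI extension in the TLS ClientHello, which CloudFlare-style
# shared hosting needs in order to pick the right certificate.
ctx = ssl.create_default_context()
sock = ctx.wrap_socket(socket.socket(), server_hostname="vconnections.org")
print(sock.server_hostname)  # the name that will be sent via SNI on connect
```

The handshake itself only happens on connect; the point is that without a server_hostname no SNI is sent, which matches the bare s_client failure above.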
Using fetch with the --pdb option:
scrapy fetch --pdb https://vconnections.org
I can get the actual error:
2015-08-14 10:00:07 [scrapy] DEBUG: Retrying <GET https://vconnections.org/> (failed 2 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
Jumping into debugger for post-mortem of exception '[('SSL routines', 'SSL3_READ_BYTES', 'sslv3 alert handshake failure'), ('SSL routines', 'SSL3_WRITE_BYTES', 'ssl handshake failure')]':
So what I think is happening is OpenSSL is attempting an SSLv3 handshake even though the TLSv1 method is selected. The issue with this website is that it's using CloudFlare's SSL and SSLv3 is completely disabled.
I tried adding the following option:
ctx.set_options(SSL.OP_NO_SSLv3)
But it still looks like an SSLv3 handshake is attempted.
OK, so after a lengthy pow-wow over on the PyOpenSSL dev channel we got this figured out.
First, the reason for the error is that CloudFlare only supports TLS 1.2 and not TLS 1.0. How this worked on @dangra's machine I'm not sure. It was necessary for me to change the method in the context factory to SSLv23_METHOD. Per the PyOpenSSL devs, this is a poorly named option that actually allows protocol negotiation.
You have to be sure the insecure protocols are disabled, however. Looking at Twisted's code, these seem to be covered, and honestly I'm not sure how critical that is in the domain of a crawler.
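As an illustration of the negotiate-then-disable pattern (using Python's stdlib ssl module rather than the pyOpenSSL API Twisted uses, so this is an analogue, not the actual context factory code):

```python
import ssl

# PROTOCOL_SSLv23 is the stdlib's equally badly named "negotiate the
# highest protocol version both sides support" method; the insecure
# protocols are then masked off explicitly with option flags.
ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
ctx.options |= ssl.OP_NO_SSLv2 | ssl.OP_NO_SSLv3
```

With this, a client and a TLS-1.2-only server (like CloudFlare here) can still agree on TLS 1.2, while SSLv2/SSLv3 are never offered.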
I ran into another interesting issue that may explain the sporadic HTTPS issues @dangra initially mentioned. On OS X, PyOpenSSL actually ends up bound to the original system OpenSSL (0.9.8z). That version doesn't support TLS 1.1 or TLS 1.2, so even after switching the protocol methods I wasn't able to connect initially.
To fix that and bind PyOpenSSL to my homebrew installed OpenSSL I had to do the following:
rm -rf ~/Library/Caches/pip
pip uninstall cryptography
ARCHFLAGS="-arch x86_64" LDFLAGS="-L$(brew --prefix openssl)/lib" CFLAGS="-I$(brew --prefix openssl)/include" pip install cryptography
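A quick way to sanity-check which OpenSSL your Python stack is linked against (note: this reports the stdlib ssl module's build, which is not guaranteed to match what pyOpenSSL/cryptography linked, so treat it as a hint rather than proof):

```python
import ssl

# Prints the OpenSSL version string the stdlib ssl module was built
# against; a 0.9.8 build here means no TLS 1.1/1.2 support.
print(ssl.OPENSSL_VERSION)
```

On an affected OS X setup this prints the stock 0.9.8z string; after the reinstall above, pyOpenSSL reports the Homebrew version instead.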
Hope this can help someone else dealing with the same issues.
Aside from referencing your findings in a FAQ entry, the only long-term solution I can think of is along the lines of #1435.
Nice job debugging the issue, and thanks for sharing your findings!
@gmeans could you paste here what your final context factory file ended up looking like? Also, it sounds like your solution necessitates installing OpenSSL through Homebrew? Any side effects of uninstalling cryptography with pip?
Sure @wilsoncusack .
Context Factory, really simple in the end:
from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class CustomContextFactory(ScrapyClientContextFactory):
    """
    Custom context factory that allows SSL negotiation.
    """
    def __init__(self):
        # Use SSLv23_METHOD so we can use protocol negotiation
        self.method = SSL.SSLv23_METHOD
Then make sure you update the settings.py:
DOWNLOADER_CLIENTCONTEXTFACTORY = 'spider.contexts.CustomContextFactory'
Yes I had to update OpenSSL via Homebrew for this to work. That's because Apple has stopped using OpenSSL and switched to their own libraries.
No side effect I've seen, but I did this in a virtualenv.
@gmeans thanks for the help. Unfortunately for me, if the fix is contingent on that Homebrew piece, I'm not sure this will work for running the commands on a Heroku dyno, which is what I'm trying to do.
@wilsoncusack that Homebrew step is OS X only and is due to Apple's installed OpenSSL version not being the latest and greatest.
I'd imagine Heroku's dynos have a more up-to-date version of OpenSSL installed. I found this on Stack Overflow in case you are having an OpenSSL version issue:
http://stackoverflow.com/questions/26644020/upgrade-openssl-on-heroku
Thanks again.
Just posting this here in case anyone else stumbles across it. Running things on a Heroku dyno, Scrapy 1.0.3's pyOpenSSL requirement, 0.15.1, fails to install because the dyno lacks libffi-dev. I am not sure how to remedy this on the one-off Heroku dynos; probably a buildpack. I did get things to work again, though, using the following requirements. No other changes were necessary.
Scrapy==1.0.3
pyopenssl==0.13
I am having this issue as well. I decided to try gmeans's Aug 14 suggestion of using fetch with --pdb in order to see if I would get a specific SSL version error message. I targeted a single page (out of many) that had come back with the <twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>...., and the whole page downloaded! I was definitely not expecting that result. Any insights?
FWIW, this site opens fine in a browser, but I suppose you all already knew that.
Could this be a difference between my scripted spider and running fetch instead of crawl? How so? Why did gmeans just get an error message while I got the whole page? Is there a clue here to help resolve this issue?
Then I decided to re-run the spider, but it looks like all I got were duplicate messages.
Then I commented out the pipeline and sent the output to a json file. The file was created, but the only thing in it is an opening square brace: ' [ '.
This is starting to look more like a support issue that should be posted to the Scrapy group and/or Stack Overflow instead of a contribution to tracking a bug, but here it is anyway. I don't understand the intermittent nature of the SSL error. Actually, I guess I don't understand anything about what went wrong here.
Now that #1794 is merged (and available in scrapy 1.1.0), I'm closing the original issue related to SNI.
The openssl issue with homebrew is covered by a new FAQ entry in https://github.com/scrapy/scrapy/pull/1909
@gmeans code gives warning:
/usr/local/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py:51: builtins.UserWarning: 'broadd.context.CustomContextFactory' does not accept method argument (type OpenSSL.SSL method, e.g. OpenSSL.SSL.SSLv23_METHOD). Please upgrade your context factory class to handle it or ignore it.
@pythoncontrol, you don't need the fix from this issue anymore. Scrapy allows passing the SSL/TLS method to force it (by default it tells Twisted to negotiate the "best", i.e. most secure, option). See https://docs.scrapy.org/en/latest/topics/settings.html#downloader-client-tls-method
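For example, in a project's settings.py (a sketch; the setting name is per the linked docs, and forcing a version is only needed when negotiation fails):

```python
# settings.py -- force a specific TLS version instead of negotiating
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'
```

Leaving the setting at its default keeps version negotiation, which is usually what you want.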
@redapple I had errors:
<twisted.python.failure.Failure OpenSSL.SSL.Error: ('SSL routines', 'SSL3_READ_BYTES', 'sslv3 alert handshake failure'), ('SSL routines', 'SSL3_WRITE_BYTES', 'ssl handshake failure')>
with all packages up-to-date
What version of OpenSSL are you using (share the output of scrapy version -v)? I know there's an issue with OpenSSL 1.1 (see https://twistedmatrix.com/pipermail/twisted-web/2017-April/005293.html).
If you send me your target URL, I can have a closer look. You may also open a new issue.
Scrapy : 1.4.0
lxml : 3.8.0.0
libxml2 : 2.9.4
cssselect : 1.0.1
parsel : 1.2.0
w3lib : 1.17.0
Twisted : 17.5.0
Python : 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)]
pyOpenSSL : 17.1.0 (OpenSSL 1.1.0f 25 May 2017)
Platform : Windows-10-10.0.15063-SP0
Ok, so I believe you're seeing the same as #2717, which needs a fix in Twisted.
@redapple Thank you
I had this error:
2018-03-12 12:01:01 [scrapy.core.engine] INFO: Spider opened
2018-03-12 12:01:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-12 12:01:01 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-12 12:01:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.zomato.com/ncr/dine-out-in-sector-18?page=1> (failed 1 times): [
Sure @wilsoncusack .
Context Factory, really simple in the end:
from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class CustomContextFactory(ScrapyClientContextFactory):
    """
    Custom context factory that allows SSL negotiation.
    """
    def __init__(self):
        # Use SSLv23_METHOD so we can use protocol negotiation
        self.method = SSL.SSLv23_METHOD
Then make sure you update the settings.py:
DOWNLOADER_CLIENTCONTEXTFACTORY = 'spider.contexts.CustomContextFactory'
Yes I had to update OpenSSL via Homebrew for this to work. That's because Apple has stopped using OpenSSL and switched to their own libraries.
No side effect I've seen, but I did this in a virtualenv.
Hi I am new to Scrapy. Where have you stored this file? and with what name? Also is spider your bot-name?
Just set the DOWNLOADER_CLIENT_TLS_METHOD property to 'TLSv1.2' in the settings.py of your project. There is no more need for you to use the custom context factory to solve this problem.
Friends, note that this custom context factory approach is now obsolete.
I am having the same problem. Did you solve it?
If you find a URL that you can access with a modern web browser but you cannot access with Scrapy due to an issue like this, please raise a separate issue about it. Comments on closed issues get lost, like tears in rain.
I have a spider that's throwing the following error when trying to crawl this URL.
Other SSL urls work fine, and I tried implementing the solution from this previous issue:
https://github.com/scrapy/scrapy/issues/981
Scrapy==1.0.3
Twisted==15.3.0
pyOpenSSL==0.15.1
OpenSSL 1.0.1k 8 Jan 2015
Any ideas on what else I could try? Thanks!