scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

sslv3 alert handshake failure when making a request #1764

Closed by lagenar 7 years ago

lagenar commented 8 years ago

Hi there, I recently upgraded to the latest Scrapy and on some SSL-enabled sites I get an exception when making requests, while on previous Scrapy versions I didn't have this issue. The issue can be reproduced with scrapy shell:

scrapy shell "https://www.gohastings.com/"

The error I get is: Retrying <GET https://www.gohastings.com/> (failed 1 times): <twisted.python.failure.Failure OpenSSL.SSL.Error: ('SSL routines', 'SSL3_READ_BYTES', 'sslv3 alert handshake failure'), ('SSL routines', 'SSL3_WRITE_BYTES', 'ssl handshake failure')>

redapple commented 8 years ago

@lagenar , I can confirm the failure with scrapy 1.0.5 (latest) and also scrapy 1.1.0rc1

Current SSL/TLS connections use TLSv1 method:

TLSv1_method(), TLSv1_server_method(), TLSv1_client_method(): A TLS connection established with these methods will only understand the TLS 1.0 protocol.

The trick from https://github.com/scrapy/scrapy/issues/1429#issuecomment-131782133 worked for me. SSLv23_METHOD really means negotiation. On the wire, I see TLSv1.2 being negotiated.

Define this somewhere in your project (e.g. myproject/contextfactory.py, next to myproject/settings.py)

from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class TLSFlexibleContextFactory(ScrapyClientContextFactory):
    """A more protocol-flexible TLS/SSL context factory.

    A TLS/SSL connection established with [SSLv23_METHOD] may understand
    the SSLv3, TLSv1, TLSv1.1 and TLSv1.2 protocols.
    See https://www.openssl.org/docs/manmaster/ssl/SSL_CTX_new.html
    """

    def __init__(self):
        self.method = SSL.SSLv23_METHOD

and change the HTTP client context factory in your settings.py:

DOWNLOADER_CLIENTCONTEXTFACTORY = 'myproject.contextfactory.TLSFlexibleContextFactory'
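A side note on the confusing name: SSLv23 here means "negotiate the highest protocol version both sides support", not SSL 3.0 specifically. Python's standard-library ssl module (used here as a sketch instead of pyOpenSSL) makes the same point by defining its legacy constant as a plain alias:

```python
# PROTOCOL_SSLv23 is a legacy alias of PROTOCOL_TLS: "negotiate the best
# mutually supported version", despite the "v23" in the name.
import ssl

print(ssl.PROTOCOL_SSLv23 == ssl.PROTOCOL_TLS)  # True
```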

PoulTur commented 8 years ago

@redapple, I experienced a similar problem and error. The site I work with recently changed from HTTP to HTTPS, and now the spider cannot handle it anymore. I tried your solution, but it does not work for me.

redapple commented 8 years ago

@PoulTur, can you provide a URL for this website which fails with Scrapy?

PoulTur commented 8 years ago

https://shop.clares.co.uk

redapple commented 8 years ago

@PoulTur , I'm getting HTTP 200 with https://shop.clares.co.uk/ , with vanilla Scrapy 1.0.5, without the customized context factory. Can you share your logs and package versions?

Here are the packages I have locally:

$ scrapy version -v
Scrapy    : 1.0.5
lxml      : 3.5.0.0
libxml2   : 2.9.2
Twisted   : 15.5.0
Python    : 2.7.9 (default, Apr  2 2015, 15:33:21) - [GCC 4.9.2]
pyOpenSSL : 0.15.1 (OpenSSL 1.0.2d 9 Jul 2015)
Platform  : Linux-4.2.0-27-generic-x86_64-with-Ubuntu-15.10-wily

And shell session:

$ scrapy shell 'https://shop.clares.co.uk'
2016-02-10 11:00:22 [scrapy] INFO: Scrapy 1.0.5 started (bot: sslissues)
2016-02-10 11:00:22 [scrapy] INFO: Optional features available: ssl, http11
2016-02-10 11:00:22 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sslissues.spiders', 'SPIDER_MODULES': ['sslissues.spiders'], 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'sslissues'}
2016-02-10 11:00:22 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-02-10 11:00:22 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-10 11:00:22 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-10 11:00:22 [scrapy] INFO: Enabled item pipelines: 
2016-02-10 11:00:22 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-10 11:00:22 [scrapy] INFO: Spider opened
2016-02-10 11:00:25 [scrapy] DEBUG: Crawled (200) <GET https://shop.clares.co.uk> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f1864003150>
[s]   item       {}
[s]   request    <GET https://shop.clares.co.uk>
[s]   response   <200 https://shop.clares.co.uk>
[s]   settings   <scrapy.settings.Settings object at 0x7f185c51ab10>
[s]   spider     <ClaresSpider 'clares' at 0x7f185bbc6210>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
2016-02-10 11:00:25 [root] DEBUG: Using default logger
2016-02-10 11:00:25 [root] DEBUG: Using default logger

In [1]: print response.headers.to_string()
Set-Cookie: ASP.NET_SessionId=zlavlegzdm4r33ahjw54zj1s; path=/; secure; HttpOnly
Strict-Transport-Security: max-age=300
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge
Cache-Control: private
Date: Wed, 10 Feb 2016 10:00:23 GMT
X-Frame-Options: DENY
Content-Type: text/html; charset=utf-8

PoulTur commented 8 years ago

Hi @redapple, thank you for checking this. I will send the requested info later today. Meanwhile, perhaps you could try to run it a few more times. I actually got a 200 this morning, but only for the first request; after that it didn't work. Anyway, I will post my specs later today.

lagenar commented 8 years ago

Hi @redapple , here's another site that fails despite working fine with the browser. https://www.buyagift.co.uk

Do you plan to fix this on the scrapy trunk or should we use the alternative fix? I haven't tested the alternative fix yet but I'll give it a try in a while.

redapple commented 8 years ago

@lagenar , we added #1629 to 1.1 milestone, so hopefully the core devs will agree on a good solution for most cases. By the way, the custom context factory works for https://www.buyagift.co.uk/, so one more point for #1629

PoulTur commented 8 years ago

@redapple , please see my log and my settings. If you have some hints for me, that would be great. When I ran this from the Scrapy shell it worked, but it does not work from a Scrapy application.

My settings:

C:\AAA\Scrapy\TurScrapy>scrapy version -v
Scrapy  : 1.0.3
lxml    : 3.4.4.0
libxml2 : 2.9.0
Twisted : 15.5.0
Python  : 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)]
Platform: Windows-8-6.2.9200

and the log trace:

2016-02-10 18:45:30 [scrapy] INFO: Scrapy 1.0.3 started (bot: TurScrapy])
2016-02-10 18:45:30 [scrapy] INFO: Optional features available: ssl, http11
2016-02-10 18:45:30 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TurScrapy.spiders', 'SPIDER_MODULES': ['TurScrapy.spiders'], 'BOT_NAME': 'TurScrapy]', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36', 'LOG_FILE': 'c:\\AAA\\Scrapy\\TurScrapy\\Logs\\Log_10-02-2016', 'DOWNLOAD_DELAY': 2, 'DOWNLOADER_CLIENTCONTEXTFACTORY': 'TurScrapy.contextfactory.TLSFlexibleContextFactory'}
2016-02-10 18:45:30 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-10 18:45:31 [py.warnings] WARNING: C:\Python27\lib\site-packages\scrapy\utils\deprecate.py:155: ScrapyDeprecationWarning: `scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware` class is deprecated, use `scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` instead
  ScrapyDeprecationWarning)

2016-02-10 18:45:31 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-10 18:45:31 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-10 18:45:31 [py.warnings] WARNING: C:\AAA\Scrapy\TurScrapy\TurScrapy\pipelines.py:3: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  from scrapy import log

2016-02-10 18:45:31 [scrapy] INFO: Enabled item pipelines: WriteToCsvPipeline
2016-02-10 18:45:31 [scrapy] INFO: Spider opened
2016-02-10 18:45:31 [py.warnings] WARNING: C:\AAA\Scrapy\TurScrapy\TurScrapy\pipelines.py:12: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
  log.msg("opened spider  %s at time %s" % (spider.name,time.strftime("%d-%m-%Y")))

2016-02-10 18:45:31 [scrapy] INFO: opened spider  clares at time 10-02-2016
2016-02-10 18:45:31 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-10 18:45:31 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-10 18:45:51 [scrapy] DEBUG: Retrying <GET https://shop.clares.co.uk/> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2016-02-10 18:46:11 [scrapy] DEBUG: Retrying <GET https://shop.clares.co.uk/> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2016-02-10 18:46:31 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-10 18:46:31 [scrapy] DEBUG: Gave up retrying <GET https://shop.clares.co.uk/> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2016-02-10 18:46:31 [scrapy] ERROR: Error downloading <GET https://shop.clares.co.uk/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2016-02-10 18:46:32 [scrapy] INFO: Closing spider (finished)
2016-02-10 18:46:32 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
 'downloader/request_bytes': 1008,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 2, 10, 17, 46, 32, 17000),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 9,
 'log_count/WARNING': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2016, 2, 10, 17, 45, 31, 197000)}
2016-02-10 18:46:32 [scrapy] INFO: Spider closed (finished)

redapple commented 8 years ago

@PoulTur , can you also provide the console logs for the working (scrapy shell) run? Also, upgrading to Scrapy 1.0.5 would show your OpenSSL version in scrapy version -v (this was added in 1.0.4; 1.0.3 doesn't show it).

You seem to be using a custom ProxyMiddleware. What does it do? Is it also enabled in your scrapy shell test?

PoulTur commented 8 years ago

@redapple I did some updates. I now actually get the error from the shell as well, although it did work for me a few times. I use ProxyMesh; I believe it shouldn't be hooked up when using scrapy shell. Attached are my scrapy shell log and the updated specs.

Scrapy    : 1.0.5
lxml      : 3.5.0.0
libxml2   : 2.9.3
Twisted   : 15.5.0
Python    : 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)]
pyOpenSSL : 0.15.1 (OpenSSL 1.0.2d 9 Jul 2015)
Platform  : Windows-8-6.2.9200

Trace:

Traceback (most recent call last):
  File "c:\python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "c:\python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "c:\python27\Scripts\scrapy.exe\__main__.py", line 9, in <module>
  File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "c:\python27\lib\site-packages\scrapy\commands\shell.py", line 67, in run
    shell.start(url=url)
  File "c:\python27\lib\site-packages\scrapy\shell.py", line 44, in start
    self.fetch(url, spider)
  File "c:\python27\lib\site-packages\scrapy\shell.py", line 87, in fetch
    reactor, self._schedule, request, spider)
  File "c:\python27\lib\site-packages\twisted\internet\threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]

redapple commented 8 years ago

Can you provide the full console log, with the records of Scrapy starting and listing the middlewares?

When I'm stuck, I debug this kind of issue with Wireshark, sniffing what is sent on the wire.

PoulTur commented 8 years ago

@redapple this is everything I got in the trace; I just did not paste the command scrapy shell https://shop.clares.co.uk at the top. I'm not sure about Wireshark: I checked the tool out, but I would not know where to start looking for a solution with it.

redapple commented 8 years ago

@PoulTur , I can't reproduce your issue.

So you don't get the following when you use scrapy shell?

$ scrapy shell https://shop.clares.co.uk
2016-02-11 16:32:41 [scrapy] INFO: Scrapy 1.0.5 started (bot: sslissues)
2016-02-11 16:32:41 [scrapy] INFO: Optional features available: ssl, http11
2016-02-11 16:32:41 [scrapy] INFO: Overridden settings: ...
2016-02-11 16:32:41 [scrapy] INFO: Enabled extensions: ...
2016-02-11 16:32:41 [scrapy] INFO: Enabled downloader middlewares: ...
2016-02-11 16:32:41 [scrapy] INFO: Enabled spider middlewares: ...
2016-02-11 16:32:41 [scrapy] INFO: Enabled item pipelines: 
2016-02-11 16:32:41 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-11 16:32:41 [scrapy] INFO: Spider opened
...

I created a repo to report what I did with scrapy shell and what Wireshark saw on the wire: https://github.com/redapple/scrapy-issues/tree/master/1764

For me, https://shop.clares.co.uk worked with TLSv1, TLSv1.1, TLSv1.2, SSLv3 and OpenSSL's SSLv23 (protocol negotiation)

I'm connecting to 195.200.146.176 on port 443, which is what Google's 8.8.8.8 DNS server returns for "shop.clares.co.uk".

With Wireshark, you can open the different capture files from my tests here: https://github.com/redapple/scrapy-issues/tree/master/1764/pcaps and compare.
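The per-protocol tests in those captures can also be approximated without Wireshark. A minimal sketch using the stdlib ssl module (Python 3.7+ assumed for ssl.TLSVersion; the hostname is the one discussed in this thread), pinning the handshake to one protocol version at a time:

```python
# Sketch: try a TLS handshake pinned to a single protocol version, to see
# which versions a server accepts. Assumes Python 3.7+ (ssl.TLSVersion).
import socket
import ssl

def try_protocol(host, version, port=443, timeout=5):
    """Handshake pinned to one TLS version; return the version or the error."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False      # only the handshake matters here
    ctx.verify_mode = ssl.CERT_NONE
    ctx.minimum_version = version
    ctx.maximum_version = version
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version()
    except (ssl.SSLError, OSError) as exc:
        return repr(exc)

if __name__ == "__main__":
    for version in (ssl.TLSVersion.TLSv1,
                    ssl.TLSVersion.TLSv1_1,
                    ssl.TLSVersion.TLSv1_2):
        print(version.name, "->", try_protocol("shop.clares.co.uk", version))
```

A server that only speaks one of these versions would return an error string for the other two.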

PoulTur commented 8 years ago

@redapple thank you for all your help.

I was not seeing the initial Scrapy messages due to custom logging settings. I think Wireshark and the depths of networking protocols might be too much for poor me. The good news is that your comments led me to try switching off ProxyMesh, and without it the spider is working now.

I'm not sure if you have a hint for my ProxyMesh issue; in the longer run I will probably need the service. I attach the ProxyMesh note below with their statements about HTTPS, which I'm not fully following, on how to adjust the connection settings with Scrapy. You have been very helpful already; I should probably reach out to their support, since per their docs it looks like they know Scrapy.

Does the proxy server support HTTPS/SSL sites? Yes. The proxy server itself is still HTTP, but it can securely proxy HTTPS/SSL connections between you and a HTTPS server (using the CONNECT method). All communication between your client/browser and the secure site is encrypted; the proxy server is only moving the data back and forth. The only caveat is that since the proxy server cannot inspect HTTPS requests, all proxy authorization headers or custom ProxyMesh headers must be sent with the initial CONNECT method. IP based authentication is recommended. End-to-end HTTPS support will be added in the future.
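On the "authorization headers must be sent with the initial CONNECT" part: Scrapy's built-in HttpProxyMiddleware handles this when credentials are embedded in the proxy URL set on request.meta['proxy']; it derives a basic Proxy-Authorization value from the URL's userinfo. A sketch of that derivation, with placeholder host and credentials (not real ProxyMesh values):

```python
# Sketch: the Proxy-Authorization value sent with the initial CONNECT
# when credentials are embedded in the proxy URL. Host and credentials
# below are placeholders, not real ProxyMesh values.
import base64
from urllib.parse import urlparse

PROXY_URL = "http://user:pass@proxy.example:31280"

def proxy_auth(proxy_url):
    """Basic Proxy-Authorization value derived from the URL's userinfo."""
    parts = urlparse(proxy_url)
    creds = "%s:%s" % (parts.username, parts.password)
    return "Basic " + base64.b64encode(creds.encode()).decode()

# In a spider, the middleware picks this up per request:
#     request.meta["proxy"] = PROXY_URL
print(proxy_auth(PROXY_URL))  # Basic dXNlcjpwYXNz
```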

natoinet commented 8 years ago

I implemented the context factory but I'm still having issues with some sites.

$ scrapy version -v
Scrapy    : 1.0.5
lxml      : 3.6.0.0
libxml2   : 2.9.0
Twisted   : 16.0.0
Python    : 2.7.11 (default, Jan 22 2016, 08:28:37) - [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]
pyOpenSSL : 16.0.0 (OpenSSL 0.9.8zg 14 July 2015)
Platform  : Darwin-14.5.0-x86_64-i386-64bit

https://shop.clares.co.uk/ worked before implementing the context factory. After implementing it, scrapy shell also works with https://www.buyagift.co.uk. However, it fails with https://revonsunpeu.net, even after I upgraded OpenSSL with brew.

$ scrapy shell https://revonsunpeu.net
2016-03-24 20:07:07 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapy_googleindex)
2016-03-24 20:07:07 [scrapy] INFO: Optional features available: ssl, http11
2016-03-24 20:07:07 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_googleindex.spiders', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['scrapy_googleindex.spiders'], 'BOT_NAME': 'scrapy_googleindex', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:44.0) Gecko/20100101 Firefox/44.0', 'DOWNLOAD_DELAY': 0.5, 'DOWNLOADER_CLIENTCONTEXTFACTORY': 'scrapy_googleindex.contextfactory.TLSFlexibleContextFactory'}
2016-03-24 20:07:08 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-03-24 20:07:08 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-24 20:07:08 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-24 20:07:08 [scrapy] INFO: Enabled item pipelines: 
2016-03-24 20:07:08 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-24 20:07:08 [scrapy] INFO: Spider opened
2016-03-24 20:07:08 [scrapy] DEBUG: Retrying <GET https://revonsunpeu.net> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]
2016-03-24 20:07:08 [scrapy] DEBUG: Retrying <GET https://revonsunpeu.net> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]
2016-03-24 20:07:09 [scrapy] DEBUG: Gave up retrying <GET https://revonsunpeu.net> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]
Traceback (most recent call last):
  File "/Users/thatsme/.virtualenvs/scraper/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Users/thatsme/.virtualenvs/scraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Users/thatsme/.virtualenvs/scraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/Users/thatsme/.virtualenvs/scraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Users/thatsme/.virtualenvs/scraper/lib/python2.7/site-packages/scrapy/commands/shell.py", line 67, in run
    shell.start(url=url)
  File "/Users/thatsme/.virtualenvs/scraper/lib/python2.7/site-packages/scrapy/shell.py", line 44, in start
    self.fetch(url, spider)
  File "/Users/thatsme/.virtualenvs/scraper/lib/python2.7/site-packages/scrapy/shell.py", line 87, in fetch
    reactor, self._schedule, request, spider)
  File "/Users/thatsme/.virtualenvs/scraper/lib/python2.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]

And the following seems to work well:

$ openssl s_client -connect revonsunpeu.net:443 -servername revonsunpeu.net

Apparently, it fails with websites using Cloudflare's SSL option without an SSL certificate on the server (https://www.cloudflare.com/ssl/). Any idea how to scrape these kinds of websites?
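One plausible culprit with Cloudflare-fronted sites is SNI (Server Name Indication): shared certificates are selected based on the hostname the client sends during the handshake, and the openssl s_client test above succeeds partly because -servername sends SNI. Older OpenSSL builds, such as the 0.9.8zg shown in the version output above, may not send it. A stdlib sketch (modern Python/OpenSSL assumed) to compare handshakes with and without SNI:

```python
# Sketch: compare TLS handshakes with and without SNI, to test whether a
# server (e.g. behind Cloudflare's shared certificates) requires it.
import socket
import ssl

def handshake(host, send_sni, port=443, timeout=5):
    """Handshake with or without SNI; return the TLS version or the error."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False      # only the handshake matters here
    ctx.verify_mode = ssl.CERT_NONE
    server_hostname = host if send_sni else None
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=server_hostname) as tls:
                return tls.version()
    except (ssl.SSLError, OSError) as exc:
        return repr(exc)

if __name__ == "__main__":
    # An SNI-requiring server typically fails the second handshake.
    print("with SNI:   ", handshake("revonsunpeu.net", send_sni=True))
    print("without SNI:", handshake("revonsunpeu.net", send_sni=False))
```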

redapple commented 8 years ago

@natoinet , can I ask you to test with Scrapy 1.1 RC3? (pip install scrapy==1.1.0rc3 in a new virtualenv, for example) In 1.1, we follow Twisted's TLS settings recommendations more closely.

It works for me on all 3 sites you mention, although I'm on Ubuntu so lib versions differ. See below:

(scrapy11rc3.py27) $ scrapy version -v
Scrapy    : 1.1.0rc3
lxml      : 3.5.0.0
libxml2   : 2.9.2
Twisted   : 15.5.0
Python    : 2.7.10 (default, Oct 14 2015, 16:09:02) - [GCC 5.2.1 20151010]
pyOpenSSL : 0.15.1 (OpenSSL 1.0.2d 9 Jul 2015)
Platform  : Linux-4.2.0-34-generic-x86_64-with-Ubuntu-15.10-wily

redapple commented 8 years ago

https://revonsunpeu.net/ , https://www.buyagift.co.uk/ and https://shop.clares.co.uk/ worked for me with Scrapy 1.1 RC3 with updated Twisted (16.0) and pyOpenSSL (16.0), under Python 2.7 and Python 3.4.

(scrapy11rc3.py34) ~$ scrapy version -v
Scrapy    : 1.1.0rc3
lxml      : 3.5.0.0
libxml2   : 2.9.2
Twisted   : 16.0.0
Python    : 3.4.3+ (default, Oct 14 2015, 16:03:50) - [GCC 5.2.1 20151010]
pyOpenSSL : 16.0.0 (OpenSSL 1.0.2d 9 Jul 2015)
Platform  : Linux-4.2.0-34-generic-x86_64-with-Ubuntu-15.10-wily

natoinet commented 8 years ago

@redapple Thanks for your reply. I tried it with scrapy 1.1.0rc3 over Python 2.7 & Python 3.4. It works without TLSFlexibleContextFactory for https://www.buyagift.co.uk/ and https://shop.clares.co.uk/. However, it does not work for https://revonsunpeu.net/, with or without TLSFlexibleContextFactory. Any idea?

(scrapy1.1rc3p3)HeyHeyHey:scrapy_googleindex thatsme $ scrapy version -v
Scrapy    : 1.1.0rc3
lxml      : 3.6.0.0
libxml2   : 2.9.0
Twisted   : 16.0.0
Python    : 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
pyOpenSSL : 16.0.0 (OpenSSL 0.9.8zg 14 July 2015)
Platform  : Darwin-14.5.0-x86_64-i386-64bit
(scrapy1.1rc3p3)HeyHeyHey:scrapy_googleindex thatsme$ scrapy shell https://revonsunpeu.net
2016-03-27 03:44:24 [scrapy] INFO: Scrapy 1.1.0rc3 started (bot: scrapy_googleindex)
2016-03-27 03:44:24 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'scrapy_googleindex', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:44.0) Gecko/20100101 Firefox/44.0', 'DOWNLOAD_DELAY': 0.5, 'NEWSPIDER_MODULE': 'scrapy_googleindex.spiders', 'SPIDER_MODULES': ['scrapy_googleindex.spiders']}
2016-03-27 03:44:24 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats']
2016-03-27 03:44:24 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-03-27 03:44:24 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-03-27 03:44:24 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-27 03:44:24 [scrapy] INFO: Spider opened
2016-03-27 03:44:24 [scrapy] DEBUG: Retrying <GET https://revonsunpeu.net> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]
2016-03-27 03:44:25 [scrapy] DEBUG: Retrying <GET https://revonsunpeu.net> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]
2016-03-27 03:44:26 [scrapy] DEBUG: Gave up retrying <GET https://revonsunpeu.net> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]
Traceback (most recent call last):
  File "/Users/thatsme/.virtualenvs/scrapy1.1rc3p3/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Users/thatsme/.virtualenvs/scrapy1.1rc3p3/lib/python3.4/site-packages/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Users/thatsme/.virtualenvs/scrapy1.1rc3p3/lib/python3.4/site-packages/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/Users/thatsme/.virtualenvs/scrapy1.1rc3p3/lib/python3.4/site-packages/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/Users/thatsme/.virtualenvs/scrapy1.1rc3p3/lib/python3.4/site-packages/scrapy/commands/shell.py", line 71, in run
    shell.start(url=url)
  File "/Users/thatsme/.virtualenvs/scrapy1.1rc3p3/lib/python3.4/site-packages/scrapy/shell.py", line 47, in start
    self.fetch(url, spider)
  File "/Users/thatsme/.virtualenvs/scrapy1.1rc3p3/lib/python3.4/site-packages/scrapy/shell.py", line 112, in fetch
    reactor, self._schedule, request, spider)
  File "/Users/thatsme/.virtualenvs/scrapy1.1rc3p3/lib/python3.4/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "/Users/thatsme/.virtualenvs/scrapy1.1rc3p3/lib/python3.4/site-packages/twisted/python/failure.py", line 368, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]

redapple commented 8 years ago

I tried without changing the context factory; Scrapy 1.1rc3 uses better defaults, so you shouldn't need to tweak anything. Can you try with out-of-the-box Scrapy 1.1rc3? (I used scrapy shell https://...)

redapple commented 8 years ago

Sorry, that's what you did, I misread. Maybe someone on Mac OS can test this. @kmike maybe?

natoinet commented 8 years ago

@redapple @kmike I just updated to the latest version of OSX and the result is identical. Also, when I check the openssl version, I get:

$ openssl version
OpenSSL 1.0.2g  1 Mar 2016
natoinet commented 8 years ago

@redapple You are right, I checked on Debian Wheezy with Python 2.7 & Scrapy 1.1.0rc3 and it works fine. It's only on OSX that it fails with https://revonsunpeu.net. Can anyone else confirm this?
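One way to see why the same code works on Debian but fails on OSX is to check which TLS protocol versions the local OpenSSL build exposes. This is a minimal sketch using the stdlib ssl module (not Scrapy's code path): an OpenSSL 0.9.8 build has no TLS 1.1/1.2 support, so a server that requires them rejects the handshake with the sslv3 alert seen above.

```python
# Sketch: check which TLS protocol constants the local OpenSSL build
# exposes. Python's ssl module only defines a PROTOCOL_* constant when
# the OpenSSL it was compiled against supports that protocol version.
import ssl

for name in ("PROTOCOL_SSLv23", "PROTOCOL_TLSv1",
             "PROTOCOL_TLSv1_1", "PROTOCOL_TLSv1_2"):
    print(name, "available" if hasattr(ssl, name) else "MISSING")

# Against OpenSSL 0.9.8 (the old OSX system library), TLSv1_1 and
# TLSv1_2 are MISSING; against OpenSSL 1.0.1+ they are available.
```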

kneufeld commented 8 years ago

I can confirm that 1.1.0rc3 does not work on OSX (it no longer throws an exception, so at least there's progress). I'm almost positive the issue is the OpenSSL version: here it's 0.9.8zg, while on Linux a newer 1.0.2 release is used. I've updated openssl via Homebrew and tried to reinstall pyOpenSSL via:

rm -rf ~/Library/Caches/pip
pip uninstall cryptography
ARCHFLAGS="-arch x86_64" LDFLAGS="-L$(brew --prefix openssl)/lib" CFLAGS="-I$(brew --prefix openssl)/include" pip install cryptography

but it didn't seem to pick up the newer Homebrew version. I may have done something else wrong, though.

redapple commented 8 years ago

@kneufeld , @natoinet , is any of you able to sniff the TCP traffic when establishing the connection? (with Wireshark or tcpdump for example) we can then compare TLS options that are negotiated between the endpoints.

redapple commented 8 years ago

@kneufeld , when you say

I can confirm that 1.1.0rc3 does not work on OSX

do you mean with https://revonsunpeu.net/, or in general?

kneufeld commented 8 years ago

EDIT: skip two posts down before wasting time on this one

$ scrapy shell  https://revonsunpeu.net/
2016-04-06 16:39:26 [scrapy] INFO: Scrapy 1.1.0rc3 started (bot: scrapybot)
2016-04-06 16:39:26 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-04-06 16:39:26 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-04-06 16:39:26 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-04-06 16:39:26 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-04-06 16:39:26 [scrapy] INFO: Enabled item pipelines:
[]
2016-04-06 16:39:26 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-06 16:39:26 [scrapy] INFO: Spider opened
2016-04-06 16:39:26 [scrapy] DEBUG: Retrying <GET https://revonsunpeu.net/> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]
2016-04-06 16:39:26 [scrapy] DEBUG: Retrying <GET https://revonsunpeu.net/> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]
2016-04-06 16:39:26 [scrapy] DEBUG: Gave up retrying <GET https://revonsunpeu.net/> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]
Traceback (most recent call last):
  File "/Users/kneufeld/.virtualenvs/devgrabber/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Users/kneufeld/.virtualenvs/devgrabber/lib/python2.7/site-packages/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Users/kneufeld/.virtualenvs/devgrabber/lib/python2.7/site-packages/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/Users/kneufeld/.virtualenvs/devgrabber/lib/python2.7/site-packages/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/Users/kneufeld/.virtualenvs/devgrabber/lib/python2.7/site-packages/scrapy/commands/shell.py", line 71, in run
    shell.start(url=url)
  File "/Users/kneufeld/.virtualenvs/devgrabber/lib/python2.7/site-packages/scrapy/shell.py", line 47, in start
    self.fetch(url, spider)
  File "/Users/kneufeld/.virtualenvs/devgrabber/lib/python2.7/site-packages/scrapy/shell.py", line 112, in fetch
    reactor, self._schedule, request, spider)
  File "/Users/kneufeld/.virtualenvs/devgrabber/lib/python2.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert handshake failure')]>]
$ scrapy version -v
Scrapy    : 1.1.0rc3
lxml      : 3.6.0.0
libxml2   : 2.9.0
Twisted   : 16.1.0
Python    : 2.7.11 (default, Dec 26 2015, 17:47:15) - [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]
pyOpenSSL : 16.0.0 (OpenSSL 0.9.8zg 14 July 2015)
Platform  : Darwin-14.5.0-x86_64-i386-64bit

wireshark trace forthcoming

kneufeld commented 8 years ago

tls_handshake.pcap.zip

kneufeld commented 8 years ago

Well I have no idea what's going on now.

I've uninstalled cryptography and pyopenssl and reinstalled them, with and without the brew prefix flags; pyOpenSSL is now linking against the newer Homebrew OpenSSL and everything is working.

So, to anybody who's hitting this bug:

scrapy version -v | grep pyOpenSSL

will return one of the following:

pyOpenSSL : 16.0.0 (OpenSSL 1.0.2g  1 Mar 2016)
pyOpenSSL : 16.0.0 (OpenSSL 0.9.8zg 14 July 2015)

If it reports OpenSSL 0.9.8, you need to rebuild cryptography and pyOpenSSL against the newer Homebrew OpenSSL:

rm -rf ~/Library/Caches/pip
pip uninstall cryptography pyopenssl
env LDFLAGS="-L$(brew --prefix openssl)/lib" CFLAGS="-I$(brew --prefix openssl)/include" pip install cryptography pyopenssl
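After the reinstall, a quick sanity check (a sketch, not part of the original instructions) is to print the OpenSSL versions directly. Note that the stdlib ssl module and pyOpenSSL can be linked against different OpenSSL copies, which is exactly the situation on the affected OSX setups, so both are shown; `scrapy version -v` reports the pyOpenSSL line.

```python
# Print the OpenSSL each binding was compiled against.
import ssl

print("stdlib ssl:", ssl.OPENSSL_VERSION)

try:
    from OpenSSL import SSL
    # SSLeay_version exists in the pyOpenSSL 16.x used in this thread;
    # newer releases may expose the version differently.
    print("pyOpenSSL :", SSL.SSLeay_version(SSL.SSLEAY_VERSION).decode())
except Exception as exc:
    # pyOpenSSL missing, or this API has changed in the installed release
    print("pyOpenSSL : not available ({!r})".format(exc))
```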

So basically you can probably ignore my previous two posts.

natoinet commented 8 years ago

Following @kneufeld's recommendation, after doing the following, scrapy shell https://revonsunpeu.net works on OSX 10.11.4:

$ brew link openssl --force
$ mkvirtualenv --python=`which python3` nameofenvironment
$ env LDFLAGS="-L$(brew --prefix openssl)/lib" CFLAGS="-I$(brew --prefix openssl)/include" pip install cryptography pyopenssl
$ pip install Scrapy==1.1.0rc3

Thanks!

redapple commented 8 years ago

@kneufeld , @natoinet , great to hear that! This perhaps warrants an entry in the FAQ. Would either of you like to contribute docs about OpenSSL on OSX?

kneufeld commented 8 years ago

Here ya go: https://github.com/scrapy/scrapy/pull/1909

redapple commented 8 years ago

thanks @kneufeld

natoinet commented 8 years ago

Thanks @kneufeld !

redapple commented 8 years ago

@PoulTur , as you're using ProxyMesh, you may be interested in testing https://github.com/scrapy/scrapy/pull/1938. The current master branch has an issue accessing HTTPS websites that require SNI through proxies.
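For context, SNI (Server Name Indication) is the TLS extension that puts the target hostname into the ClientHello so a server hosting multiple certificates can pick the right one; when tunnelling through a proxy it must still carry the origin host, not the proxy's. A minimal stdlib illustration of where the name is set (this is not Scrapy's Twisted-based code path, and no network traffic happens here since the handshake is deferred until connect):

```python
# The server_hostname argument is what puts the SNI extension into the
# TLS ClientHello; the handshake itself only runs on connect().
import socket
import ssl

ctx = ssl.create_default_context()
raw = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tls = ctx.wrap_socket(raw, server_hostname="example.com")
print(tls.server_hostname)  # the name that will be sent as SNI
tls.close()
```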

redapple commented 7 years ago

Now that #1794 is merged, I'm closing this as far as the original issue about TLS protocol version negotiation goes. The remaining OSX concerns are to be covered by (at least) a FAQ entry: https://github.com/scrapy/scrapy/pull/1909