scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

TLS handshake failure #2717

Closed povilasb closed 4 years ago

povilasb commented 7 years ago

I have this simple spider:

import scrapy

class FailingSpider(scrapy.Spider):
    name = 'Failing Spider'
    start_urls = ['https://www.skelbiu.lt/']

    def parse(self, response: scrapy.http.Response) -> None:
        pass

On Debian 9 it fails with:

2017-04-25 19:01:39 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.skelbiu.lt/>
Traceback (most recent call last):
  File "/home/povilas/projects/skelbiu-scraper/pyenv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/povilas/projects/skelbiu-scraper/pyenv/lib/python3.6/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/povilas/projects/skelbiu-scraper/pyenv/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]>]

On Debian 8 it works fine, and https://www.skelbiu.lt is the only target where I can reproduce the problem.

Some more context:

$ pyenv/bin/pip freeze
asn1crypto==0.22.0
attrs==16.3.0
Automat==0.5.0
cffi==1.10.0
constantly==15.1.0
cryptography==1.8.1
cssselect==1.0.1
funcsigs==0.4
idna==2.5
incremental==16.10.1
lxml==3.7.3
mock==1.3.0
packaging==16.8
parsel==1.1.0
pbr==3.0.0
py==1.4.33
pyasn1==0.2.3
pyasn1-modules==0.0.8
pycparser==2.17
PyDispatcher==2.0.5
PyHamcrest==1.8.5
pyOpenSSL==17.0.0
pyparsing==2.2.0
pytest==2.7.2
queuelib==1.4.2
Scrapy==1.3.3
service-identity==16.0.0
six==1.10.0
Twisted==17.1.0
w3lib==1.17.0
zope.interface==4.4.0

$ dpkg --get-selections | grep libssl
libssl-dev:amd64                                install
libssl-doc                                      install
libssl1.0.2:amd64                               install
libssl1.1:amd64                                 install
libssl1.1:i386                                  install

$ apt-cache show libssl1.1
Package: libssl1.1
Source: openssl
Version: 1.1.0e-1

Any ideas what I should look for? :)

povilasb commented 7 years ago

My hypothesis is that the server rejects the TLS Client Hello because of the cipher suites it offers:

Cipher Suites (28 suites)
    Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 (0xc02c)
    Cipher Suite: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (0xc030)
    Cipher Suite: TLS_DHE_RSA_WITH_AES_256_GCM_SHA384 (0x009f)
    Cipher Suite: TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256 (0xcca9)
    Cipher Suite: TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 (0xcca8)
    Cipher Suite: TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256 (0xccaa)
    Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256 (0xc02b)
    Cipher Suite: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (0xc02f)
    Cipher Suite: TLS_DHE_RSA_WITH_AES_128_GCM_SHA256 (0x009e)
    Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384 (0xc024)
    Cipher Suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384 (0xc028)
    Cipher Suite: TLS_DHE_RSA_WITH_AES_256_CBC_SHA256 (0x006b)
    Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256 (0xc023)
    Cipher Suite: TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 (0xc027)
    Cipher Suite: TLS_DHE_RSA_WITH_AES_128_CBC_SHA256 (0x0067)
    Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA (0xc00a)
    Cipher Suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA (0xc014)
    Cipher Suite: TLS_DHE_RSA_WITH_AES_256_CBC_SHA (0x0039)
    Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA (0xc009)
    Cipher Suite: TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA (0xc013)
    Cipher Suite: TLS_DHE_RSA_WITH_AES_128_CBC_SHA (0x0033)
    Cipher Suite: TLS_RSA_WITH_AES_256_GCM_SHA384 (0x009d)
    Cipher Suite: TLS_RSA_WITH_AES_128_GCM_SHA256 (0x009c)
    Cipher Suite: TLS_RSA_WITH_AES_256_CBC_SHA256 (0x003d)
    Cipher Suite: TLS_RSA_WITH_AES_128_CBC_SHA256 (0x003c)
    Cipher Suite: TLS_RSA_WITH_AES_256_CBC_SHA (0x0035)
    Cipher Suite: TLS_RSA_WITH_AES_128_CBC_SHA (0x002f)
    Cipher Suite: TLS_EMPTY_RENEGOTIATION_INFO_SCSV (0x00ff)

Wireshark shows this response from the server:

TLSv1.2 Record Layer: Alert (Level: Fatal, Description: Handshake Failure)
    Content Type: Alert (21)
    Version: TLS 1.2 (0x0303)
    Length: 2
    Alert Message
        Level: Fatal (2)
        Description: Handshake Failure (40)

It comes immediately after the TLS Client Hello message.
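
One way to narrow this down is to attempt the handshake with pyOpenSSL directly, outside Scrapy and Twisted: if that succeeds while the spider fails, the problem lies in how Twisted configures the client TLS context rather than in OpenSSL itself. A minimal sketch, not from the original report (the host is the one from this issue; everything else is an assumption):

import socket
from OpenSSL import SSL

ctx = SSL.Context(SSL.TLSv1_2_METHOD)          # plain client context with library defaults
sock = socket.create_connection(("www.skelbiu.lt", 443))
conn = SSL.Connection(ctx, sock)
conn.set_tlsext_host_name(b"www.skelbiu.lt")   # send SNI, as a browser would
conn.set_connect_state()
conn.do_handshake()                            # raises SSL.Error on a fatal alert
print(conn.get_cipher_name())                  # negotiated cipher, if the handshake succeeded
sock.close()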

kmike commented 7 years ago

@redapple is the man who knows everything about such issues, but have you tried setting a different DOWNLOADER_CLIENT_TLS_METHOD option value?
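
For reference, a rough sketch of how that setting can be tried in a project's settings.py; 'TLS' is the default and negotiates the highest version both sides support, while the other accepted values force a specific method:

# settings.py
# Accepted values: 'TLS' (default), 'TLSv1.0', 'TLSv1.1', 'TLSv1.2', 'SSLv3'
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'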

povilasb commented 7 years ago

Unfortunately, changing TLS version does not help.

redapple commented 7 years ago

I think you're on the right track with cipher suites. Did you compare ClientHello requests for success and failure cases? I cannot reproduce it with that URL but I have an older openssl. I'll try and use a more recent one tomorrow.

povilasb commented 7 years ago

How do you make Scrapy/Python choose a specific OpenSSL version?

redapple commented 7 years ago

I haven't tried it yet myself but I believe you can use https://cryptography.io/en/latest/installation/#static-wheels

I was planning on using a Debian 9 Sid Docker image.
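
Either way, it is easy to confirm which OpenSSL a given virtualenv actually links against; a quick check, assuming pyOpenSSL is installed (Scrapy requires it anyway):

import OpenSSL
from OpenSSL import SSL

print(OpenSSL.__version__)                     # pyOpenSSL version, e.g. 17.0.0
print(SSL.SSLeay_version(SSL.SSLEAY_VERSION))  # linked OpenSSL, e.g. b'OpenSSL 1.1.0e  16 Feb 2017'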

redapple commented 7 years ago

Alright, I just tried https://github.com/scrapy/scrapy/issues/2717#issuecomment-297404774 and I was able to reproduce the issue:

$ scrapy version -v
Scrapy    : 1.3.3
lxml      : 3.7.3.0
libxml2   : 2.9.3
cssselect : 1.0.1
parsel    : 1.1.0
w3lib     : 1.17.0
Twisted   : 17.1.0
Python    : 2.7.12+ (default, Sep 17 2016, 12:08:02) - [GCC 6.2.0 20160914]
pyOpenSSL : 17.0.0 (OpenSSL 1.1.0e  16 Feb 2017)
Platform  : Linux-4.8.0-49-generic-x86_64-with-Ubuntu-16.10-yakkety

$ cat testssl.py
import scrapy

class FailingSpider(scrapy.Spider):
    name = 'Failing Spider'
    start_urls = ['https://www.skelbiu.lt/']

    def parse(self, response):
        pass

$ scrapy runspider testssl.py 
2017-04-26 15:45:18 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-04-26 15:45:18 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
(...)
2017-04-26 15:45:18 [scrapy.core.engine] INFO: Spider opened
2017-04-26 15:45:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-26 15:45:19 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-26 15:45:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.skelbiu.lt/> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]>]
2017-04-26 15:45:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.skelbiu.lt/> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]>]
2017-04-26 15:45:19 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.skelbiu.lt/> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]>]
2017-04-26 15:45:19 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.skelbiu.lt/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]>]
2017-04-26 15:45:19 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-26 15:45:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
 'downloader/request_bytes': 636,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 26, 13, 45, 19, 855881),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2017, 4, 26, 13, 45, 19, 1654)}
2017-04-26 15:45:19 [scrapy.core.engine] INFO: Spider closed (finished)

redapple commented 7 years ago

For the record, I've collected .pcap files and expanded ClientHello messages for Scrapy and for the OpenSSL 1.0.2g and 1.1.0e clients in https://github.com/redapple/scrapy-issues/tree/master/2717

I'm leaning towards something to do with Elliptic Curves. I'll keep you updated.

redapple commented 7 years ago

Yeah, it looks like an EC (elliptic curve) thing.

Now, I'll have a look at how to properly configure this with Twisted Agent.

redapple commented 7 years ago

From what I see on https://www.ssllabs.com/ssltest/analyze.html?d=www.skelbiu.lt&s=92.62.130.22&hideResults=on, the website indeed requires (at least?) "secp384r1", which I tested in https://github.com/scrapy/scrapy/issues/2717#issuecomment-297440829

By default, openssl 1.1.0e client sends:

                Elliptic curves (4 curves)
                    Elliptic curve: ecdh_x25519 (0x001d)
                    Elliptic curve: secp256r1 (0x0017)
                    Elliptic curve: secp521r1 (0x0019)
                    Elliptic curve: secp384r1 (0x0018)

but Scrapy 1.3.3/Twisted 17.1 with OpenSSL 1.1.0e only sends:

                Elliptic curves (1 curve)
                    Elliptic curve: secp256r1 (0x0017)

The code in Twisted that sets _defaultCurveName = u"prime256v1" was apparently added 3 years ago. Maybe OpenSSL now actually enforces that setting; I'm not sure.
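
The restriction is easy to confirm from a Python shell on an affected environment; a small check, assuming Twisted < 18.4 where the private _sslverify module still carries this attribute:

from twisted.internet import _sslverify

# The single curve Twisted pins its client TLS contexts to on affected versions.
print(_sslverify._defaultCurveName)  # 'prime256v1', i.e. secp256r1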

A couple of (non-exclusive) options:

redapple commented 7 years ago

fyi, I've sent a message on Twisted Web mailing list: https://twistedmatrix.com/pipermail/twisted-web/2017-April/005293.html

redapple commented 6 years ago

I just tested with Twisted 17.5.0rc2 and this does NOT look fixed.

felixonmars commented 6 years ago

For me the issue is https://bugs.python.org/issue29697

The patch postdates all current stable Python releases (so no release includes it yet), and the same error occurs with urllib2.urlopen on Python 2.7 here. Applying the patch from that issue fixes it for me.

redapple commented 6 years ago

Twisted bug: https://twistedmatrix.com/trac/ticket/9210 (I had not opened it at the time)

jsakars commented 6 years ago

I'm having the same issue with the following versions:

Scrapy    : 1.4.0
lxml      : 3.8.0.0
libxml2   : 2.9.4
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.17.0
Twisted   : 17.5.0
Python    : 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
pyOpenSSL : 17.1.0 (OpenSSL 1.1.0f  25 May 2017)
Platform  : Darwin-16.6.0-x86_64-i386-64bit

Is there a workaround?

redapple commented 6 years ago

@werdlv, I don't know of a workaround. Can you comment on which website is showing this failure? (to check whether it's indeed related to OpenSSL 1.1 with Twisted)

jsakars commented 6 years ago

@redapple sure. At least these are giving the SSL error:

  1. https://www.cvbankas.lt/
  2. https://www.skelbiu.lt/

Here are some that are working without errors:

  1. https://www.cvmarket.lt/
  2. https://www.alio.lt/
  3. https://cvzona.lt/

redapple commented 6 years ago

Thanks @werdlv. So it appears that https://www.skelbiu.lt/ and https://www.cvbankas.lt/ are served by the same machines, 92.62.130.22 and 92.62.130.23. https://www.skelbiu.lt/ is the host in this very issue (https://github.com/scrapy/scrapy/issues/2717#issue-224196154).

tonal commented 6 years ago

The same error also occurs on https://www.teplodvor.ru/

tonal commented 6 years ago

see also #2944

redapple commented 6 years ago

Right, @tonal. https://www.teplodvor.ru/ does not look compatible with OpenSSL 1.1 (some weak ciphers were removed). Downgrading to cryptography<2, which ships with OpenSSL 1.0.2 (at least for me on Ubuntu), makes it work.

sulangsss commented 6 years ago

@redapple I have run pip install --upgrade 'cryptography<2', but it does not work.

url: https://www.archdaily.com

Scrapy    : 1.4.0
lxml      : 4.1.1.0
libxml2   : 2.9.7
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.18.0
Twisted   : 17.9.0
Python    : 3.6.3 (default, Oct 24 2017, 14:48:20) - [GCC 7.2.0]
pyOpenSSL : 17.5.0 (OpenSSL 1.1.0g 2 Nov 2017)
Platform  : Linux-4.9.66-1-MANJARO-x86_64-with-arch-Manjaro-Linux

2017-12-10 16:14:21 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.archdaily.com> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]>]
2017-12-10 16:14:26 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.archdaily.com> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]>]
2017-12-10 16:14:27 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.archdaily.com> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]>]
2017-12-10 16:14:27 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.archdaily.com>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]>]

raphapassini commented 6 years ago

@sulangsss it seems you are still using OpenSSL 1.1.0 (pyOpenSSL : 17.5.0 (OpenSSL 1.1.0g 2 Nov 2017)); try installing OpenSSL 1.0.x.

Laruxo commented 6 years ago

I just installed Twisted==18.4.0rc1 and www.skelbiu.lt seems to work for me.

Gallaecio commented 4 years ago

Closing since this has been fixed in Twisted 18.4.0.

cpatulea commented 4 years ago

I'm experiencing this in Ubuntu 18.04 (Twisted 17.9.0, OpenSSL 1.1.1). I cannot update to newer packages, but I do control my entire application. I've made this workaround in my main file, after imports:

from twisted.internet import _sslverify
def _raise(_):
  # Raising NotImplementedError makes Twisted skip pinning the single ECDH curve
  # (prime256v1), so OpenSSL's default curve list is offered instead.
  raise NotImplementedError()
_sslverify._OpenSSLECCurve = _raise

This should probably be used only as a last resort if libraries cannot be updated.
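
For completeness, here is a hypothetical sketch of how such a patch could be wired into a standalone run script, using the testssl.py spider from the reproduction earlier in this thread and a plain CrawlerProcess (neither is part of the original comment). The patch has to run before any HTTPS download is attempted:

# run.py
from twisted.internet import _sslverify

def _raise(_):
    raise NotImplementedError()

_sslverify._OpenSSLECCurve = _raise  # same workaround as above

from scrapy.crawler import CrawlerProcess
from testssl import FailingSpider  # spider file from the reproduction above

process = CrawlerProcess()
process.crawl(FailingSpider)
process.start()  # blocks until the crawl finishes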

iamarifdev commented 2 years ago


The workaround above works for me on version 1.4.0.