scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
51.16k stars 10.35k forks

SSL handshake failure #2424

Closed briehanlombaard closed 6 years ago

briehanlombaard commented 7 years ago

Hi,

I'm getting a handshake error for the sites listed below:

2016-12-03 00:02:19 [scrapy] ERROR: Error downloading <GET https://apnews.com>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2016-12-03 00:03:25 [scrapy] ERROR: Error downloading <GET https://techcrunch.com>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2016-12-03 00:03:53 [scrapy] ERROR: Error downloading <GET https://medium.com>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'ssl handshake failure')]>]
2016-12-03 00:05:08 [scrapy] ERROR: Error downloading <GET https://theintercept.com>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2016-12-03 00:06:32 [scrapy] ERROR: Error downloading <GET https://www.opendemocracy.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'ssl handshake failure')]>]
2016-12-03 00:07:55 [scrapy] ERROR: Error downloading <GET https://www.rt.com>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2016-12-03 00:19:53 [scrapy] ERROR: Error downloading <GET https://www.thestar.com>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'ssl handshake failure')]>]
2016-12-03 00:58:42 [scrapy] ERROR: Error downloading <GET https://www.cnet.com/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_read_bytes', 'ssl handshake failure')]>]

What's strange is that each of those sites works if I try it individually using scrapy shell, so I might be doing something wrong.

Here's some information about my environment:

$ scrapy version -v
Scrapy    : 1.2.1
lxml      : 3.6.4.0
libxml2   : 2.9.4
Twisted   : 16.6.0
Python    : 2.7.12 (default, Jul  1 2016, 15:12:24) - [GCC 5.4.0 20160609]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.2g-fips  1 Mar 2016)
Platform  : Linux-3.13.0-52-generic-x86_64-with-Ubuntu-16.04-xenial

Any ideas where I can look to troubleshoot the problem?

redapple commented 7 years ago

@briehanlombaard , that's strange indeed. Are you using any HTTP proxies in your config? Did it start happening all of a sudden?

briehanlombaard commented 7 years ago

@redapple I'm not using any proxies and it's been happening consistently for a while. I'll see if I can reproduce it in an isolated environment.

zhuo2015 commented 7 years ago

Scrapy    : 1.3.2
lxml      : 3.7.3.0
libxml2   : 2.9.4
cssselect : 1.0.1
parsel    : 1.1.0
w3lib     : 1.17.0
Twisted   : 17.1.0
Python    : 3.5.2 |Anaconda custom (64-bit)| (default, Jul 5 2016, 11:41:13)
pyOpenSSL : 16.2.0 (OpenSSL 1.0.2k 26 Jan 2017)
Platform  : Windows-8-6.2.9200-SP0

I'm hitting this issue too, without any proxies. After running normally for a while, it started happening consistently.

dansmachina commented 6 years ago

Same here. I've been getting the same error trying to connect to a Twitter URL. I'm not using any proxies or any other special configuration. I've tried all the workarounds I've found that involve generating a CustomContextFactory, but still no luck.

The weird part is that it only happens sometimes: I'm scraping a bunch of URLs and some of them fail, but not all. Maybe it's something related to the certificate, but if I take the URL and paste it into the browser, it works without any problems.

twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]

I'm running the scrapy project on my personal computer with this configuration:

And also inside a python:3 docker image.

Any idea? :cry:

skizoforme commented 6 years ago

I have the same problem. I start on a page with a list of links, and the spider then follows all of them. The problem is that for some of them (not all) I get this error:

2017-06-06 17:04:31 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www[...]> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2017-06-06 23:00:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www[...]> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2017-06-11 18:02:44 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www[...]> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2017-06-11 18:03:09 [scrapy.core.scraper] ERROR: Error downloading <GET https://www[...]>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]

This error appears only for some pages (180 of 290), and for those pages all the retries fail too, until "ERROR: Error downloading...." finally appears. The rest of the pages are crawled correctly on the first attempt. That is, each page is either crawled correctly on the first try or fails on every attempt, but if I run the spider again, the set of failing pages changes.

Furthermore, there is no point after which all pages fail; some fail and some don't.

I don't know if I am being banned or if there is another reason.

My configuration:

And in the settings I have customized the following:

USER_AGENT = (
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/58.0.3029.110 '
    'Chrome/58.0.3029.110 Safari/537.36')
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 10
redapple commented 6 years ago

@skizoforme , you'll need to provide URLs for which you are getting these handshake failures. There's no easy way to determine what might cause this without network captures (Wireshark, tcpdump). You can also try with openssl s_client -connect www.somedomain.com:443 -servername www.somedomain.com, and if that works, then scrapy/twisted is configuring something wrong.
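If openssl s_client isn't handy, a similar check can be done with Python's standard library alone. The sketch below is mine, not from the thread (the tls_probe name and return shape are illustrative); the idea is to see whether a plain stdlib handshake succeeds where Scrapy/Twisted fails:

```python
import socket
import ssl

def tls_probe(host: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Attempt a TLS handshake against host:port and report what was
    negotiated, roughly mirroring `openssl s_client -connect host:443
    -servername host` (SNI included via server_hostname)."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # wrap_socket performs the handshake; an ssl.SSLError raised
        # here is the stdlib's equivalent of the "ssl handshake
        # failure" in the Scrapy logs above
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return {"protocol": tls.version(), "cipher": tls.cipher()[0]}
```

If this probe succeeds against a host that Scrapy cannot reach, the suspect is the context that Twisted/pyOpenSSL builds, not the server itself.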

skizoforme commented 6 years ago

@redapple I have been capturing the network with Wireshark, and when the error appears I see the following:

[Wireshark screenshot]

so the server seems to be rejecting my requests. I will try to reduce the number of requests.
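To test the "server rejects me under load" hypothesis, Scrapy's standard throttling settings can be tightened. This is a sketch; all the setting names are standard Scrapy settings, but the values are illustrative, not recommendations:

```python
# settings.py -- reduce request pressure to see if the resets stop;
# values are illustrative only
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 10          # fixed delay between requests to a domain
AUTOTHROTTLE_ENABLED = True  # back off automatically when latency rises
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
RETRY_TIMES = 5              # give flaky connections a few more attempts
```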

redapple commented 6 years ago

@skizoforme , so the server is closing its side of the TCP connection after the handshake finished? If so, that's not an SSL handshake failure. (It's hard to see in your screen capture. If you can post the .pcap file somewhere for download, it would be easier to check.)

skizoforme commented 6 years ago

@redapple, the file is here:

https://drive.google.com/file/d/0By4W-doeT7shcm9PamF1d0RqR3M/view

The error appears at 275.8 seconds.

redapple commented 6 years ago

Thanks @skizoforme , the server seems to reply with a ServerHello, so the TLS handshake looks successful. (Although the capture file is missing some TCP segments.) I don't know what is causing this server to close the connection, and it may not be Scrapy related. If you can try openssl s_client -connect www.somedomain.com:443 -servername www.somedomain.com for whoever is at 104.150.35.32, that could help. In any case, this looks hard to investigate without reproducible scrapy or twisted code for this error. It's probably worth a separate issue.

skizoforme commented 6 years ago

I ran that command and didn't get any error. I have reviewed the scrapy logs: on the first run the error appeared after 20 hours, but now it starts to appear a few minutes in. Furthermore, I have been trying on another machine, and so far I haven't had any errors (unless I have overlooked something, the environments on the two machines are identical). So I think the IP of the first machine may be on a blacklist and its requests are heavily restricted. What do you think @redapple?

redapple commented 6 years ago

Could be, @skizoforme . Hard to say, really. At least it doesn't look like a TLS issue; it seems more like a higher-level layer is closing the connection.

skizoforme commented 6 years ago

I am going to try with the other machine, and if I get more information about the error I'll comment. Thanks for the help!

redapple commented 6 years ago

@skizoforme , did you figure this out?

skizoforme commented 6 years ago

Sorry @redapple for not answering earlier. I haven't. I have run the spider again and the error persists. Furthermore, when the errors started to appear I left that spider running and launched another one on the same machine, and it worked well (for one or two hours, until that spider started to fail too), so I have ruled out being banned. I still don't know what the problem is.

redapple commented 6 years ago

Thanks for the update

YPersonal commented 6 years ago

I also encountered the same problem... Do you have a good solution?

2017-09-07 11:02:04 [scrapy.core.scraper] ERROR: Error downloading <GET https://......>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_READ', 'ssl handshake failure')]>]
2017-09-07 11:02:05 [scrapy.core.scraper] ERROR: Error downloading <GET https://.....>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_READ', 'ssl handshake failure')]>]
2017-09-07 11:02:08 [scrapy.core.scraper] ERROR: Error downloading <GET https://......>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_READ', 'ssl handshake failure')]>]

redapple commented 6 years ago

@YPersonal , can you detail which versions of Scrapy, Twisted, pyOpenSSL, and cryptography you are using? (The output of scrapy version -v should provide all but the cryptography version.)

I was able to test with the URLs you initially posted and I got HTTP 200, and with these:

$ scrapy version -v
Scrapy    : 1.4.0
lxml      : 3.8.0.0
libxml2   : 2.9.3
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.18.0
Twisted   : 17.5.0
Python    : 3.6.2 (default, Aug 24 2017, 10:48:24) - [GCC 6.3.0 20170406]
pyOpenSSL : 17.2.0 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Linux-4.10.0-33-generic-x86_64-with-debian-stretch-sid

$ pip freeze
asn1crypto==0.22.0
attrs==17.2.0
Automat==0.6.0
cffi==1.10.0
constantly==15.1.0
cryptography==1.9
cssselect==1.0.1
hyperlink==17.3.1
idna==2.6
incremental==17.5.0
lxml==3.8.0
parsel==1.2.0
pyasn1==0.3.3
pyasn1-modules==0.1.1
pycparser==2.18
PyDispatcher==2.0.5
pyOpenSSL==17.2.0
queuelib==1.4.2
Scrapy==1.4.0
service-identity==17.0.0
six==1.10.0
Twisted==17.5.0
w3lib==1.18.0
zope.interface==4.4.2

You can also check whether openssl s_client -connect somedomain.com:443 -servername somedomain.com works correctly (that is, using OpenSSL's defaults out of the box, without Scrapy or Twisted in the way).

YPersonal commented 6 years ago

@redapple Thanks for your reply. I also checked it with openssl, as follows:

CONNECTED(00000003)
write:errno=104
no peer certificate available
No client certificate CA names sent
SSL handshake has read 0 bytes and written 269 bytes
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
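The write:errno=104 line in that output is telling. On Linux, errno 104 is ECONNRESET: the peer sent a TCP RST before a single byte came back ("read 0 bytes"), so the TLS handshake never even started, which points at a firewall or rate limiter rather than a cipher mismatch. A quick stdlib check of the errno mapping (Linux-specific; errno numbers vary by platform):

```python
import errno

# errno 104 == ECONNRESET on Linux: the remote end reset the
# TCP connection.  errno.errorcode maps numbers back to names.
print(errno.errorcode[104])  # → ECONNRESET (on Linux)
```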

YPersonal commented 6 years ago

@redapple scrapy version -v

Scrapy    : 1.4.0
lxml      : 3.3.5.0
libxml2   : 2.9.1
cssselect : 0.9.1
parsel    : 1.2.0
w3lib     : 1.17.0
Twisted   : 17.5.0
Python    : 2.7.5 (default, Nov 6 2016, 00:28:07) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
pyOpenSSL : 17.0.0 (OpenSSL 1.0.1e-fips 11 Feb 2013)
Platform  : Linux-3.10.0-514.21.1.el7.x86_64-x86_64-with-centos-7.3.1611-Core

redapple commented 6 years ago

So OpenSSL's client cannot connect either. This looks similar to https://serverfault.com/questions/807765/cannot-complete-ssl-handshake-with-one-server-from-gce-ubuntu-16-04-1-image-but I don't have any advice at this point, I'm afraid.

YPersonal commented 6 years ago

@redapple Thank you very much

cathalgarvey commented 6 years ago

Hi @YPersonal - This particular issue has gone stale, so I'll close it. But the SSL/TLS issue continues in other issues here, and in some cases it's caused by the evolution of the binary builds of PyOpenSSL and Cryptography for various platforms. Sometimes, ciphers get removed wholesale from these libraries as a security improvement for application-level users, and that affects use-cases where security and confidentiality are not as urgent, such as most web-scraping tasks.
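For readers landing here with the cipher-removal variant of this problem: Scrapy exposes settings to pin the TLS version and (in newer releases) to override the OpenSSL cipher string. The sketch below uses real setting names, but check the docs for your Scrapy version, since availability varies across releases, and the cipher string shown is just one commonly suggested example:

```python
# settings.py -- TLS knobs Scrapy exposes; availability varies by release

# Force a specific TLS version instead of negotiating (supported values
# include 'TLS', 'TLSv1.0', 'TLSv1.1', 'TLSv1.2'):
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'

# Newer Scrapy releases also let you override the OpenSSL cipher list,
# which can help when a binary pyOpenSSL/cryptography build dropped a
# cipher (or DH parameter size) the server insists on:
DOWNLOADER_CLIENT_TLS_CIPHERS = 'DEFAULT:!DH'
```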

Thank you for your contribution to Scrapy so far!

singhalhimanshu commented 5 years ago

Getting the same error. Website I need to crawl: ["https://www.labor.ny.gov/"]

Installed scrapy details:

Scrapy       : 1.6.0
lxml         : 4.2.5.0
libxml2      : 2.9.8
cssselect    : 1.0.3
parsel       : 1.5.1
w3lib        : 1.20.0
Twisted      : 18.7.0
Python       : 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL    : 19.0.0 (OpenSSL 1.0.2p 14 Aug 2018)
cryptography : 2.3.1
Platform     : Windows-10-10.0.17763-SP0

ERROR: Retrying <GET https://www.labor.ny.gov> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]