scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

OpenSSLError 'unexpected eof while reading' openssl #5835

Closed necronet closed 1 year ago

necronet commented 1 year ago

Description

Hi, I have been getting an error when trying to run scrapy shell on a site. Unfortunately, after trying to figure it out, I have not been able to find the root cause. Here is the error I get:

  File "/Users/joseayerdis/.pyenv/versions/3.10.4/lib/python3.10/site-packages/twisted/internet/threads.py", line 119, in blockingCallFromThread
    result.raiseException()
  File "/Users/joseayerdis/.pyenv/versions/3.10.4/lib/python3.10/site-packages/twisted/python/failure.py", line 475, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'unexpected eof while reading')]>]

Steps to Reproduce

  1. Run scrapy shell https://property-nicaragua.com/listing/casa-wahoo-above-the-surf/

Expected behavior: Should return HTTP response with webpage data

Actual behavior: <twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'unexpected eof while reading')]>

Reproduces how often: Every time

Versions

Scrapy       : 2.5.0
lxml         : 4.6.3.0
libxml2      : 2.9.13
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.7.0
Python       : 3.10.4 (main, Jan 16 2023, 23:43:37) [Clang 14.0.0 (clang-1400.0.29.202)]
pyOpenSSL    : 22.0.0 (OpenSSL 3.0.5 5 Jul 2022)
cryptography : 37.0.4
Platform     : macOS-13.1-arm64-arm-64bit

Additional context

Other sites work correctly; the issue arises only on this site, so it's possible that it is an anti-crawler feature.
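
One way to narrow this down is to attempt the same TLS handshake outside Scrapy. The sketch below uses only Python's standard `ssl` module; `probe_tls` is a hypothetical helper name, not part of Scrapy or Twisted. Note the assumption involved: the standard library may be linked against a different OpenSSL build than the one pyOpenSSL and cryptography use, so a successful handshake here does not rule out a problem in the Scrapy stack.

```python
import socket
import ssl


def probe_tls(host, port=443, timeout=10.0):
    """Try a TLS handshake with host:port using the stdlib ssl module.

    Returns (True, "<protocol> <cipher>") if the handshake completes,
    or (False, "<ExceptionName>: <message>") if anything fails.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return True, f"{tls.version()} {tls.cipher()[0]}"
    except OSError as exc:
        # ssl.SSLError, timeouts and refused connections are all OSError subclasses.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling `probe_tls("property-nicaragua.com")` reports either the negotiated protocol and cipher or the name of the exception that interrupted the handshake, which can then be compared with what Scrapy reports.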

Gallaecio commented 1 year ago

Could you try with Scrapy 2.8?

necronet commented 1 year ago

> Could you try with Scrapy 2.8?

I ran into the same issue after upgrading:

Scrapy       : 2.8.0
lxml         : 4.6.3.0
libxml2      : 2.9.13
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.7.0
Python       : 3.10.4 (main, Jan 16 2023, 23:43:37) [Clang 14.0.0 (clang-1400.0.29.202)]
pyOpenSSL    : 22.0.0 (OpenSSL 3.0.5 5 Jul 2022)
cryptography : 37.0.4
Platform     : macOS-13.1-arm64-arm-64bit

Gallaecio commented 1 year ago

I could not reproduce the issue with a fresh install. Maybe you need to upgrade additional deps? (e.g. cryptography, pyOpenSSL)

$ rm -rf venv/
$ python3 -m venv venv
$ . venv/bin/activate
$ pip install scrapy
[…]
Successfully installed Automat-22.10.0 PyDispatcher-2.0.7 Twisted-22.10.0 attrs-22.2.0 certifi-2022.12.7 cffi-1.15.1 charset-normalizer-3.0.1 constantly-15.1.0 cryptography-39.0.1 cssselect-1.2.0 filelock-3.9.0 hyperlink-21.0.0 idna-3.4 incremental-22.10.0 itemadapter-0.7.0 itemloaders-1.0.6 jmespath-1.0.1 lxml-4.9.2 packaging-23.0 parsel-1.7.0 protego-0.2.1 pyOpenSSL-23.0.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 pycparser-2.21 queuelib-1.6.2 requests-2.28.2 requests-file-1.5.1 scrapy-2.8.0 service-identity-21.1.0 six-1.16.0 tldextract-3.4.0 typing-extensions-4.5.0 urllib3-1.26.14 w3lib-2.1.1 zope.interface-5.5.2
[…]
$ scrapy shell https://property-nicaragua.com/listing/casa-wahoo-above-the-surf/
[…]
2023-02-25 08:01:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://property-nicaragua.com/listing/casa-wahoo-above-the-surf/> (referer: None)
[…]
>>>

necronet commented 1 year ago

> I could not reproduce the issue with a fresh install. Maybe you need to upgrade additional deps? (e.g. cryptography, pyOpenSSL)

Thanks for the follow-up, @Gallaecio. I did a fresh install (I'm using pyenv for version management), but the problem persists. As you can see below, I even upgraded the Python version to get completely fresh packages.

Scrapy       : 2.8.0
lxml         : 4.9.2.0
libxml2      : 2.10.3
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.7.0
Python       : 3.11.1 (main, Feb 26 2023, 11:38:14) [Clang 14.0.0 (clang-1400.0.29.202)]
pyOpenSSL    : 22.0.0 (OpenSSL 3.0.5 5 Jul 2022)
cryptography : 37.0.4
Platform     : macOS-13.1-arm64-arm-64bit
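
The list above still shows Twisted 21.7.0, pyOpenSSL 22.0.0 and cryptography 37.0.4, whereas the fresh install earlier in this thread pulled Twisted 22.10.0, pyOpenSSL 23.0.0 and cryptography 39.0.1. A minimal sketch for spotting such gaps follows; the helper names (`as_tuple`, `outdated`, `installed_versions`) are made up for illustration, the reference versions are simply copied from that pip output, and the version parsing only handles plain dotted numbers.

```python
from importlib import metadata

# Versions taken from the fresh-install pip output earlier in this thread.
REFERENCE = {
    "Scrapy": "2.8.0",
    "Twisted": "22.10.0",
    "pyOpenSSL": "23.0.0",
    "cryptography": "39.0.1",
    "parsel": "1.7.0",
    "w3lib": "2.1.1",
}


def as_tuple(version):
    """Turn '22.10.0' into (22, 10, 0); non-numeric parts are ignored."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def outdated(installed, reference=REFERENCE):
    """Map each missing or older package to (installed version, reference version)."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in reference.items()
        if installed.get(name) is None
        or as_tuple(installed[name]) < as_tuple(wanted)
    }


def installed_versions(names=REFERENCE):
    """Look up the versions installed in the current environment."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # Not installed; outdated() will report it as missing.
    return found
```

Running `outdated(installed_versions())` inside the environment that fails lists the packages that are older than the ones in the working install. Whether any of them is actually responsible for the error is not established here.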

Here is the full stack trace of the error:

Traceback (most recent call last):
  File "/Users/joseayerdis/.pyenv/versions/3.11.1/bin/scrapy", line 8, in <module>
    sys.exit(execute())
             ^^^^^^^^^
  File "/Users/joseayerdis/.pyenv/versions/3.11.1/lib/python3.11/site-packages/scrapy/cmdline.py", line 158, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Users/joseayerdis/.pyenv/versions/3.11.1/lib/python3.11/site-packages/scrapy/cmdline.py", line 111, in _run_print_help
    func(*a, **kw)
  File "/Users/joseayerdis/.pyenv/versions/3.11.1/lib/python3.11/site-packages/scrapy/cmdline.py", line 166, in _run_command
    cmd.run(args, opts)
  File "/Users/joseayerdis/.pyenv/versions/3.11.1/lib/python3.11/site-packages/scrapy/commands/shell.py", line 84, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/Users/joseayerdis/.pyenv/versions/3.11.1/lib/python3.11/site-packages/scrapy/shell.py", line 44, in start
    self.fetch(url, spider, redirect=redirect)
  File "/Users/joseayerdis/.pyenv/versions/3.11.1/lib/python3.11/site-packages/scrapy/shell.py", line 119, in fetch
    response, spider = threads.blockingCallFromThread(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joseayerdis/.pyenv/versions/3.11.1/lib/python3.11/site-packages/twisted/internet/threads.py", line 119, in blockingCallFromThread
    result.raiseException()
  File "/Users/joseayerdis/.pyenv/versions/3.11.1/lib/python3.11/site-packages/twisted/python/failure.py", line 475, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'unexpected eof while reading')]>]

necronet commented 1 year ago

Today I ran the command outside my project root; there has got to be a setting that is messing with the SSL handshake.

I'm going to go ahead and close this issue. Once I figure out what is making the crawler fail, I'll let you know!

Thanks again for helping me figure this out

Edit 1:

Following up on this issue on StackOverflow.
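
For anyone chasing a similar project-level cause: Scrapy has a handful of settings that shape the TLS handshake. The sketch below lists them with their documented defaults and reports which ones a given settings mapping overrides. `tls_overrides` is a hypothetical helper, and the thread does not say which setting (if any) was at fault, so treat this as a starting point rather than a diagnosis.

```python
# TLS-related Scrapy settings and their documented default values.
TLS_DEFAULTS = {
    "DOWNLOADER_CLIENTCONTEXTFACTORY": (
        "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
    ),
    "DOWNLOADER_CLIENT_TLS_METHOD": "TLS",
    "DOWNLOADER_CLIENT_TLS_CIPHERS": "DEFAULT",
    "DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING": False,
}


def tls_overrides(settings):
    """Return the TLS-related settings whose value differs from the default.

    `settings` can be any mapping, e.g. a plain dict or a Scrapy Settings object.
    """
    return {
        name: settings[name]
        for name, default in TLS_DEFAULTS.items()
        if name in settings and settings[name] != default
    }
```

From inside the project root, something like `tls_overrides(scrapy.utils.project.get_project_settings())` should show what the project changes. Setting `DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING = True` makes Scrapy log details of the negotiated connection, which may also help.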

jfzlma commented 1 month ago

I'm still facing this issue, not just with one specific URL but also with random URLs.

cryptography        38.0.4
Scrapy              2.5.0
pyOpenSSL           22.0.0

I'm behind a proxy.

Gallaecio commented 1 month ago

It only makes sense to reopen the ticket if you are facing the issue with the latest version of both Scrapy and deps, which is not the case.