scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Scrapy chokes on HTTP response status lines without a Reason phrase #345

Closed: tonal closed this issue 6 years ago

tonal commented 10 years ago

Try fetch page:

$ scrapy fetch 'http://www.gidroprofmontag.ru/bassein/sbornue_basseynu'

output:

2013-07-11 09:15:37+0400 [scrapy] INFO: Scrapy 0.17.0-304-g3fe2a32 started (bot: amon)
/home/tonal/amon/amon/amon/downloadermiddleware/blocked.py:6: ScrapyDeprecationWarning: Module `scrapy.stats` is deprecated, use `crawler.stats` attribute instead
  from scrapy.stats import stats
2013-07-11 09:15:37+0400 [amon_ra] INFO: Spider opened
2013-07-11 09:15:37+0400 [amon_ra] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-11 09:15:37+0400 [amon_ra] ERROR: Error downloading <GET http://www.gidroprofmontag.ru/bassein/sbornue_basseynu>: [<twisted.python.failure.Failure <class 'scrapy.xlib.tx._newclient.ParseError'>>]
2013-07-11 09:15:37+0400 [amon_ra] INFO: Closing spider (finished)
2013-07-11 09:15:37+0400 [amon_ra] INFO: Dumping Scrapy stats:
        {'downloader/exception_count': 1,
         'downloader/exception_type_count/scrapy.xlib.tx._newclient.ResponseFailed': 1,
         'downloader/request_bytes': 256,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 7, 11, 5, 15, 37, 512010),
         'log_count/ERROR': 1,
         'log_count/INFO': 4,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2013, 7, 11, 5, 15, 37, 257898)}
2013-07-11 09:15:37+0400 [amon_ra] INFO: Spider closed (finished)
dangra commented 10 years ago

The HTTP parser doesn't like the reasonless status line in the response:

$ curl -sv http://www.gidroprofmontag.ru/bassein/sbornue_basseynu
> GET /bassein/sbornue_basseynu HTTP/1.1
> User-Agent: curl/7.27.0
> Host: www.gidroprofmontag.ru
> Accept: */*
> 
< HTTP/1.1 200
< Server: nginx
< Date: Thu, 11 Jul 2013 21:11:04 GMT
< Content-Type: text/html; charset=windows-1251
< Transfer-Encoding: chunked
< Connection: keep-alive
< Keep-Alive: timeout=5
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Pragma: no-cache
< Set-Cookie: PHPSESSID=a95f6bd4dd61c03de33e11c049e3e970; path=/
< Set-Cookie: Apache=190.135.189.59.933341373577064378; path=/; expires=Fri, 11-Jul-14 21:11:04 GMT
< 
* Closing connection #0
dangra commented 10 years ago

As a scraping framework, we should be able to download the page and ignore the status line bug

tonal commented 10 years ago

How to handle this error?

dangra commented 10 years ago

Extend or fix Twisted's HTTPClientParser so it doesn't discard the response.
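For illustration, a minimal sketch of what such an extended parser could look like (LenientHTTPClientParser is a hypothetical name; it assumes the parser rejects status lines that don't split into three parts):

from twisted.web._newclient import HTTPClientParser

class LenientHTTPClientParser(HTTPClientParser):
    def statusReceived(self, status):
        # 'HTTP/1.1 200' has only two space-separated parts; append a
        # trailing space so the base parser sees an empty reason phrase.
        if len(status.split(' ', 2)) == 2:
            status = status + ' '
        HTTPClientParser.statusReceived(self, status)

The hard part, as the discussion below shows, is getting Twisted's HTTP11ClientProtocol to actually use such a parser.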

Tony36051 commented 10 years ago

I'm a Scrapy user and I got scrapy.xlib.tx._newclient.ResponseFailed. In scrapy shell, parsing any URL gives the same error, with a traceback that seems to come from Twisted, so I guess Twisted may be the problem. I should "extend or fix twisted HTTPClientParser so it doesn't discard the response" as dangra said, BUT that may be TOO HARD for me, so I downgraded my Twisted from 13.1.0 to 11.0.0 and it works.

pablohoffman commented 10 years ago

are we gonna fix this one @dangra ?

dangra commented 10 years ago

@Tony36051: your problem is different; it was fixed in the Scrapy development branch and in the Scrapy 0.18.2 stable release. If not, create a new issue with a URL that easily reproduces it. Thanks.

@pablohoffman: yes, it happens that an extended HTTP parser can't be easily hooked into the Twisted HTTP11 client. Want to take a look and discuss the best approach?

I think the long-term option is to report the bug upstream and propose two things:

To access the parser from the Scrapy download handler we would have to go through:

Everything is easy except telling HTTP11ClientProtocol to use a different HTTPClientParser.

While writing this up I realized a non-monkeypatch solution: extend HTTP11ClientProtocol and use a property getter and setter for the HTTP11ClientProtocol._parser attribute; the setter converts the Twisted HTTPClientParser instance into our extended version. It's not pretty, but I can't see any better option. :)
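A rough sketch of that idea, assuming Twisted's HTTP11ClientProtocol keeps its parser in the private _parser attribute (private API, so this may break across Twisted versions); PatchedHTTP11ClientProtocol is a hypothetical name, and LenientHTTPClientParser is the hypothetical parser from the sketch above:

from twisted.web._newclient import HTTP11ClientProtocol, HTTPClientParser

class PatchedHTTP11ClientProtocol(HTTP11ClientProtocol):
    __parser = None  # backing storage for the _parser property

    @property
    def _parser(self):
        return self.__parser

    @_parser.setter
    def _parser(self, parser):
        if isinstance(parser, HTTPClientParser):
            # Convert the stock parser instance into the extended version;
            # safe only if the subclass adds no new instance state.
            parser.__class__ = LenientHTTPClientParser
        self.__parser = parser

Every assignment Twisted makes to self._parser now goes through the setter, which swaps in the extended class.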

Tony36051 commented 10 years ago

I was being careless: I was using a global proxy and had run out of quota. In the end everything works. Thank you again. Tony

kbourgoin commented 10 years ago

Hey there. We hit this problem recently -- is there a fix in the works?

dangra commented 10 years ago

@kbourgoin: the furthest we got is the description of a possible solution in https://github.com/scrapy/scrapy/issues/345#issuecomment-23914813

Tony36051 commented 10 years ago

I may not be able to offer much help, but I found my problem: I had set a global proxy and ran out of my data allowance (sorry for my poor English). After setting the network up again properly, my scrapy program works well too. To find out what was wrong with Twisted, I used urlopen (a Python function) to test downloading things from plain Python, and what I got was just the error page from my proxy. In a word, my problem resulted from a wrong global proxy config. Best wishes, Tony

BenedictKing commented 10 years ago

I recently solved this problem by using Twisted 11.0.0 with Scrapy 0.20. Thanks for the tip from @Tony36051.

kmike commented 10 years ago

Is there a way to reproduce this? I've tried different Twisted versions (13.2.0, 13.1.0, 10.2.0) and different Scrapy versions (0.18.4, 0.22.2, Scrapy master), and scrapy fetch works fine. Maybe the website changed. I'm not sure I've understood @dangra's comment about the reasonless status line. Here is the current curl output:

(scraping)kmike ~/scrap > curl -sv http://www.gidroprofmontag.ru/bassein/sbornue_basseynu | head
* Adding handle: conn: 0x7fd56c004000
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7fd56c004000) send_pipe: 1, recv_pipe: 0
* About to connect() to www.gidroprofmontag.ru port 80 (#0)
*   Trying 89.111.176.172...
* Connected to www.gidroprofmontag.ru (89.111.176.172) port 80 (#0)
> GET /bassein/sbornue_basseynu HTTP/1.1
> User-Agent: curl/7.30.0
> Host: www.gidroprofmontag.ru
> Accept: */*
> 
< HTTP/1.1 200 OK
* Server nginx is not blacklisted
< Server: nginx
< Date: Thu, 24 Apr 2014 17:42:15 GMT
< Content-Type: text/html; charset=windows-1251
< Transfer-Encoding: chunked
< Connection: keep-alive
< Keep-Alive: timeout=5
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Set-Cookie: PHPSESSID=4e95cb26606029b00725f7d4c631f974; path=/
< Set-Cookie: Apache=176.215.38.50.1398361334922150; path=/; expires=Fri, 24-Apr-15 17:42:14 GMT
< 
{ [data not shown]
<html>
<head>
dangra commented 10 years ago

The response's first line was "HTTP/1.1 200"; it lacked the "OK" string.
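That is exactly what trips the parser: the status line is split into three space-separated parts (version, code, reason phrase), and a missing reason leaves only two, hence the 'wrong number of parts' ParseError. A quick illustration:

>>> 'HTTP/1.1 200 OK'.split(' ', 2)
['HTTP/1.1', '200', 'OK']
>>> 'HTTP/1.1 200'.split(' ', 2)  # reason phrase missing: only two parts
['HTTP/1.1', '200']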

kmike commented 10 years ago

ah, I see

tonal commented 10 years ago

My monkey patch as a workaround:

def _monkey_patching_HTTPClientParser_statusReceived():
  """
  Monkey patch for scrapy.xlib.tx._newclient.HTTPClientParser.statusReceived
  to work around the error when the status line arrives without "OK" at the end.
  """
  from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError
  old_sr = HTTPClientParser.statusReceived
  def statusReceived(self, status):
    try:
      return old_sr(self, status)
    except ParseError as e:
      # Retry with a reason phrase appended: "HTTP/1.1 200" -> "HTTP/1.1 200 OK"
      if e.args[0] == 'wrong number of parts':
        return old_sr(self, status + ' OK')
      raise
  statusReceived.__doc__ = old_sr.__doc__
  HTTPClientParser.statusReceived = statusReceived
onbjerg commented 9 years ago

Where do we put the monkey patch? @tonal

tonal commented 9 years ago

Call the monkey patch before the first request starts, for example in the __init__ method of your spider, or in __init__.py, as sketched below.
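For instance, a minimal sketch (MySpider is a placeholder, and the patch function from the comment above is assumed to be in scope):

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Apply the patch once, before the first request is issued.
        _monkey_patching_HTTPClientParser_statusReceived()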

onbjerg commented 9 years ago

Thank you very much @tonal, it worked like a charm :+1:

lbsweek commented 9 years ago

I got this error message when using a VPN proxy. I captured the traffic with Wireshark and found there is no response at all. It is fine when I stop the VPN proxy.

dangra commented 9 years ago

@lbsweek what do you mean by "no response"? an empty reply without even a first line?

dangra commented 9 years ago

After the failed attempt to fix this issue in #1140, I think the only viable approach is a monkeypatch similar to what @tonal proposes in https://github.com/scrapy/scrapy/issues/345#issuecomment-41649779

tonal commented 8 years ago

On the latest Scrapy Ubuntu package (0.25.0-454-gfa1039f+1429829085) I receive similar errors:

$ scrapy fetch http://only.ru/catalog/electro_oven/hiddenheater/
...
2015-06-05 12:39:39.5292+0600 [amon_ra] INFO: Spider opened
2015-06-05 12:39:39.7123+0600 [amon_ra] ERROR: Error downloading <GET http://only.ru/catalog/electro_oven/hiddenheater/>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-06-05 12:39:39.7137+0600 [amon_ra] INFO: Closing spider (finished)

My monkey patch works only for scrapy.xlib.tx._newclient.ParseError, but I receive twisted.web._newclient.ParseError.

How can I patch this correctly?

dangra commented 8 years ago

My monkey patch works only for scrapy.xlib.tx._newclient.ParseError, but I receive twisted.web._newclient.ParseError. How can I patch this correctly?

Monkey patch both.

For more info on when Scrapy uses one or the other see https://github.com/scrapy/scrapy/blob/master/scrapy/xlib/tx/__init__.py.
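A sketch that generalizes @tonal's patch over both import paths (it assumes both modules fail with the same 'wrong number of parts' message; if scrapy.xlib.tx merely re-exports the Twisted module, the second pass just re-wraps the first, which is harmless):

import twisted.web._newclient
import scrapy.xlib.tx._newclient

def _patch_reasonless_status(module):
    # Wrap module.HTTPClientParser.statusReceived to retry with ' OK' appended.
    old_sr = module.HTTPClientParser.statusReceived
    def statusReceived(self, status):
        try:
            return old_sr(self, status)
        except module.ParseError as e:
            if e.args[0] == 'wrong number of parts':
                return old_sr(self, status + ' OK')
            raise
    module.HTTPClientParser.statusReceived = statusReceived

for mod in (twisted.web._newclient, scrapy.xlib.tx._newclient):
    _patch_reasonless_status(mod)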

sunhaowen commented 8 years ago

I added tonal's monkey patch, but I still receive the same error.

2015-12-07 23:47:19 [scrapy] DEBUG: Retrying GET https://api.octinn.com/partner/baidu3600/strategy_list/1 (failed 1 times): [twisted.python.failure.Failure twisted.web._newclient.ParseError: (u'wrong number of parts', 'HTTP/1.1 200')]

Edit (added Dec 8):

I have already solved it. I found that I was not using scrapy.xlib.tx._newclient, but twisted.web._newclient. So I changed

from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError

to

from twisted.web._newclient import HTTPClientParser, ParseError

and it works now ~

liuwwei3 commented 8 years ago

@sunhaowen Great thx, it works!

P.S. Are you the rescuer sent by the monkey (patch)?

leearic commented 8 years ago

There are actually so many Chinese speakers here.

redapple commented 7 years ago

I just found out about https://twistedmatrix.com/trac/ticket/7673. The Twisted team is not ready to fix it unless someone can point to a real webserver in the wild that does this.

kmike commented 7 years ago

@redapple it could also be a bad proxy, not a bad server

redapple commented 7 years ago

True.

rmax commented 7 years ago

Here is a live example at this time:

$ curl -v "http://www.jindai.com.tw/"
> GET / HTTP/1.1
> Host: www.jindai.com.tw
> User-Agent: Mozilla/5.1 (MSIE; YB/9.5.1 MEGAUPLOAD 1.0)
> Accept: */*
> Referer:
>
< HTTP/1.1 200
< Status: 200
< Connection: close
lopuhin commented 7 years ago

Another example is the 404 and 302 responses from OkCupid (its 200 pages do include "OK"):

$ curl -v https://www.okcupid.com/interests
> GET /interests HTTP/1.1
> Host: www.okcupid.com
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 404
rmax commented 7 years ago

It seems that this case is common with, for example, custom nginx modules that set only the response status code and no reason phrase.

rmax commented 7 years ago

Twisted has a patch ready to fix this issue: https://twistedmatrix.com/trac/ticket/7673#comment:5 (PR https://github.com/twisted/twisted/pull/723) 🎉

kmike commented 6 years ago

Fixed in Twisted 17.5.0.

Example websites from this ticket work for me with Scrapy 1.4.0 and Twisted 17.5.0, so I'm closing it. Thanks everyone!

mohanbe commented 5 years ago

Basically, Scrapy ignores 404 errors by default; this is defined in the httperror middleware.

So, add HTTPERROR_ALLOW_ALL = True to your settings file.

After this you can access response.status in your parse function, as in the sketch below.
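A minimal sketch of both pieces (names are illustrative):

# settings.py
HTTPERROR_ALLOW_ALL = True

# in your spider: non-2xx responses now reach the callback instead of
# being filtered by the HttpError middleware
def parse(self, response):
    if response.status == 404:
        self.logger.warning('Got 404 for %s', response.url)
        return
    # ... normal parsing for successful responses ...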

lycanthropes commented 4 years ago

Hi everyone, I have now hit this problem too. My goal is to download PDF files from websites (such as http://114.251.10.201/pdf/month?reportId=462837&isPublic=true), but I cannot download the PDF files completely with a Scrapy downloader middleware (using that method, many of the downloaded PDFs are only 1 KB), so I switched to the stream mode of requests.get (see https://github.com/scrapy/scrapy/issues/3880). But now, when I run it, Scrapy often gets stuck and logs "[urllib3.connectionpool] DEBUG: http://114.251.10.201:80 'GET /pdf/month?reportId=128520&isPublic=true HTTP/1.1' 200 None". It looks like there is no failure, but Scrapy just hangs for several hours. Any suggestions?