igrigorik / em-http-request

Asynchronous HTTP Client (EventMachine + Ruby)
1.22k stars 220 forks source link

Could not parse data entirely #84

Closed rb2k closed 13 years ago

rb2k commented 13 years ago

I cought this using the EM.error_handler callback

E, [2011-03-30T09:13:45.069043 #9214] ERROR -- : Error raised during event loop: Could not parse data entirely
["/usr/local/lib/ruby/gems/1.9.1/gems/em-http-request-1.0.0.beta.1/lib/em http/http_connection.rb:80:in `<<'", "/usr/local/lib/ruby/gems/1.9.1/gems/em-http-request-1.0.0.beta.1/lib/em-http/http_connection.rb:80:in `receive_data'", "/usr/local/lib/ruby/gems/1.9.1/gems/eventmachine-0.12.11/lib/eventmachine.rb:195:in `run_machine'", "/usr/local/lib/ruby/gems/1.9.1/gems/eventmachine-0.12.11/lib/eventmachine.rb:195:in `run'", "test.rb:383:in `<main>'"]

Line 383 in test.rb is just the beginning of the eventmachine loop. Any idea?

rb2k commented 13 years ago

just noticed beta3 is already out. updated. will report if the problem persists

igrigorik commented 13 years ago

Hmm, yeah I think that should be handled now. Let me know if not!

rb2k commented 13 years ago

Maybe related? This time with 1.0.3beta of em-http and eventmachine

E, [2011-03-30T16:39:31.412601 #12217] ERROR -- : Error raised during event loop: undefined method `on_error' for nil:NilClass
["/usr/local/lib/ruby/gems/1.9.1/gems/em-http-request-1.0.0.beta.3/lib/em-http/http_connection.rb:84:in `rescue in receive_data'", "/usr/local/lib/ruby/gems/1.9.1/gems/em-http-request-1.0.0.beta.3/lib/em-http/http_connection.rb:80:in `receive_data'", "/usr/local/lib/ruby/gems/1.9.1/gems/eventmachine-1.0.0.beta.3/lib/eventmachine.rb:199:in `run_machine'", "/usr/local/lib/ruby/gems/1.9.1/gems/eventmachine-1.0.0.beta.3/lib/eventmachine.rb:199:in `run'", "crawler_evented.rb:400:in `<main>'"]
igrigorik commented 13 years ago

Do you have the URL that's generating the error?

rb2k commented 13 years ago

Sadly no, it got cought by the EM.error_handler{} and that didn't give me any more details :( I hope I can figure out a way to log the error where it is happening to give you more information

igrigorik commented 13 years ago

That would be great.. any hints would be helpful, otherwise I'm shooting in the dark here.

rb2k commented 13 years ago

Just to make sure I understood the source:

It seems to be an HTTP Parser Error during the receive_data callback and at this time, there doesn't seem to be a client in the @clients array (@clients.shift returns nil).
Is that about right?

igrigorik commented 13 years ago

Yep. The trick is in figuring out how that case is being triggered in the first place - that part is evading me at the moment. Are you using redirects by any chance?

rb2k commented 13 years ago

Yes, redirects are enabled

igrigorik commented 13 years ago

@rb2k: reviewing the code.. another question: are you using keep alive or pipelining?

Is this a frequent error?

rb2k commented 13 years ago

Neither one. I'm only making 1 HTTP connection per domain anyway :)
It is a pretty frequent error (every 5-10 minutes or so (about 30 http calls/second).
I will try to make a special debug built that only does one request at a time and just setting a global variable so I can print the URL that breaks it

igrigorik commented 13 years ago

heads up: #86 may be related to what you're seeing here.

rb2k commented 13 years ago

Ha! Got one :)

Code: require "em-http-request" EventMachine.run { http = EventMachine::HttpRequest.new('http://levishlomi.com/').get(:timeout => 30, :redirects => 10) http.errback { p 'Uh oh'; EM.stop } http.callback { p http.response_header.status p http.response_header p http.response

        EventMachine.stop
      }
    }

Results in:

$ ruby fail.rb 
"Uh oh"
/Users/mseeger/.rvm/gems/ruby-1.9.2-p136/gems/em-http-request-1.0.0.beta.3/lib/em-http/http_connection.rb:84:in `rescue in receive_data': undefined method `on_error' for nil:NilClass (NoMethodError)
from /Users/mseeger/.rvm/gems/ruby-1.9.2-p136/gems/em-http-request-1.0.0.beta.3/lib/em-http/http_connection.rb:80:in `receive_data'
from /Users/mseeger/.rvm/gems/ruby-1.9.2-p136/gems/eventmachine-1.0.0.beta.3/lib/eventmachine.rb:199:in `run_machine'
from /Users/mseeger/.rvm/gems/ruby-1.9.2-p136/gems/eventmachine-1.0.0.beta.3/lib/eventmachine.rb:199:in `run'
from fail.rb:3:in `<main>'

And just in case it goes down: $ curl -I levishlomi.com HTTP/1.1 302 Found Date: Thu, 07 Apr 2011 09:39:16 GMT Server: Microsoft-IIS/6.0 X-Powered-By: ASP.NET X-AspNet-Version: 2.0.50727 Location: http://joydjs.com Cache-Control: private Content-Type: text/html; charset=utf-8 Content-Length: 134

It seems to be timing related though, crashes in 1 out of 5 tries for me

igrigorik commented 13 years ago

Ugh, that's a nasty one... So, the actual problem is upstream:

$> curl http://joydjs.com curl: (56) Received problem 2 in the chunky parser

The URL it redirects to does something rather silly:

HTTP/1.0 200 OK Date: Fri, 08 Apr 2011 03:49:06 GMT Server: Apache/2.2.14 (Debian) Set-Cookie: SESSIONID=kp2zdTJsOwqVwIeAVGX-vs6SMqZfht3jgsBSzdffonuxyC3s1jPzItLKCKKjp_XW; path=/ Expires: Thu, 19 Nov 1981 08:52:00 GMT Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Pragma: no-cache Vary: Accept-Encoding Transfer-Encoding: chunked Content-Type: text/html X-Cache: MISS from localhost X-Cache-Lookup: MISS from localhost:80 Via: 1.0 localhost (squid/3.0.STABLE19) Connection: close

Namely, it claims chunked encoding, with Connection:close over HTTP/1.0 - that is one heck of a mix. In other words, we close the connection but the server then passes us the next chunk, and hence we explode since the client is gone.

This should avoid the blow up: https://github.com/igrigorik/em-http-request/commit/992df31da2650da992a3278c0535f92e2a730317

I'm not convinced that's the best solution, but atm, can't think of a better way to handle this.

igrigorik commented 13 years ago

By the way, forgot to add.. :host parameter doesn't actually do anything in the 1.0 beta's - just a heads up.

rb2k commented 13 years ago

Thanks for the fix, trying it in production now :)
So 1.0 beta does the async DNS resolving itself and I can just kick that part out of my workflow?
I currently need to know which hosts don't resolve so I can delete them from a database, would that come back via the errback?

igrigorik commented 13 years ago

Unfortunately the reason the reason for the "beta" flag is because I'm waiting for async DNS to call em-http 1.0.. (cc @tmm1 :)).

When refactoring 1.0 to use the new parser I dropped the :host parameter because it seemed redundant: you can achieve the same behavior by combination of preparsing the uri and manually updating its host.

If you need to keep track of which failed, then the errback should be invoked in case of a bad DNS resolve and you can access the uri of the client and invoke all your logic from there.

igrigorik commented 13 years ago

Resolved, will push a new beta version today - closing.