taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

Thread seems to hang in HTTP Call #9

Open hendricius opened 10 years ago

hendricius commented 10 years ago

Hi!

It seems one of our threads is stuck in an HTTP call. I think the function is:

https://github.com/taganaka/polipus/blob/master/lib/polipus/http.rb#L170

It looks like the connection is never closed. Any idea what this could be?

Thanks!

Here is a full stacktrace:

/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:155:in `rescue in rbuf_fill'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:152:in `rbuf_fill'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:134:in `readuntil'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:144:in `readline'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http/response.rb:39:in `read_status_line'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http/response.rb:28:in `read_new'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1406:in `block in transport_request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1403:in `catch'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1403:in `transport_request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1376:in `request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rest-client-1.6.7/lib/restclient/net_http_ext.rb:51:in `request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus/http.rb:149:in `get_response'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus/http.rb:123:in `get'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus/http.rb:32:in `fetch_pages'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus.rb:179:in `block (3 levels) in takeover'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:56:in `block in process'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:54:in `loop'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:54:in `process'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-b4a3ce1be226/lib/polipus.rb:154:in `block (2 levels) in takeover'

taganaka commented 10 years ago

I was working on a similar issue. Can you please try a smoke test with this branch? https://github.com/taganaka/polipus/tree/proxy_no_cache

A connection is now refreshed correctly after 3 attempts.

New parameters have been added for finer-grained control over HTTP timeouts:

# HTTP open connection timeout in seconds
:open_timeout => 10,
# Mark a connection as stale after connection_max_hits requests
:connection_max_hits => nil

Let me know how it goes
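
For readers following along, here is a minimal sketch of how timeouts like these are typically applied to a plain `Net::HTTP` object. The helper name `build_http` and the `:read_timeout` option key are assumptions made for illustration and are not taken from the Polipus source; `open_timeout` and `read_timeout` themselves are standard `Net::HTTP` attributes.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of Polipus): builds a Net::HTTP object
# with explicit timeouts so a stalled server cannot block a worker forever.
def build_http(url, opts = {})
  uri  = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port) # does not open a connection yet
  http.use_ssl      = (uri.scheme == 'https')
  http.open_timeout = opts.fetch(:open_timeout, 10) # seconds to establish the connection
  http.read_timeout = opts.fetch(:read_timeout, 30) # seconds to wait for each socket read
  http
end

http = build_http('http://example.com/', open_timeout: 10, read_timeout: 15)
puts "open=#{http.open_timeout}s read=#{http.read_timeout}s"
```

Note that `read_timeout` bounds each individual read, not the whole response, so a server that trickles data slowly can still hold a connection open for longer than the configured value.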

hendricius commented 10 years ago

Hey Francesco,

Thanks for the response. I just tested it with your branch, and the thread still seems to hang. From my diagnosis, it must be in one of the following lines:

/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:155:in `select'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:155:in `rescue in rbuf_fill'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:152:in `rbuf_fill'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:134:in `readuntil'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/protocol.rb:144:in `readline'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http/response.rb:39:in `read_status_line'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http/response.rb:28:in `read_new'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1406:in `block in transport_request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1403:in `catch'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1403:in `transport_request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/2.0.0/net/http.rb:1376:in `request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rest-client-1.6.7/lib/restclient/net_http_ext.rb:51:in `request'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-3a8e7de1a245/lib/polipus/http.rb:167:in `get_response'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-3a8e7de1a245/lib/polipus/http.rb:141:in `get'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-3a8e7de1a245/lib/polipus/http.rb:33:in `fetch_pages'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-3a8e7de1a245/lib/polipus.rb:188:in `block (3 levels) in takeover'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:56:in `block in process'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:54:in `loop'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/redis-queue-0.0.3/lib/redis/queue.rb:54:in `process'
/home/deployer/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/bundler/gems/polipus-3a8e7de1a245/lib/polipus.rb:163:in `block (2 levels) in takeover'

taganaka commented 10 years ago

Hi @hendricius, I can't reproduce this even after weeks of running. Is that the full backtrace?

hendricius commented 10 years ago

Hi,

Yep. The problem happens when we crawl websites that block us after giving a timeout. Some of them also just return a 500 server error. It seems the crawler keeps requesting the same URL again and hangs there.

-hendrik
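
One common way to avoid endlessly re-requesting a failing URL is to cap the number of attempts and back off between them. The sketch below is only an illustration of that idea, not how Polipus handles retries: the helper `with_retries`, its parameters, and the list of errors treated as transient are all assumptions.

```ruby
require 'timeout'

# Errors treated as transient in this sketch; the list is an assumption.
# Net::OpenTimeout and Net::ReadTimeout are subclasses of Timeout::Error.
RETRIABLE_ERRORS = [Timeout::Error, Errno::ECONNRESET,
                    Errno::ECONNREFUSED, EOFError].freeze

# Hypothetical helper: runs the block up to max_attempts times, sleeping
# longer after each failure, then re-raises the last error.
def with_retries(max_attempts: 3, base_delay: 0.5)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *RETRIABLE_ERRORS
    raise if attempt >= max_attempts          # give up: let the caller skip the URL
    sleep(base_delay * (2**(attempt - 1)))    # exponential backoff
    retry
  end
end

# Usage with a simulated stall on the first two attempts.
calls = 0
result = with_retries(max_attempts: 3, base_delay: 0) do |n|
  calls += 1
  raise Timeout::Error, 'simulated stall' if n < 3
  :ok
end
puts "result=#{result} after #{calls} calls"
```

Once the attempts are exhausted the error propagates, so the worker can mark the URL as failed and move on instead of looping.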

hendricius commented 10 years ago

Update:

I think it is a problem with Ruby's HTTP library. It does not seem to be thread-safe.

http://stackoverflow.com/questions/25803089/is-ruby-2-1-2-timeout-still-not-thread-safe#comment40438901_25803089

We are getting this randomly every 3-4 weeks. If I find a solution I will update here.

The issue is here: https://github.com/ruby/ruby/blob/ruby_2_1/lib/net/http.rb#L879

tmaier commented 10 years ago

We once considered using Excon instead (see https://github.com/taganaka/polipus/pull/37#issuecomment-46028105, item 5).

Would this help?

hendricius commented 10 years ago

@tmaier that could work. I guess the best option would be to let users plug in their own HTTP library. What do you think?

I have read that a few people are having thread-safety issues with Ruby's default HTTP library.

taganaka commented 10 years ago

@hendricius @tmaier definitely on our to-do list.

hendricius commented 10 years ago

I think we should go for: https://github.com/lostisland/faraday

That will make things a lot easier, as people can just wire in their own library.
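
To make the "wire in your own library" idea concrete, here is a rough sketch of a pluggable HTTP adapter using only the standard library. It does not use Faraday and is not Polipus's design; the class names `NetHttpAdapter`, `StubAdapter`, and `Fetcher`, and the `get(url) -> [status, body]` contract, are all hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical default adapter backed by Net::HTTP.
class NetHttpAdapter
  def get(url)
    uri = URI.parse(url)
    res = Net::HTTP.start(uri.host, uri.port,
                          use_ssl: uri.scheme == 'https',
                          open_timeout: 10, read_timeout: 30) do |http|
      http.get(uri.request_uri)
    end
    [res.code.to_i, res.body]
  end
end

# Hypothetical stand-in for a user-supplied adapter (e.g. one wrapping
# another HTTP client); here it returns a canned response.
class StubAdapter
  def get(_url)
    [200, '<html>stub</html>']
  end
end

# Hypothetical fetcher: depends only on an object that responds to
# #get(url) and returns [status, body].
class Fetcher
  def initialize(adapter: NetHttpAdapter.new)
    @adapter = adapter
  end

  def fetch(url)
    status, body = @adapter.get(url)
    { url: url, status: status, body: body }
  end
end

page = Fetcher.new(adapter: StubAdapter.new).fetch('http://example.com/')
puts page[:status]
```

Because the fetcher relies on duck typing, swapping the HTTP client only requires a small class that honors the same `get` contract, which is broadly the kind of flexibility an adapter layer is meant to provide.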