tweetstream / em-twitter

Twitter Streaming API client for EventMachine
http://rubygems.org/gems/em-twitter
MIT License
42 stars 16 forks source link

Fix reconnect #7

Closed nono closed 11 years ago

nono commented 12 years ago

Hi,

I've made some changes to em-twitter to make reconnection logic works as expected. I wrote no tests and I won't have time to make them, sorry.

oz commented 12 years ago

Nice one. :)

sethvargo commented 12 years ago

Here's my code. It's broken and I won't have time to fix it.

nono commented 12 years ago

As you want...

stve commented 12 years ago

I think there's a misunderstanding here guys, we absolutely want any issues resolved and happily accept pull requests.

@nono, could you add tests for what's changed? Don't auto-reconnect after a stop should be an easy test case.

As for the reconnectors, I've tried to test reconnection thoroughly and there's quite a few tests in https://github.com/spagalloco/em-twitter/blob/master/spec/em-twitter/connection_reconnect_spec.rb to use as a starting point. Thanks.

nono commented 12 years ago

I understand that this code should not be merged without tests. I've passed a few hours to debug it on a real case and I hope someone can make it a real tested patch. I won't have time for that (going to holidays soon), so I'm sorry but it's the best I can do for the moment.

stve commented 12 years ago

I've implemented the auto-reconnect on stop fix in eb38a7e, so that just leaves the reconnector changes. These can be very difficult things to track down so I appreciate the work you did in identifying the issues. It might take me a few days to pull the rest of this in.

And I agree, this definitely seems related to intridea/tweetstream#84

stve commented 12 years ago

Could anyone who's interested eyeball f1b41d7 for accuracy. The tests were difficult to write but I think I've caught all the caess. @nono @oz if you have been running off @nono's fork, could you try pointing at master to see if everything looks ok? Thanks guys.

nono commented 12 years ago

Good job, but I think some corner cases are still not covered. I can explain the 3 issues I saw when debugging:

  1. HTTP error, then success, then network error => the reconnector was ApplicationFailure instead of NetworkFailure and so, it uses a timeout longer than needed.
  2. Many HTTP errors => at each error, the on_body is called and creates a new ApplicationFailure object, so the retries count wasn't acurate. I think this one explains some behaviors of https://github.com/intridea/tweetstream/issues/84:
TIMEOUT: 10 RETRIES: 0
TIMEOUT: 10 RETRIES: 0
TIMEOUT: 10 RETRIES: 0
  1. After fixing 1. and 2., the reconnectors were never reset. So, some HTTP errors, then OK for 10 minutes, then an HTTP error gives a timeout of 320s.

Thanks for taking time for this unglamorous work :)

stve commented 12 years ago

@nono, that's a good thought to test that the reconnector actually increments properly under expected scenarios. Thanks for the tips.

denen99 commented 12 years ago

is the gem updated to reflect this change?

denen99 commented 12 years ago

Also, noticing with the 0.1.4 version of the gem getting constant reconnect errors. With 0.1.3 everything seems fine.

stve commented 12 years ago

@denen99 not yet. I have a few more test cases I want to cover before releasing a new version.

The only change in 0.1.4 is 9c760af640097a03e75492f5dae83a69608f0c7d which closes the connection on a stall, i haven't seen evidence of stalled? working incorrectly, but will take a look. Are you tracking keywords where a stall is likely/possible?

denen99 commented 12 years ago

its possible but as soon as I use 0.1.3, i get no reconnect errors. If i use 0.1.4, i hit the Reconnect Error with Timeout of 10 with 0 retries error quite quickly.

Is there a way to get more debug on when stalls happen ?

denen99 commented 12 years ago

Is there a way to determine the HTTP response code ?

stve commented 12 years ago

@denen99 The response code isn't currently accessible but you can retrieve it via instance_variable_get on the Connection. I can make it accesible, is there a reason you'd need it?

stve commented 12 years ago

as for when stalls occur, use the Client#on_no_data_received callback. The reason you don't see the reconnects in 0.1.3 is that stalls were not triggering reconnects.