logicalparadox / apnagent

Node.js adapter for Apple Push Notification (APN) Service.
http://logicalparadox.github.io/apnagent
216 stars 38 forks source link

Agents randomly stop reconnecting #15

Open bloudermilk opened 10 years ago

bloudermilk commented 10 years ago

We're having issues with agents not reconnecting sometimes after a message error. Here are the relevant log entries:

2014-08-04T20:03:58.212822+00:00 app[web.1]: apnagent:agent-live:web.1.14 [6ms] (gateway) data: 6 bytes
2014-08-04T20:03:58.214107+00:00 app[web.1]: apnagent:agent-live:web.1.14 [2ms] (gateway) disconnected - gateway.push.apple.com:2195
2014-08-04T20:03:58.214248+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (gateway) end
2014-08-04T20:03:58.214862+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (encoder) write: 28273
2014-08-04T20:03:58.215484+00:00 app[web.1]: apnagent:agent-live:web.1.14 [1ms] (queue) pause: not connected
2014-08-04T20:03:58.198696+00:00 app[web.1]: apnagent:agent-live:web.1.14 [1ms] (gateway) write: 28267
2014-08-04T20:03:58.199419+00:00 app[web.1]: apnagent:agent-live:web.1.14 [1ms] (cache) push: 28267
2014-08-04T20:03:58.200158+00:00 app[web.1]: apnagent:agent-live:web.1.14 [1ms] (encoder) write: 28268
2014-08-04T20:03:58.200419+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (gateway) write: 28268
2014-08-04T20:03:58.200600+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (cache) push: 28268
2014-08-04T20:03:58.200749+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (encoder) write: 28269
2014-08-04T20:03:58.201017+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (gateway) write: 28269
2014-08-04T20:03:58.201155+00:00 app[web.1]: apnagent:agent-live:web.1.14 [1ms] (cache) push: 28269
2014-08-04T20:03:58.201307+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (encoder) write: 28270
2014-08-04T20:03:58.201578+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (gateway) write: 28270
2014-08-04T20:03:58.201718+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (cache) push: 28270
2014-08-04T20:03:58.201865+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (encoder) write: 28271
2014-08-04T20:03:58.202116+00:00 app[web.1]: apnagent:agent-live:web.1.14 [1ms] (gateway) write: 28271
2014-08-04T20:03:58.202255+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (cache) push: 28271
2014-08-04T20:03:58.204973+00:00 app[web.1]: apnagent:agent-live:web.1.14 [2ms] (encoder) write: 28272
2014-08-04T20:03:58.205635+00:00 app[web.1]: apnagent:agent-live:web.1.14 [1ms] (gateway) write: 28272
2014-08-04T20:03:58.206080+00:00 app[web.1]: apnagent:agent-live:web.1.14 [1ms] (cache) push: 28272
2014-08-04T20:04:01.207280+00:00 app[web.1]: apnagent:agent-live:web.1.14 [2992ms] (cache) since: 28263
2014-08-04T20:04:01.217101+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (queue) push: 28271
2014-08-04T20:04:01.215785+00:00 app[web.1]: apnagent:agent-live:web.1.14 [8ms] (queue) push: 28264
2014-08-04T20:04:01.215850+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (queue) push: 28265
2014-08-04T20:04:01.215921+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (queue) push: 28266
2014-08-04T20:04:01.215987+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (queue) push: 28267
2014-08-04T20:04:01.216170+00:00 app[web.1]: apnagent:agent-live:web.1.14 [1ms] (queue) push: 28268
2014-08-04T20:04:01.216226+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (queue) push: 28269
2014-08-04T20:04:01.216590+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (queue) push: 28270
2014-08-04T20:04:01.217914+00:00 app[web.1]: apnagent:agent-live:web.1.14 [1ms] (queue) push: 28272
2014-08-04T20:04:01.217969+00:00 app[web.1]: apnagent:agent-live:web.1.14 [0ms] (gateway) reconnecting
2014-08-04T20:04:01.218187+00:00 app[web.1]: apnagent:agent-live:web.1.14 [1ms] (gateway) connecting - gateway.push.apple.com:2195

You can see that the gateway receives 6 bytes, then closes the gateway as it should. Next we see the queue pause, which is also expected. Then, two strange things happen. First, we see that after the queue has been paused, the gateway is still being written to, which shouldn't happen. The next strange thing is that although the reconnect attempt happens 3 seconds later, it never finishes connecting or throws an error. That's the last line in our log for that worker after it continued to run for another hour or so. I'm going to experiment with setTimeout and the timeout event on the tls socket to see if that event gets called in this case. Very strange stuff...

Edit: As you can see I added some additional logging, which you can find on my fork if it's of interest.

bloudermilk commented 10 years ago

I added a connection timeout to my fork, which fixed this issue. Let me know if you'd like a PR.