hellozedan / node-xmpp-bosh

Automatically exported from code.google.com/p/node-xmpp-bosh
0 stars 0 forks source link

Socket error which makes NXB crash #13

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Hi,

NXB crashes a lot since I have installed in production. Every 10 minutes I get 
a crash (or less or more).

Here is the crash log:

root@jappix:/etc/init.d# NODE_PATH=/usr/local/lib/node /usr/local/bin/bosh
+-------------------------------------------------------------------------------
--------------------------------------+
| Starting BOSH server 'v0.1.15' on 'http://:::5280/^\/http-bind(\/)?$/' at 
'Sun Jun 05 2011 17:47:34 GMT+0000 (UTC)' |
+-------------------------------------------------------------------------------
--------------------------------------+

net.js:693
  if (typeof self.fd === 'number') throw new Error('Socket already opened');
                                         ^
Error: Socket already opened
    at Socket.connect (net.js:693:42)
    at /usr/local/lib/bosh/src/srv.js:107:20
    at tryConnect (/usr/local/lib/bosh/src/srv.js:115:5)
    at /usr/local/lib/bosh/src/srv.js:127:21
    at /usr/local/lib/bosh/src/srv.js:55:21
    at /usr/local/lib/bosh/src/srv.js:68:21
    at /usr/local/lib/bosh/src/srv.js:85:9
    at dns.js:192:30
    at IOWatcher.callback (dns.js:74:15)
root@jappix:/etc/init.d# 

Hope it will be fixed soon, we need it quickly to keep our service stable ;)

Original issue reported on code.google.com by vanaryon on 5 Jun 2011 at 5:59

GoogleCodeExporter commented 8 years ago

Original comment by vanaryon on 5 Jun 2011 at 6:00

GoogleCodeExporter commented 8 years ago
I am currently checking if this is not due to your last modification to disable 
TLS on defined domains, because actually, I defined two local domains in the 
configuration.

Because it is a socket issue, it may be because of this.

Original comment by vanaryon on 5 Jun 2011 at 6:14

GoogleCodeExporter commented 8 years ago
Apart from this issue, which is recurrent, I must admit that NXB is running 
pretty fine. It is 2 times faster than ejabberd's internal BOSH and Punjab, and 
only consumes ~20Mio for ~100 active sessions, great :)

Original comment by vanaryon on 5 Jun 2011 at 6:29

GoogleCodeExporter commented 8 years ago
Mhh, it is not due to the no-TLS patch, because it is still crashing without 
using its option.

Anyway, I notice that NXB always crash around 10-11-12 minutes of uptime. You 
can have a look to it right here: https://bind.jappix.com/

Original comment by vanaryon on 5 Jun 2011 at 6:34

GoogleCodeExporter commented 8 years ago
I doubt that the TLS changes could cause this. However, does it work fine 
without the TLS fixes?

Can you enable debug logging? Does this happen when you start NXB or some time 
later?

The error seems to be pointing at the DNS resolution bit (and subsequent 
connecting). It would be good to know which domain this happens for.

Original comment by dhruvb...@gmail.com on 5 Jun 2011 at 6:36

GoogleCodeExporter commented 8 years ago
It's strange that it crashes with < 50 users since I've tested with over 5k. 
Either ways, it would be interesting to see the exact domian during which the 
resolution causes it to crash. If it can be reproduced, then fixing it would be 
relatively simpler.

Original comment by dhruvb...@gmail.com on 5 Jun 2011 at 6:38

GoogleCodeExporter commented 8 years ago
Okay, right. Will enable DEBUG mode and let you know.

Original comment by vanaryon on 5 Jun 2011 at 6:41

GoogleCodeExporter commented 8 years ago
Hey, got it!

Here is the log for what makes NXB crash:

DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) BOSH::Processing 'POST' request 
at location: /http-bind
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) BOSH::Processing request: <body 
content="text/xml; charset=utf-8" secure="true" to="osvaldobarrera.com.ar" 
xml:lang="en" xmpp:version="1.0" ver="1.6" xmlns:xmpp="urn:xmpp:xbosh" 
rid="2995823658399267" wait="60" hold="1" 
xmlns="http://jabber.org/protocol/httpbind"/>
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) BOSH::Session creation
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) 
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::adding a response object. Holding 0 
response objects
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) 
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::setting a timeout of '80' sec
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) 
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::enqueue_response
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) 
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::Merging with response at index: -1
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) LOOKUP 
SERVICE::try_connect_SRV_lookup
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) 
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::Returning response object with rid: 
2995823658399267
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) 
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::Holding 0 response objects
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) 
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::pop_and_send: ro:true, 
state.pending.length: 1
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) 
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::send_no_requeue, ro valid: true
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) 
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::send_no_requeue, rid: 
2995823658399267
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) 
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::send_no_requeue:writing response: 
<body xmlns="http://jabber.org/protocol/httpbind" 
stream="ecf4d292-524d-46d1-9c4e-486e6b4a7131" 
sid="828f73b5-03a6-4946-a6e4-1c6baf2a4fd6" wait="60" polling="35" 
inactivity="70" requests="2" hold="1" from="osvaldobarrera.com.ar" 
content="text/xml; charset=utf-8" xmpp:restartlogic="true" 
xmlns:xmpp="urn:xmpp:xbosh" window="2"/>
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) 
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::send_pending_responses::state.pendin
g.length: 0

net.js:693
  if (typeof self.fd === 'number') throw new Error('Socket already opened');
                                         ^
Error: Socket already opened
    at Socket.connect (net.js:693:42)
    at /usr/local/lib/bosh/src/srv.js:107:20
    at tryConnect (/usr/local/lib/bosh/src/srv.js:115:5)
    at /usr/local/lib/bosh/src/srv.js:127:21
    at /usr/local/lib/bosh/src/srv.js:55:21
    at /usr/local/lib/bosh/src/srv.js:68:21
    at /usr/local/lib/bosh/src/srv.js:85:9
    at dns.js:192:30
    at IOWatcher.callback (dns.js:74:15)
root@jappix:/home/vanaryon# 

Original comment by vanaryon on 5 Jun 2011 at 6:53

GoogleCodeExporter commented 8 years ago
And here is the SRV query for this server (note that 'localhost' appears in the 
reply, might be the reason of the issue, right?):

vanaryon@ocarina:~$ dig -t SRV _xmpp-client._tcp.osvaldobarrera.com.ar

; <<>> DiG 9.7.3 <<>> -t SRV _xmpp-client._tcp.osvaldobarrera.com.ar
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55801
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;_xmpp-client._tcp.osvaldobarrera.com.ar. IN SRV

;; ANSWER SECTION:
_xmpp-client._tcp.osvaldobarrera.com.ar. 3600 IN SRV 5 0 5222 
xmpp.hugoosvaldobarrera.com.ar.
_xmpp-client._tcp.osvaldobarrera.com.ar. 3600 IN SRV 60 0 5222 localhost.

;; Query time: 202 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sun Jun  5 20:54:42 2011
;; MSG SIZE  rcvd: 136

Original comment by vanaryon on 5 Jun 2011 at 6:55

GoogleCodeExporter commented 8 years ago
Would it be possible for you to fix it in the following hour? The stability of 
our service depends on that :S I will be away for 3 days after that, and I have 
a hour remaining. Thanks a lot ;)

Original comment by vanaryon on 5 Jun 2011 at 6:57

GoogleCodeExporter commented 8 years ago
I'm trying to reproduce this locally using this command:
node basic.js --username=xxx@osvaldobarrera.com.ar --password=yy

But, it doesn't crash here. I am using google DNS (8.8.8.8)

;; ANSWER SECTION:
_xmpp-client._tcp.osvaldobarrera.com.ar. 3397 IN SRV 5 0 5222 
xmpp.hugoosvaldobarrera.com.ar.
_xmpp-client._tcp.osvaldobarrera.com.ar. 3397 IN SRV 60 0 5222 localhost.

Which is the same as your output. I've pushed a patch that I *guess* may fix 
it, but I can't be sure. Is there any way I can induce a crash on your machine?

Original comment by dhruvb...@gmail.com on 5 Jun 2011 at 7:09

GoogleCodeExporter commented 8 years ago
This guy's SRV records are messed up!! He has localhost in his SRV record!! 
Which means that if the first connection fails, it will try to connect to 
localhost.

Original comment by dhruvb...@gmail.com on 5 Jun 2011 at 7:13

GoogleCodeExporter commented 8 years ago
Updated from SVN, your patch is now running!

Feel free to try to make NXB crash on Jappix.com ;)

Original comment by vanaryon on 5 Jun 2011 at 7:14

GoogleCodeExporter commented 8 years ago
Yep, that's what I have thought about. It does not make any sense, I don't 
really understand why using this :o

Original comment by vanaryon on 5 Jun 2011 at 7:15

GoogleCodeExporter commented 8 years ago
okay, I think I was able to crash it on your machine. Now for my machine.

Original comment by dhruvb...@gmail.com on 5 Jun 2011 at 7:15

GoogleCodeExporter commented 8 years ago
Mhh, seems it has crashed once again with your patch. It does not fix it :S

Original comment by vanaryon on 5 Jun 2011 at 7:16

GoogleCodeExporter commented 8 years ago
hmm... Seems to be crashing every 15 min out there

Original comment by dhruvb...@gmail.com on 5 Jun 2011 at 7:33

GoogleCodeExporter commented 8 years ago
Okay, commit #264 should *hopefully* be a correct fix for this. Do let me know 
if it works out.

So, here is the problem:

* On your machine, you have an XMPP server running on localhost whereas I don't
* The SRV lookup for the domain under question returns 2 records (as seen in 
comment #11)
* Due to some bug, the localhost record is tried first
* On your machine, it connects whereas on mine it doesn't
* Due to a bug in the srv.js module, the second SRV lookup is tried 
"irrespective" of whether the 1st one succeeded or failed.
* On your machine, you get a "connection tried for already connected socket" 
but on mine you don't because I don't have anything running on 
localhost:5222!!!!
* On mine, you first get an 'error' event on the socket (terminates the BOSH 
session) and subsequently, the connection to the next address succeeds, but it 
is too late by now. This socket remains open forever (socket leak) or till 
terminated by the upstream XMPP server.

Original comment by dhruvb...@gmail.com on 6 Jun 2011 at 6:44

GoogleCodeExporter commented 8 years ago
Okay, thanks for the patch. I will apply it next Wednesday, or later (maybe 
during the night, to avoid disconnecting all our users).

I will tell you if all is okay when tested!

Original comment by vanaryon on 6 Jun 2011 at 3:57

GoogleCodeExporter commented 8 years ago
Have updated Jappix.com's NXB, tried to crash it with the 'localhost' SRV 
records, everything went fine.

I think this bug can be fixed, but I am not sure. NXB crashed 5 times today, 
after running more than 26 hours without any crash.

I will report you the logs when I will get it, hope it is the same crash reason 
everytime, and not a wide range of new bugs ;)

Original comment by vanaryon on 7 Jun 2011 at 4:22

GoogleCodeExporter commented 8 years ago
This has been fixed.
Please re-open if you see the same (or similar) exception stack trace again.

Original comment by dhruvb...@gmail.com on 9 Jun 2011 at 8:18