Closed GoogleCodeExporter closed 8 years ago
Original comment by vanaryon
on 5 Jun 2011 at 6:00
I am currently checking if this is not due to your last modification to disable
TLS on defined domains, because actually, I defined two local domains in the
configuration.
Because it is a socket issue, it may be because of this.
Original comment by vanaryon
on 5 Jun 2011 at 6:14
Apart from this issue, which is recurrent, I must admit that NXB is running
pretty fine. It is 2 times faster than ejabberd's internal BOSH and Punjab, and
only consumes ~20Mio for ~100 active sessions, great :)
Original comment by vanaryon
on 5 Jun 2011 at 6:29
Mhh, it is not due to the no-TLS patch, because it is still crashing without
using its option.
Anyway, I notice that NXB always crash around 10-11-12 minutes of uptime. You
can have a look to it right here: https://bind.jappix.com/
Original comment by vanaryon
on 5 Jun 2011 at 6:34
I doubt that the TLS changes could cause this. However, does it work fine
without the TLS fixes?
Can you enable debug logging? Does this happen when you start NXB or some time
later?
The error seems to be pointing at the DNS resolution bit (and subsequent
connecting). It would be good to know which domain this happens for.
Original comment by dhruvb...@gmail.com
on 5 Jun 2011 at 6:36
It's strange that it crashes with < 50 users since I've tested with over 5k.
Either ways, it would be interesting to see the exact domian during which the
resolution causes it to crash. If it can be reproduced, then fixing it would be
relatively simpler.
Original comment by dhruvb...@gmail.com
on 5 Jun 2011 at 6:38
Okay, right. Will enable DEBUG mode and let you know.
Original comment by vanaryon
on 5 Jun 2011 at 6:41
Hey, got it!
Here is the log for what makes NXB crash:
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) BOSH::Processing 'POST' request
at location: /http-bind
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) BOSH::Processing request: <body
content="text/xml; charset=utf-8" secure="true" to="osvaldobarrera.com.ar"
xml:lang="en" xmpp:version="1.0" ver="1.6" xmlns:xmpp="urn:xmpp:xbosh"
rid="2995823658399267" wait="60" hold="1"
xmlns="http://jabber.org/protocol/httpbind"/>
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) BOSH::Session creation
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC)
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::adding a response object. Holding 0
response objects
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC)
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::setting a timeout of '80' sec
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC)
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::enqueue_response
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC)
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::Merging with response at index: -1
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC) LOOKUP
SERVICE::try_connect_SRV_lookup
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC)
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::Returning response object with rid:
2995823658399267
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC)
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::Holding 0 response objects
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC)
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::pop_and_send: ro:true,
state.pending.length: 1
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC)
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::send_no_requeue, ro valid: true
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC)
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::send_no_requeue, rid:
2995823658399267
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC)
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::send_no_requeue:writing response:
<body xmlns="http://jabber.org/protocol/httpbind"
stream="ecf4d292-524d-46d1-9c4e-486e6b4a7131"
sid="828f73b5-03a6-4946-a6e4-1c6baf2a4fd6" wait="60" polling="35"
inactivity="70" requests="2" hold="1" from="osvaldobarrera.com.ar"
content="text/xml; charset=utf-8" xmpp:restartlogic="true"
xmlns:xmpp="urn:xmpp:xbosh" window="2"/>
DEBUG Sun Jun 05 2011 18:50:17 GMT+0000 (UTC)
BOSH::828f73b5-03a6-4946-a6e4-1c6baf2a4fd6::send_pending_responses::state.pendin
g.length: 0
net.js:693
if (typeof self.fd === 'number') throw new Error('Socket already opened');
^
Error: Socket already opened
at Socket.connect (net.js:693:42)
at /usr/local/lib/bosh/src/srv.js:107:20
at tryConnect (/usr/local/lib/bosh/src/srv.js:115:5)
at /usr/local/lib/bosh/src/srv.js:127:21
at /usr/local/lib/bosh/src/srv.js:55:21
at /usr/local/lib/bosh/src/srv.js:68:21
at /usr/local/lib/bosh/src/srv.js:85:9
at dns.js:192:30
at IOWatcher.callback (dns.js:74:15)
root@jappix:/home/vanaryon#
Original comment by vanaryon
on 5 Jun 2011 at 6:53
And here is the SRV query for this server (note that 'localhost' appears in the
reply, might be the reason of the issue, right?):
vanaryon@ocarina:~$ dig -t SRV _xmpp-client._tcp.osvaldobarrera.com.ar
; <<>> DiG 9.7.3 <<>> -t SRV _xmpp-client._tcp.osvaldobarrera.com.ar
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55801
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;_xmpp-client._tcp.osvaldobarrera.com.ar. IN SRV
;; ANSWER SECTION:
_xmpp-client._tcp.osvaldobarrera.com.ar. 3600 IN SRV 5 0 5222
xmpp.hugoosvaldobarrera.com.ar.
_xmpp-client._tcp.osvaldobarrera.com.ar. 3600 IN SRV 60 0 5222 localhost.
;; Query time: 202 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sun Jun 5 20:54:42 2011
;; MSG SIZE rcvd: 136
Original comment by vanaryon
on 5 Jun 2011 at 6:55
Would it be possible for you to fix it in the following hour? The stability of
our service depends on that :S I will be away for 3 days after that, and I have
a hour remaining. Thanks a lot ;)
Original comment by vanaryon
on 5 Jun 2011 at 6:57
I'm trying to reproduce this locally using this command:
node basic.js --username=xxx@osvaldobarrera.com.ar --password=yy
But, it doesn't crash here. I am using google DNS (8.8.8.8)
;; ANSWER SECTION:
_xmpp-client._tcp.osvaldobarrera.com.ar. 3397 IN SRV 5 0 5222
xmpp.hugoosvaldobarrera.com.ar.
_xmpp-client._tcp.osvaldobarrera.com.ar. 3397 IN SRV 60 0 5222 localhost.
Which is the same as your output. I've pushed a patch that I *guess* may fix
it, but I can't be sure. Is there any way I can induce a crash on your machine?
Original comment by dhruvb...@gmail.com
on 5 Jun 2011 at 7:09
This guy's SRV records are messed up!! He has localhost in his SRV record!!
Which means that if the first connection fails, it will try to connect to
localhost.
Original comment by dhruvb...@gmail.com
on 5 Jun 2011 at 7:13
Updated from SVN, your patch is now running!
Feel free to try to make NXB crash on Jappix.com ;)
Original comment by vanaryon
on 5 Jun 2011 at 7:14
Yep, that's what I have thought about. It does not make any sense, I don't
really understand why using this :o
Original comment by vanaryon
on 5 Jun 2011 at 7:15
okay, I think I was able to crash it on your machine. Now for my machine.
Original comment by dhruvb...@gmail.com
on 5 Jun 2011 at 7:15
Mhh, seems it has crashed once again with your patch. It does not fix it :S
Original comment by vanaryon
on 5 Jun 2011 at 7:16
hmm... Seems to be crashing every 15 min out there
Original comment by dhruvb...@gmail.com
on 5 Jun 2011 at 7:33
Okay, commit #264 should *hopefully* be a correct fix for this. Do let me know
if it works out.
So, here is the problem:
* On your machine, you have an XMPP server running on localhost whereas I don't
* The SRV lookup for the domain under question returns 2 records (as seen in
comment #11)
* Due to some bug, the localhost record is tried first
* On your machine, it connects whereas on mine it doesn't
* Due to a bug in the srv.js module, the second SRV lookup is tried
"irrespective" of whether the 1st one succeeded or failed.
* On your machine, you get a "connection tried for already connected socket"
but on mine you don't because I don't have anything running on
localhost:5222!!!!
* On mine, you first get an 'error' event on the socket (terminates the BOSH
session) and subsequently, the connection to the next address succeeds, but it
is too late by now. This socket remains open forever (socket leak) or till
terminated by the upstream XMPP server.
Original comment by dhruvb...@gmail.com
on 6 Jun 2011 at 6:44
Okay, thanks for the patch. I will apply it next Wednesday, or later (maybe
during the night, to avoid disconnecting all our users).
I will tell you if all is okay when tested!
Original comment by vanaryon
on 6 Jun 2011 at 3:57
Have updated Jappix.com's NXB, tried to crash it with the 'localhost' SRV
records, everything went fine.
I think this bug can be fixed, but I am not sure. NXB crashed 5 times today,
after running more than 26 hours without any crash.
I will report you the logs when I will get it, hope it is the same crash reason
everytime, and not a wide range of new bugs ;)
Original comment by vanaryon
on 7 Jun 2011 at 4:22
This has been fixed.
Please re-open if you see the same (or similar) exception stack trace again.
Original comment by dhruvb...@gmail.com
on 9 Jun 2011 at 8:18
Original issue reported on code.google.com by
vanaryon
on 5 Jun 2011 at 5:59