kirm / sip.js

Session Initiation Protocol for node.js
MIT License

Multiple 302 Response loops after ACK #34

Closed ghost closed 10 years ago

ghost commented 11 years ago

Hey kirm-

First of all, awesome lib man. It's working well at fairly high CPS levels: I did some performance testing earlier this morning and saw ~2500 CPS on a quad-core 2.0 GHz box.

I ran into an interesting issue tonight while investigating several "time outs" that I was receiving. I've isolated the problem to a "302 response loop".

Please see: http://snag.gy/WUMgg.jpg which depicts the "interesting" call flow in Wireshark. As you can see, we keep sending a 302 after the ACK has been received for this particular call. As far as I can tell the box is not under significant load when this occurs.

I've attached the following pcap (which you can download here: https://filetea.me/default/#t1sYw6V9WWCT2mcSGumSpxcCA)

local_trace.cap - This contains a capture running on the box during a perf test. Note the call with start time 8.421337 and size 12; its flow is very strange: INVITE ->, <- 302, -> ACK, -> INVITE, <- 302, <- 302, .....

Any thoughts on this? Thanks in advance!

avimar commented 11 years ago

(NOTE: That 2500 CPS is with only 1 node thread, so it was only using one of those cores.)

ghost commented 11 years ago

Link to pcap here: http://ge.tt/9JjC8kd/v/0?c

Thanks!

avimar commented 11 years ago

The only clustering I'm seeing in local_trace.pcap is where the call-id happens to be the same (it's not a very collision-resistant call id!) and therefore wireshark incorrectly groups it together, even though the tag is different.

(e.g.: "call-id: 265486" - perhaps we need a better call-id generator in this perf testing.)

Can you be more specific?
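For the Call-ID collisions mentioned above, a minimal sketch of a more collision-resistant generator for a test harness might look like the following. The helper names `makeCallId` and `dialogKey` and the host `loadtest.invalid` are hypothetical and not part of sip.js; the only real API used is Node's `crypto.randomBytes`.

```javascript
const crypto = require('crypto');

// Hypothetical helper (not part of sip.js): builds a Call-ID from
// 16 random bytes, so two calls colliding is practically impossible,
// unlike a short numeric id such as "265486".
function makeCallId(host) {
  const token = crypto.randomBytes(16).toString('hex');
  return host ? `${token}@${host}` : token;
}

// Hypothetical helper: a call should be identified by Call-ID plus
// tags, not by Call-ID alone, when grouping messages.
function dialogKey(callId, localTag, remoteTag) {
  return [callId, localTag, remoteTag].join('|');
}

// 'loadtest.invalid' is a placeholder host name.
console.log(makeCallId('loadtest.invalid'));
```

With unique Call-IDs, tools that group packets by Call-ID alone no longer merge unrelated calls into one flow.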

ghost commented 11 years ago

Sorry about the ambiguity... please see a more detailed pcap here: http://ge.tt/3B4uHkd/v/0?c

Find call with start time 108.288285 and look at the "Call Flow", packet size is 34 and you'll see what seems to be erroneous call flow. When the ACK is received a 302 is continually sent.

kirm commented 11 years ago

It seems there are two separate issues under discussion here. One is that some server transactions seem to ignore ACK requests and keep sending responses. I looked at the trace @sl1ngsh0t provided and I think I found the faulty call. I need some more time to research the issue.

The second issue is performance and the use of sip.js with the node.js cluster facility. sip.js does have some issues with cluster. The reason is that sip.js maintains quite a lot of state in its transaction layer. However, if your application doesn't need to maintain transaction state, you can work directly with the sip.js transport layer and use cluster. It works quite well in this case and benefits from using multiple cores.

ghost commented 11 years ago

Awesome news on the first issue.

In regards to the second issue, we're not using cluster atm. We ARE planning on it though.

avimar commented 11 years ago

Doh! The load balancer is returning ACKs to the wrong machine. Sorry!

kirm commented 11 years ago

There is a load balancer in the setup? Could you describe your setup? Or has the issue been resolved on your side?

avimar commented 11 years ago

Citrix Netscaler LB -> two instances running node sip.js. The main issue we were seeing was that no persistence was set on the Netscaler, so it was passing the ACKs back to the wrong node sip.js instance. Since the instance that sent the 302 never saw the ACK, sip.js resent the responses, as it's supposed to....
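To illustrate why the 302 repeats when the ACK never arrives: RFC 3261 (section 17.2.1) has an INVITE server transaction on an unreliable transport retransmit its final non-2xx response on timer G, starting at T1 and doubling up to T2, until timer H (64*T1) expires. The sketch below only computes that schedule with the RFC default values; the function name `retransmitSchedule` is hypothetical, and whether sip.js uses exactly these values is an assumption not confirmed in this thread.

```javascript
// Hypothetical helper: returns the times (ms after the first 302 was
// sent) at which the response would be retransmitted if no ACK is seen.
// Defaults are the RFC 3261 values T1 = 500 ms and T2 = 4000 ms.
function retransmitSchedule(t1 = 500, t2 = 4000) {
  const timerH = 64 * t1;   // transaction gives up waiting for the ACK
  const times = [];
  let interval = t1;        // timer G starts at T1
  let elapsed = t1;
  while (elapsed < timerH) {
    times.push(elapsed);
    interval = Math.min(interval * 2, t2); // G doubles, capped at T2
    elapsed += interval;
  }
  return times;
}

console.log(retransmitSchedule());
```

Under these defaults that is ten retransmissions over about 32 seconds, which is consistent with a capture showing a long run of 302s after a single INVITE.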

So a problem with the LB, not with sip.js. I think.
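The persistence idea can be sketched as follows: if the balancer picks the backend from a hash of the Call-ID, the INVITE and its ACK (which share a Call-ID) reach the same instance. This is only an illustration of the concept, not Netscaler configuration; `pickBackend` and the backend addresses are hypothetical.

```javascript
const crypto = require('crypto');

// Hypothetical helper: maps a Call-ID to one backend deterministically,
// so every message of a call lands on the same sip.js instance.
function pickBackend(callId, backends) {
  const digest = crypto.createHash('sha256').update(callId).digest();
  const index = digest.readUInt32BE(0) % backends.length;
  return backends[index];
}

// Placeholder addresses (documentation range), not from the thread.
const backends = ['192.0.2.10:5060', '192.0.2.11:5060'];
const callId = 'a84b4c76e66710';

// The INVITE and the later ACK carry the same Call-ID,
// so both lookups return the same backend.
console.log(pickBackend(callId, backends) === pickBackend(callId, backends));
```

A plain modulo over the backend list remaps many calls when an instance is added or removed; a real deployment would rely on the balancer's own persistence feature instead.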