Clustering and joining calls

benlangfeld commented 10 years ago

From @mmcguinn:

Sorry for the delay on this, but running through some of our normal callflow scenarios I have another potential concern. The flow goes something like this:

CallA calls into an IVR menu

CallA uses the IVR to place a call to B, which must first be confirmed by B before joining

The IVR requests a call out to B (now CallB) and the IVR asks if B would like to accept the call.

B says yes (or presses 1 or whathaveyou) and then CallA and CallB are joined.

With a gateway arrangement, my concern is that because CallB is not dialed with a join request (as it must interact with the IVR first), the gateway will be free to hand off the dial request to a different node than the one handling CallA. The join then becomes more difficult, as it would require the two nodes to know about each other and communicate in order to complete the join.

Do you have any thoughts on how to deal with this sort of issue? Perhaps an optional hint inside dial requests (similar to the existing join option in Rayo) specifically to request that a gateway place the call on the same node as another call? Both these solutions leave something to be desired in my mind, given that the goal of having such a gateway is to isolate clients and nodes from these sorts of concerns.

benlangfeld commented 10 years ago

So I'll start by outlining my understanding of how Voxeo PRISM clusters are intended to handle this:

Commands re-written to nodes would also rewrite the uri attributes of <join> commands, in addition to the outer IQ#to indicating the target of the call.
Nodes would be aware of whether or not they were handling the call requested to be joined, as well as the target call of the command.
In cases where the node executing the command was not handling both calls, it would originate another channel to the node hosting the 2nd leg which would implicitly join the calls between nodes.
Alternatively, calls may be live-migrated between nodes to group them together for the purposes of a local join.
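As a rough illustration of the local-versus-inter-node decision described in the list above, here is a minimal Python sketch. The `Call` type, the `plan_join` function and the node names are hypothetical and are not part of Rayo or PRISM; the sketch only shows the branch a node would take, not how the extra channel is actually originated.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Call:
    call_id: str
    node: str  # hypothetical: identifier of the node hosting this call


def plan_join(executing_node: str, target: Call) -> Tuple[str, Optional[str]]:
    """Decide how the node executing a join reaches the target call.

    Returns ("local", None) when the executing node hosts the target call,
    otherwise ("bridge", <node hosting the target call>), meaning an extra
    channel would be originated towards that node.
    """
    if target.node == executing_node:
        return ("local", None)
    return ("bridge", target.node)
```

For example, `plan_join("node-1", Call("call-b", "node-2"))` selects the bridge path towards `node-2`, while the same call hosted on `node-1` would be joined locally.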

The inter-node bridging seems simple enough. It does require consideration of the security implications of the implicit join: these calls would need to be authenticated appropriately between the nodes. I think we can realistically include that in this spec (although at a reasonably high level) and expect nodes and deployments to implement it correctly.

I don't think we can expect nodes to implement call migration; this is significantly more complex and I'm not sure there's any standardisation of it generally. It has overlaps with failover should a node die with live calls. Additionally, I am somewhat uncomfortable about the ramifications of exposing semi-explicit control of load-balancing to untrusted clients; it's conceivable that a client could sufficiently weight calls onto individual nodes beyond the control of normal load-balancing to take an individual node down, causing major cascading failure.

I'm left in favour of specifying the semantics of inter-node joins at a high level. Thoughts?

crienzo commented 10 years ago

I had to solve a similar problem at my last job by using inter-node joins. It works fine. We can also reduce the need for this by using dial w/ nested join. Then, the gateway could inspect the request and put both calls on the same node.
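A minimal sketch of how a gateway might inspect a dial request for a nested join and co-locate the two calls. This is only an illustration under assumptions: `pick_node`, the `loads` map and the `call_nodes` map are hypothetical names, and the least-loaded fallback is a stand-in for whatever load-balancing a real gateway uses.

```python
from typing import Dict, Optional


def pick_node(
    loads: Dict[str, int],
    call_nodes: Dict[str, str],
    join_call_id: Optional[str] = None,
) -> str:
    """Choose the node that should handle a new dial request.

    loads:        hypothetical map of node name -> current call count
    call_nodes:   hypothetical map of call ID -> node hosting that call
    join_call_id: call ID from a nested join in the dial request, if any
    """
    # A nested join names an existing call, so prefer the node hosting it.
    if join_call_id is not None and join_call_id in call_nodes:
        return call_nodes[join_call_id]
    # No usable join target: fall back to the least-loaded node
    # (ties are broken by node name so the result is deterministic).
    return min(sorted(loads), key=loads.__getitem__)
```

Without a nested join, the dial for CallB in the scenario above would go to the least-loaded node, which is exactly how the two legs can end up on different nodes.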

crienzo commented 10 years ago

My only other thought was that there could be some kind of "call group" hint that you could assign to outbound calls and that is also assigned to inbound calls. Calls in the same group would then be given preference for the same node (though this would not be required).

benlangfeld commented 10 years ago

I'd rather steer away from the "call group" / node hint idea since it marks a clear explicit DoS vector. I'd also insist on making any implicit grouping by nested joins strictly an optional (MAY) optimisation, since it'll require some more advanced load-balancing/monitoring to avoid DoS attacks.

I'll write this up today.

benlangfeld commented 10 years ago

Written up here. Critique please, @mmcguinn / @crienzo? Have I missed anything out here? Have I been too vague? Perhaps we need to wait until this has an implementation to be sure?

crienzo commented 10 years ago

I disagree about the possible DoS vector. A gateway implementation would not be required to use any hints, especially if it decided there were too many calls on that server. It just seems unfortunate that most screened follow-me calls will be joined on different servers.
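To illustrate the point that a gateway can ignore a hint, here is a minimal sketch in which the hinted node is honoured only while it is below a capacity limit. The function name, the `loads` map and the `max_calls_per_node` threshold are assumptions made for the example, not anything defined by the spec.

```python
from typing import Dict, Optional


def pick_node_with_hint(
    loads: Dict[str, int],
    max_calls_per_node: int,
    hinted_node: Optional[str] = None,
) -> str:
    """Honour a co-location hint only if the hinted node has spare capacity.

    loads:              hypothetical map of node name -> current call count
    max_calls_per_node: hypothetical per-node limit chosen by the deployment
    hinted_node:        node suggested by the client's hint, if any
    """
    # The hint is advisory: unknown or saturated nodes are simply ignored.
    if hinted_node in loads and loads[hinted_node] < max_calls_per_node:
        return hinted_node
    # Otherwise balance normally by picking the least-loaded node.
    return min(sorted(loads), key=loads.__getitem__)
```

Because the hint never overrides the limit, a client cannot use it to push a node past `max_calls_per_node`; whether that is sufficient protection is the question being debated here.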

crienzo commented 10 years ago

I don't think it's in the scope of this document to require a secure channel between nodes. Are you attempting to define how two different rayo node implementations can join calls?

benlangfeld commented 10 years ago

I'm happy for nested joins to be used as a hint for co-location on a node and included that in the text. I guess we could expand that to a general hint with the same semantics, and as you say make it optional for the gateway to comply. I'll work that in.

As for the secure channel between nodes, I mean purely that nodes should ensure that only calls from trusted nodes are allowed to behave as these bridge proxies which automatically join a real call. Implementations are free to utilise any auth they like, be it simple firewalls, SIP Digest, etc. I'll make this clearer in the text, since I suspect you believed I meant encryption.
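As one possible reading of "only calls from trusted nodes", the sketch below checks the source address of an incoming bridge call against an allowlist before letting it join a real call automatically. The network range and the `may_auto_join` name are hypothetical, and an address allowlist is just one of the options mentioned (a firewall or SIP Digest would serve the same purpose).

```python
import ipaddress

# Hypothetical: networks from which other cluster nodes originate bridge calls.
TRUSTED_NODE_NETWORKS = [ipaddress.ip_network("10.0.0.0/24")]


def may_auto_join(source_ip: str) -> bool:
    """Return True if a bridge call from source_ip may implicitly join a call."""
    addr = ipaddress.ip_address(source_ip)
    # Only calls originating from a trusted node network act as bridge proxies.
    return any(addr in network for network in TRUSTED_NODE_NETWORKS)
```

A call arriving from outside the listed networks would be treated as an ordinary inbound call rather than being joined automatically.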

benlangfeld commented 10 years ago

I've addressed both of those comments. This is starting to look better. Rendered version at http://ci.mojolingo.com/job/Rayo%20Spec/283/artifact/extensions/rayo-clustering.html

crienzo commented 10 years ago

Looks good to me. Michael, any comments?

mmcguinn commented 10 years ago

Looks good to me as well, that covers the only real issue I came up with going over flows.

On Tue, Apr 22, 2014 at 4:06 PM, Chris Rienzo notifications@github.com wrote:

Look good to me. Michael, any comments?


rayo / xmpp

Clustering and joining calls #93