XEP-0198: Support for `location` when resuming the stream using Stream Management

woj-tek commented 3 years ago

[X] I understood that this template is only for feature requests and not for bug reports

[X] I have cross-checked this overview https://github.com/monal-im/Monal/issues/322 as well as filtered for related labels https://github.com/monal-im/Monal/labels

Describe your feature

Currently it seems that the `location` attribute is not respected by Monal: https://github.com/monal-im/Monal/blob/b357abd60c4154f743d25e8db5819f8f511f0902/Monal/Classes/xmpp.m#L1765 Would it be possible to add handling of the `location` attribute and, during resumption, try that address first, as per the specification:

> The <enabled/> element MAY include a 'location' attribute to specify the server's preferred IP address or hostname (optionally with a port) for reconnection, in the form specified in Section 4.9.3.19 of RFC 6120 (i.e., "domainpart:port", where IPv6 addresses are enclosed in square brackets "[...]" as described in RFC 5952 [6]); if reconnection to that location fails, the standard XMPP connection algorithm specified in RFC 6120 applies.
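
To make the quoted format concrete, here is a minimal Python sketch of how a client might split a `location` value into host and port. The function name `parse_location` is hypothetical and this is not Monal's code; it only assumes the "domainpart:port" form with bracketed IPv6 literals described above, and leaves the port as `None` when the server did not supply one.

```python
from typing import Optional, Tuple


def parse_location(location: str) -> Tuple[str, Optional[int]]:
    """Split 'host', 'host:port', '[v6]' or '[v6]:port' into (host, port).

    Hypothetical helper: the port is None when the value carries none.
    """
    location = location.strip()
    port_text: Optional[str]
    if location.startswith("["):
        # Bracketed IPv6 literal, optionally followed by ':port'.
        end = location.find("]")
        if end == -1:
            raise ValueError("unterminated IPv6 literal")
        host, rest = location[1:end], location[end + 1:]
        if rest and not rest.startswith(":"):
            raise ValueError("unexpected text after IPv6 literal")
        port_text = rest[1:] if rest else None
    else:
        # Hostname or IPv4 address, optionally followed by ':port'.
        host, sep, tail = location.partition(":")
        port_text = tail if sep else None
    if not host:
        raise ValueError("missing host")
    if port_text is None:
        return host, None
    if not (port_text.isascii() and port_text.isdigit()):
        raise ValueError("invalid port")
    port = int(port_text)
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    return host, port
```

An unbracketed IPv6 address is rejected rather than guessed at, since the colon would be ambiguous with the port separator.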

Echolon commented 3 years ago

Thanks for suggesting!

woj-tek commented 3 years ago

To add some context: I'm from the Tigase team and we run the public server tigase.im (sure.im/jabber.today), which is a clustered installation (behind a load balancer). Due to the lack of support for the location attribute, Monal users can't resume sessions correctly most of the time: they usually reconnect to a different cluster node, which isn't aware of the previous session, so the request fails.

tmolitor-stud-tu commented 3 years ago

Yes, it would be possible, but it's low priority because neither ejabberd nor Prosody supports the location attribute anyway. I guess Openfire does not support/use it either.

Echolon commented 3 years ago

@guusdk could you comment on this?

guusdk commented 3 years ago

For Stream Management to work properly in an Openfire cluster, the client must resume the stream on the same cluster node. To facilitate this, the location attribute was added in version 4.5.0 of Openfire.
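
As an illustration of where a client would pick this value up, here is a small Python sketch that reads the attributes of an `<enabled/>` element in the `urn:xmpp:sm:3` namespace. The function name `read_enabled` and the sample values in the test are made up for the example; it is not Openfire's or Monal's code.

```python
import xml.etree.ElementTree as ET

SM_NS = "urn:xmpp:sm:3"  # XEP-0198 namespace


def read_enabled(stanza_xml: str) -> dict:
    """Extract the resumption-related attributes from an <enabled/> element.

    Hypothetical helper: 'location' is None when the server sent none.
    """
    element = ET.fromstring(stanza_xml)
    if element.tag != "{%s}enabled" % SM_NS:
        raise ValueError("not a stream management <enabled/> element")
    return {
        "id": element.get("id"),
        "resume": element.get("resume") in ("true", "1"),
        "location": element.get("location"),
    }
```

The client would need to store `location` together with the session `id`, so that it is still available when the connection drops and a resume is attempted.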

woj-tek commented 3 years ago

I think the problem boils down to "popular community demand": the majority of current public deployments seem to use a single-node/single-server setup, in which case it doesn't matter. If clustering is involved, resumption simply breaks most of the time, as sessions are usually cached on the node on which the connection was established (which simplifies a lot of things; because of that we also use see-other-host to group connections of the same user on the same node, but that's OT :-) ).

Currently we are pondering simply not announcing Stream Management to Monal users. Detecting them would be somewhat tricky (considering SM negotiation, it would simply be based on the resource name, sigh), but that is still better than constantly failing to resume the session because the app knocks on the wrong machine...

guusdk commented 3 years ago

In my reasoning, features like these - those that are primarily desired by organisations that make use of a professional (clustered) environment - make for good candidates for bounties, or otherwise commissioned types of work.

I'm unsure what budget would be needed to fix this on the client side, but it is hard to imagine that it would be significantly more than the budget such an organisation would need just to decide whether said work is to be commissioned in the first place.

Given that the Monal project signed up to the Github Sponsors program, I suspect that there is an easy way here to get this feature realized.

(Please note that I'm in no way, shape or form associated with the Monal project, and I'm in no way trying to express a point of view of the Monal project team on this. I'm only sharing insight and experience from other OSS-based projects that I'm active in, that work for me).

woj-tek commented 3 years ago

I would disagree in a way - professional/paid environments tend to use software from a single provider.

I think that our tigase.im public installation is somewhat of an outlier here, and our desire to provide something with high availability (hence using a cluster) for the benefit of the users is simply unusual and thus warrants less focus, which is also understandable in a way. Now, we have already addressed it somewhat on our end (we basically drop such dangling sessions more pre-emptively), but this could simply mean less than optimal ergonomics for Monal users on the tigase.im public servers. Which are probably a (tiny?) minority :-)

weiss commented 3 years ago

> we already addressed it somewhat on our end

I would think you'd need to cope with clients attempting to resume on other nodes anyway, given the spec says that "if reconnection to that location fails, the standard XMPP connection algorithm specified in RFC 6120 applies." And reconnection might of course always fail due to random network hiccups or whatever.
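
A rough Python sketch of that fallback order, assuming the client already has the parsed `location` endpoint and the list of endpoints produced by the standard RFC 6120 resolution (SRV lookup and so on). All names here (`resumption_candidates`, `connect_first`, `try_connect`) are hypothetical, and `try_connect` stands in for whatever the client actually does to open a connection.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def resumption_candidates(
    location: Optional[Endpoint], standard: Iterable[Endpoint]
) -> List[Endpoint]:
    """Put the server-preferred location first, then the standard endpoints."""
    ordered: List[Endpoint] = []
    if location is not None:
        ordered.append(location)
    for endpoint in standard:
        if endpoint not in ordered:  # avoid retrying the same endpoint twice
            ordered.append(endpoint)
    return ordered


def connect_first(
    candidates: Iterable[Endpoint], try_connect: Callable[[Endpoint], bool]
) -> Optional[Endpoint]:
    """Return the first endpoint that accepts a connection, or None."""
    for endpoint in candidates:
        if try_connect(endpoint):
            return endpoint
    return None
```

If the endpoint that finally accepts the connection is not the preferred one, the server may still reject the resume, so the client has to be ready to fall back to a fresh session either way.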

BTW, I've always assumed the use case for location would be telling clients to resume on node B when A goes down. Hence I've been unhappy about having to make that decision on <enable/> and not even being able to update the location on <resume/>, which renders the feature pretty much useless for my use case. But yes, it obviously makes sense for telling clients to stick to the same node if you have no way to share the state required for resumption.

woj-tek commented 3 years ago

As we don't share session state across the nodes, if the client connects to a different location (one that doesn't know about the previous session), the resumption obviously fails and the client simply proceeds with regular session establishment. In our case the slight issue was the dangling sessions that weren't resumed.

I would agree that preemptive use of location would be somewhat inconvenient without the ability to update it, as a cluster can be quite dynamic. @weiss - does ejabberd clustering have shared state and allow resuming on whichever node?

weiss commented 3 years ago

> does ejabberd clustering have shared state and allow resuming on whichever node?

The state is copied over from the previous node during resumption. So if a node goes into a planned downtime, one option is to have that node stop accepting new connections, then kick remaining clients and maybe wait for a few minutes to give the clients the chance to resume on another node.

tmolitor-stud-tu commented 3 years ago

@woj-tek Do dangling sessions mean you have two sessions bound to the same XMPP resource: the new one opened by the client and the old one still hibernated by XEP-0198? Isn't that against the RFC, which mandates that all resources be unique (within the namespace of a bare JID, of course)?

Echolon commented 2 years ago

@woj-tek Kind reminder about the discussion here. Could you reply to the previous message?

woj-tek commented 2 years ago

Sorry @Echolon, it got lost in my notifications.

@tmolitor-stud-tu - no, we don't have multiple sessions bound to the same resource, and we enforce that across the cluster. However, when the client reconnects and attempts to resume the session on a different cluster node, the resumption is still impossible (which extends the connection time).

monal-im / Monal

XEP-0198: Support for `location` when resuming the stream using Stream Management #737
