Suggestion: Support session resurrection

wilhelmy commented 9 years ago

One of IRC's biggest drawbacks is that sessions are bound to the lifetime of a single TCP connection. This is what makes it unsuitable for the mobile networks of the current decade.

I suggest the following, provided that the client presents an SSL client certificate: Implement a "RESURRECT old-nickname" command as an alternative to the normal NICK/USER on connect. Provided that the client certificate matches the client certificate on the session of old-nickname, old-nickname's connection will be terminated and replaced by the new TCP connection. Since the client already knows the state, this eliminates the current nasty quit/join as well as the retransmission of most of the state.

Things to consider:

How should the server determine which data did not reach the client on the previous connection? Is there a portable way to determine how much data on a TCP connection was not yet ACKed by the client and to simply resend this data over the new socket? (Not quite so easy, since SSL and padding get in the way. The easiest way might be to send data, and check whether the previous data was ACKed before subsequent calls to write(2).)
What do we do if the hostmask changed because the client switched from 3G to wifi or vice versa?

Comments welcome, since this is a very rough draft (but something I would consider to be very important).

Edit: fixed markdown.

attilamolnar commented 9 years ago

The biggest problem in implementing this is, as you mention, knowing what the client had already processed. Relying on TCP ACKs is not portable and does not work for TLS. The only way that surely works is for clients to tell the server what they saw last and for servers to retain data after sending it. Switching to SCTP seems like a saner idea to handle this.

wilhelmy commented 9 years ago

Good point, but then people on mobile networks might suffer from their ISP blocking UDP or SCTP. (I could imagine mobile providers do this, given that some of them block big port ranges and make excessive use of NAT as well).

M2Ys4U commented 9 years ago

I think SCTP is a non-starter, especially as many routers and middleboxen break it.

A UDP-based protocol may work, it will be interesting to see how well QUIC works at scale and see if we can leverage that.

In the short term, it should be possible to develop some sort of ACK system.

Personally, I'd split this up into two pieces: an ack CAP and a resume CAP.

I've quickly sketched out what these CAPs could look like below:

`ack` CAP

When active, servers MUST add an ack message tag with a monotonically increasing sequence id to every message. For example:

@ack=1 :irc.example.net 001 example :Welcome to the Internet Relay Network example!eg@example.com

@ack=2 :irc.example.net 002 example :Your host is irc.example.net, running version exampleircd-1.0.0

Clients MUST respond to each message containing an ack tag with an ACK command which takes two params, the second of which is optional.

In the single param case, the param is just the sequence id of the command to acknowledge. In the two param case, the two params indicate an inclusive range of sequence ids to acknowledge.

Clients MAY use the ack tag in a similar fashion to ask the server to acknowledge its messages, but they do not have to add the ack tag to every message.

If the batch CAP is enabled, servers MAY choose to only add the ack tag to the BATCH command instead of adding it to every command inside the batch. In this case the client MUST NOT acknowledge the batch until it has received the entire batch.

The ack CAP SHOULD be sticky.
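A minimal sketch of the two mechanical pieces described above, assuming nothing beyond the draft text: tagging outgoing lines with a sequence id, and parsing the one- or two-parameter ACK command. The function names `tag_outgoing` and `parse_ack` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def tag_outgoing(seq: int, line: str) -> str:
    """Add the proposed ack tag to a server line.

    If the line already carries message tags, the ack tag is merged into
    the existing tag section (tags are separated by ';').
    """
    if line.startswith("@"):
        return f"@ack={seq};{line[1:]}"
    return f"@ack={seq} {line}"


def parse_ack(params: List[str]) -> Tuple[int, int]:
    """Return the inclusive (first, last) range acknowledged by an ACK command."""
    if len(params) == 1:
        first = last = int(params[0])
    elif len(params) == 2:
        first, last = int(params[0]), int(params[1])
    else:
        raise ValueError("ACK takes one or two parameters")
    if first > last:
        raise ValueError("ACK range is reversed")
    return first, last


print(tag_outgoing(1, ":irc.example.net 001 example :Welcome"))
print(parse_ack(["3", "7"]))
```

Treating the single-parameter form as a range of length one keeps the server's bookkeeping identical for both forms.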

`resume` CAP

To resume a session, the client sends a RESUME command instead of the usual NICK and USER commands. During capability negotiation, the server MUST NOT allow the session to be resumed if any sticky capabilities that were enabled previously on the session are not enabled on the new connection. The server MUST NOT allow a session to be resumed if the client did not previously authenticate itself using the sasl CAP.

If the server rejects resuming a session it MUST send the ERR_NORESUME numeric.

If the client receives the ERR_NORESUME numeric at any time during registration, it SHOULD attempt to register in the conventional manner, by sending the NICK and USER commands.

Once the client has resumed a session, previous connections MUST be closed by the server, and any data received on an old connection after a session has been resumed MUST be ignored. Servers MUST retransmit all messages that were not ACKed on the previous connection over the new connection.

The resume CAP MUST NOT be enabled unless the ack CAP is also enabled.

kythyria commented 9 years ago

@M2Ys4U TCP guarantees in-order delivery, so the server doesn't need to add a tag to every message, thus saving bandwidth. I'm also not sure it's useful for the client to not request acknowledgement for each message, especially since that makes it ambiguous as to when the counter increments: for each message whether or not it needs acking, or only for each @ack=?

RESUME will be annoying to use in bouncers as suggested, since you'd have to have new credentials for each client rather than for each upstream connection. One possible solution is that a server can send RPL_RESUMEKEY with an identifier in it that can be used as the parameter to RESUME. Conceivably, a sufficiently long and random identifier could be used instead of normal registration. For example:

<< CAP LS 302
>> CAP * LS :multi-prefix sasl=PLAIN,EXTERNAL ack resume
<< CAP REQ :multi-prefix sasl ack resume
-- SASL auth --
<< NICK alice
<< USER alice 8 * :Alice Capulet
>> @ack=1 :irc.example.net RESUMEKEY 5eed627-374a-45e2-bb11-7b32bb10f83c
>> @ack=2 :irc.example.net 001 alice :Welcome to the examplenet IRC network alice!alice@example.com

And when reconnecting:

<< RESUME 5eed627-374a-45e2-bb11-7b32bb10f83c
>> @ack=639 :bob!bob@montague.net PRIVMSG Alice :What made you choose to vacation in Nowheresville anyway?
>> @ack=640 :irc.example.net 001 alice :Welcome back alice!alice@example.com

Message 639 being one that was unacked when the server detected that the connection was broken.
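To make the resume-key idea concrete, here is a rough sketch of issuing and redeeming such a key. The class `ResumeKeys` and its methods are hypothetical, and the 32-byte key length is an assumption standing in for "sufficiently long and random"; nothing here is taken from an existing ircd.

```python
import hmac
import secrets
from typing import Dict, Optional


class ResumeKeys:
    """Maps opaque resume keys to the session (here just a nick) they belong to."""

    def __init__(self) -> None:
        self._sessions: Dict[str, str] = {}

    def issue(self, nick: str) -> str:
        # 32 random bytes, URL-safe encoded; the length is an assumption.
        key = secrets.token_urlsafe(32)
        self._sessions[key] = nick
        return key

    def redeem(self, presented: str) -> Optional[str]:
        # Compare in constant time so the key cannot be guessed byte by byte.
        for key, nick in self._sessions.items():
            if hmac.compare_digest(key.encode(), presented.encode()):
                return nick
        return None


keys = ResumeKeys()
alice_key = keys.issue("alice")
print(keys.redeem(alice_key))   # the session the key belongs to
print(keys.redeem("not-a-key")) # None: fall back to normal registration
```

A failed `redeem` corresponds to the ERR_NORESUME case in the earlier sketch of the `resume` CAP.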

This also has a fringe benefit: if the client sends @ack=, then if it's something like a WHOIS then responses can have a @replyto= with the same identifier in, to pair it with the originating command. Admittedly, this will be even more annoying for bouncer writers, considering it'll need NAT-like mapping tables to work properly.

M2Ys4U commented 9 years ago

@kythyria Perhaps the client doesn't need to ACK each message, but the server still needs to tag each message, so it knows which to re-transmit on resumption.

I guess the ACK command could send just one id, and that would implicitly ACK all lower ids as well.

As for the id, it's only intended for messages tagged using @ack, but either would work.

A resume key sounds reasonable. Although I'm not sure I get the point of the WHOIS and @replyto comment.

kythyria commented 9 years ago

I was thinking in terms of, every time you send a message for which you've requested an acknowledgement, add it to a buffer. When you get an acknowledgement for that message, remove it and every preceding message from the buffer. When you resume a connection, send the whole buffer.

You also need to keep track of the last sequence number you received, and only pass up to the rest of the application any messages with a higher one. Otherwise messages may be duplicated.
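The buffering scheme described in the two paragraphs above can be sketched roughly as follows. `SendBuffer` and `Receiver` are hypothetical names, and sequence numbers starting at 1 are an assumption.

```python
from collections import deque
from typing import Deque, List, Tuple


class SendBuffer:
    """Sender side: keep every message until it is acknowledged."""

    def __init__(self) -> None:
        self._next_seq = 1
        self._unacked: Deque[Tuple[int, str]] = deque()

    def send(self, line: str) -> int:
        seq = self._next_seq
        self._next_seq += 1
        self._unacked.append((seq, line))
        return seq

    def ack(self, seq: int) -> None:
        # An ACK removes that message and every preceding one.
        while self._unacked and self._unacked[0][0] <= seq:
            self._unacked.popleft()

    def replay(self) -> List[Tuple[int, str]]:
        # On resume, the whole remaining buffer is sent again.
        return list(self._unacked)


class Receiver:
    """Receiver side: drop anything at or below the last sequence number seen."""

    def __init__(self) -> None:
        self.last_seen = 0

    def accept(self, seq: int) -> bool:
        if seq <= self.last_seen:
            return False  # duplicate from a replay
        self.last_seen = seq
        return True


buf = SendBuffer()
for text in ("one", "two", "three"):
    buf.send(text)
buf.ack(2)
print(buf.replay())
```

The receiver-side check is what prevents duplicates when a replayed message had in fact arrived but its ACK was lost.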

The bit about WHOIS and @replyto is just noting that you can use the sequence numbers as message IDs to tie query and reply together... provided you don't mind translating the numbers every time it passes from one connection to another. So it's probably not all that useful.

ShutterQuick commented 9 years ago

Does the server even need to keep track of what each client has received or not? Sounds a lot simpler to use client activity as markers. Each message gets an ID as you have described above, and upon resumption the buffer is sent from the point of last client activity (e.g. message, pong, etc.). Then it would be up to the client to match ID's of the received messages and pick what it's interested in.

> The server MUST NOT allow a session to be resumed if the client did not previously authenticate itself using the sasl CAP.

@M2Ys4U

Why should this functionality require SASL?

kythyria commented 9 years ago

@ShutterQuick That's not reliable: messages can cross paths in mid-flight:

Server sees:
S: 1
S: 2
C: A
** Connection failure **
Client sees:
S: 1
C: A
C: B
** Connection failure **

Now, if you use arbitrary activity as an ACK, the server thinks its second message was received, when actually it wasn't.
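The trace above can be replayed as a small sketch. All function names are hypothetical; the point is only to show which messages are lost under each policy when the server sent 1 and 2 but the client received only 1 before sending A.

```python
from typing import List


def retained_with_activity_as_ack(sent: List[int]) -> List[int]:
    """Any client line counts as an ACK for everything the server has sent,
    so the server keeps nothing."""
    return []


def retained_with_explicit_ack(sent: List[int], acked_up_to: int) -> List[int]:
    """The client names the last id it saw; the server keeps everything after it."""
    return [seq for seq in sent if seq > acked_up_to]


def lost_messages(sent: List[int], received: List[int], retained: List[int]) -> List[int]:
    """Messages the client never got and the server can no longer resend."""
    return sorted(set(sent) - set(received) - set(retained))


server_sent = [1, 2]
client_received = [1]

print(lost_messages(server_sent, client_received,
                    retained_with_activity_as_ack(server_sent)))
print(lost_messages(server_sent, client_received,
                    retained_with_explicit_ack(server_sent, acked_up_to=1)))
```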

And it doesn't strictly require SASL; see my first comment.

wilhelmy commented 9 years ago

Okay, scratch SASL and SSL client certificates, the resume key idea is the way to go IMHO. Thanks, @kythyria.

As for @ShutterQuick's proposal: The client could actually send back the ID of the last message it received prior to disconnect as part of the resume negotiation and the server could simply resend everything that was sent past that point. There would be no need to ACK anything at all.
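A rough sketch of that variant, assuming a hypothetical `RESUME <key> <last-id>` parameter layout (the thread does not fix a syntax) and a server-side history of `(id, line)` pairs:

```python
from typing import List, Tuple


def parse_resume(params: List[str]) -> Tuple[str, int]:
    """Split the hypothetical 'RESUME <key> <last-id>' parameters."""
    if len(params) != 2:
        raise ValueError("expected a resume key and the last received id")
    key, last_id = params
    return key, int(last_id)


def messages_to_resend(history: List[Tuple[int, str]],
                       last_received_id: int) -> List[Tuple[int, str]]:
    """Everything the server sent after the id the client reports."""
    return [(seq, line) for seq, line in history if seq > last_received_id]


history = [(638, "PING :irc.example.net"),
           (639, ":bob!bob@montague.net PRIVMSG Alice :hello"),
           (640, ":bob!bob@montague.net PRIVMSG Alice :still there?")]
key, last_id = parse_resume(["some-resume-key", "638"])
print(messages_to_resend(history, last_id))
```

Without any ACKs during the session, the server has to keep `history` for at least as long as a resume is allowed, which is the buffer-size trade-off raised in the replies.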

I'm glad to see some interest in the proposal.

ShutterQuick commented 9 years ago

@wilhelmy Yeah that makes a lot more sense. I like it.

kythyria commented 9 years ago

@wilhelmy Explicit acks allow the buffer to be smaller. You'd typically only have to store the last few seconds worth of messages, rather than the last 3-4 minutes (however long ping timeout is set to), at least if clients ack reasonably promptly. IDK if that's a concern or not.

I'd also make the buffer a few minutes deeper after a disconnection is registered, just in case reconnecting takes a while.

wilhelmy commented 9 years ago

@kythyria Well I was thinking about a combination of what @ShutterQuick and I said: Periodic ACKs each time the client sends a message to the server to flush the buffer up to that point as well as resuming from a certain point. Not sure which is worse, but I think explicit ACK after each message wastes a lot of bandwidth compared to wasting a maximum of a few hundred KB of RAM per client.

Assuming 200KiB per client and 1024 locally connected clients, this still means only 200MiB RAM wasted, and the realistic number would probably be much lower, given that 200KiB per client would imply that all clients are timing out simultaneously. Considering current memory costs, this sounds negligible.
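The worst-case figure above can be checked with a few lines of arithmetic, using only the numbers given in the comment:

```python
KIB = 1024
MIB = 1024 * KIB

per_client_bytes = 200 * KIB   # assumed buffer per client
local_clients = 1024           # assumed locally connected clients

worst_case_bytes = per_client_bytes * local_clients
worst_case_mib = worst_case_bytes / MIB

print(worst_case_mib)  # 200.0
```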

I assume this also means we will need to increase typical SendQ limits as well as perhaps the duration of a ping timeout. Also, a client needs to reconnect to the exact same server it was previously connected to in order to avoid having to move state across ircds.

ShutterQuick commented 9 years ago

@wilhelmy I think you're being very pessimistic. For channels, implementations would probably use a playback buffer shared between the clients, so you could probably have quite deep buffers without incurring any significant memory penalty.

kythyria commented 9 years ago

@wilhelmy Good point. It's particularly irrelevant when in a bouncer since you're probably saving a potentially very large buffer anyway that spans hours or even days.

And yes, it might be part of the thing SendQ limits control: Make the buffer bigger, and only purge things from it when acked. The other thing to change might be the "ping timeout" message: if a session is deemed to have gone away for good, the displayed time should be longer than the age of the oldest unacked message, to communicate a pessimistic estimate of when the user went away.

edit: You'd also purge from the send buffer when something's old enough you would have gotten an error if the client hadn't received it. Also, I wonder if PING and PONG could be used as ACKs somehow.

kaniini commented 9 years ago

@kythyria wrote:

> @M2Ys4U TCP guarantees in-order delivery, so the server doesn't need to add a tag to every message, thus saving bandwidth.

TCP does not guarantee delivery. It only guarantees that what you do receive will be delivered in order. The ack cap, as I understand it, would be used for determining at which point a network partition occurred.

synandro commented 9 years ago

Something like the mosh session state protocol would make sense here. If the sendq gets filled when using this, simply stop adding to the sendq at all. You'd lose anything that happened while you weren't connected, but your client would still be there. The real downside is that it seems like an easy way for nick jupe bots to park themselves on nicks. Maybe not so much of an issue on networks with nickserv, but on networks without nickserv it sounds like it would be a major issue. As much as I'd LOVE something like this, especially for mobile clients that switch IP addresses, I'm just not sure of the best way to do it without getting pwnt.

kythyria commented 9 years ago

@synandro I at least was considering this more from the perspective of bouncers than servers specifically, and can't see ones lacking services turning this feature on. I'd also expect there to be a timeout after which the client is considered to have gone away entirely (possibly retroactively stuffing things into the buffer in the bouncer case)

Silently eating messages without dropping the session sounds horrible, not least because it severely exaggerates the problems caused by pingout quits coming well after messages have already been eaten.

synandro commented 9 years ago

Well, if they are directed at the client (and not the channel), you just send an AWAY notification (after having set them away). The issue here is, where do you stop building sendq? Surely you don't want to just let it keep growing, otherwise you'll chew through all of your memory really fast.

Oh and you asked who I was...ircd-ratbox coder..aka AndroSyn....

wilhelmy commented 9 years ago

The proposal doesn't require services, if a resume-key is specified on first connect, and the client uses the key to reestablish the session on a subsequent connection.

Furthermore, I'd definitely add a time-out, just a longer one than currently, to give mobile clients the possibility to switch networks, acquire a DHCP lease and reconnect, etc.

I think one way to make nick-hogging impossible on networks which don't want to support services would be to introduce a second, shorter timeout after which the nickname is considered free. If someone then uses NICK thenickname, the connection which is currently timing out is simply terminated, stating "Ping timeout" as the reason. Something like a nick-hog timeout of 120-320 seconds (comparable to whatever's currently used for ping timeout) and a definite connection-is-dead timeout of approximately 10 minutes. After the nick-hog timeout, one could additionally mark the connection as away. Comments?
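The two-timeout idea could look roughly like this. The constant values are assumptions picked from the ranges mentioned above, and the state names and function names are hypothetical.

```python
NICK_HOG_TIMEOUT = 240   # seconds; assumed value within the suggested 120-320 range
DEAD_TIMEOUT = 600       # seconds; "approximately 10 minutes"


def session_state(seconds_since_last_contact: float) -> str:
    """Classify a disconnected-but-resumable session."""
    if seconds_since_last_contact >= DEAD_TIMEOUT:
        return "dead"        # session is dropped, no resume possible
    if seconds_since_last_contact >= NICK_HOG_TIMEOUT:
        return "nick-free"   # still resumable, but the nick may be taken
    return "held"            # resumable and the nick is protected


def may_take_nick(seconds_since_last_contact: float) -> bool:
    """Whether another client's NICK for this nickname should succeed.

    If it does, the timing-out session is terminated with "Ping timeout".
    """
    return session_state(seconds_since_last_contact) != "held"


print(session_state(30), session_state(300), session_state(700))
```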

B00mX0r commented 8 years ago

@synandro Given that different servers will have different amounts of memory, and given that different IRCds will manage memory differently, it seems best to let IRCds/server owners decide the sendq limit. I like your mosh idea, but I feel like it fits more into an IRCd suggestion than an RFC.

As for jupe bots, the problem could be partially mitigated by unparking the held nick if any new connections are established from the IP of the dead client (regardless of nick). So if I join as A, timeout, then join as B, A will be released.

SadieCat commented 8 years ago

> if any new connections are established from the IP of the dead client (regardless of nick).

This won't work in practice because carrier-grade NAT is a thing.

B00mX0r commented 8 years ago

@SaberUK I don't think carrier-grade NAT is a large enough concern in regards to detecting parked nicks. In terms of resuming a previous session, yes, IP itself is not enough to go off of, which is why session keys that only the client and server know would exist. For parked nicks, IRC daemons could specify a config value for how many sessions can exist on the same IP before the daemon starts considering nicks parked. If IRC were very worried about carrier-grade NAT, then we should not have klines, glines, or zlines either. Carrier-grade NAT poses a risk to them, too, but not one that is substantial enough to make IP an invalid way of hypothesizing the identity of a person on IRC.

In regards to an earlier point about detecting when a client disconnected, the client should send the unix timestamp of the last received packet upon attempting to resume a session. The server will send all messages after that timestamp that the server has saved. Again, memory management should be handled by the IRCd, since it is based on server resources, which vary significantly from server to server. Additionally, I propose that when the server discards an old message from its memory, it stores the timestamp of the most recently discarded message. This means that the server will know if it is missing some messages that were sent after the client's socket ended, and can send a numeric to the client alerting it that it is only sending messages after (x) time.
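A minimal sketch of the discard-watermark idea from the paragraph above. The class `ReplayLog`, its size limit, and the returned `gap` flag are all hypothetical; which numeric the server would send when a gap is detected is left open in the thread.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class ReplayLog:
    """Bounded log of sent messages that remembers what it last threw away."""

    def __init__(self, max_messages: int) -> None:
        self._max = max_messages
        self._log: Deque[Tuple[float, str]] = deque()
        self.last_discarded_ts: Optional[float] = None

    def record(self, ts: float, line: str) -> None:
        self._log.append((ts, line))
        while len(self._log) > self._max:
            # Remember the timestamp of the most recently discarded message.
            self.last_discarded_ts, _ = self._log.popleft()

    def resume(self, client_last_ts: float) -> Tuple[List[str], bool]:
        """Return the lines to resend and whether some were already discarded."""
        gap = (self.last_discarded_ts is not None
               and self.last_discarded_ts > client_last_ts)
        lines = [line for ts, line in self._log if ts > client_last_ts]
        return lines, gap


log = ReplayLog(max_messages=2)
for ts, line in ((10.0, "a"), (20.0, "b"), (30.0, "c")):
    log.record(ts, line)
print(log.resume(5.0))   # "a" was discarded after the client's last timestamp
print(log.resume(20.0))  # nothing the client needs has been discarded
```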

To reiterate my major point, concerns like server resources and the number of sockets allowed before something is considered a parked nick should be handled by the individual IRCds, as they can vary significantly between IRC servers.

kythyria commented 8 years ago

Unix timestamps are unsuitable, as clocks can differ, and do. For the server-to-client direction, strictly increasing message IDs would be more appropriate (you only need to explicitly send them occasionally) so long as clients can tolerate holes.

B00mX0r commented 8 years ago

@kythyria Relative time matters; absolute time does not. The client can send what it thinks the current unix timestamp is minus what it thinks the unix timestamp of the last received packet is.

Message IDs do solve this issue, but they are unnecessary.

kythyria commented 8 years ago

That still doesn't seem advisable given that it can take sometimes substantial time for things to fly through the network. I'd rather use something that eliminates the problem completely than try and work out whether it's a problem in this case.

B00mX0r commented 8 years ago

@kythyria I think that if clients want to ensure they receive every missed message and are worried about network lag, then they should utilize IRCv3.2's server-time extension.

My question regarding message IDs is what would the format be like? How are they generated? How long are they? How do you ensure no duplication if they are random? I'm concerned that message IDs are a whole new rabbit hole of deciding how they should be designed. What do you have in mind right now?

kythyria commented 8 years ago

It's only really hard if the IDs have to be global. If they don't, then starting at 1 for registration and counting up from there is sufficient for server-side IDs.

grawity commented 8 years ago

> How do you ensure no duplication if it is random?

a) append a timestamp to the random ID; b) make it long enough that you can just leave it to chance (as with UUIDs/GUIDs). A 128-bit UUID is only 22 base64 characters.

In email, Message-IDs are qualified with the sender's fqdn. If IRC message IDs only need to be unique within the network, then combine them with the server's SID.

For the 'local' part, pick 1-2 of: {random uuid; current time; ircd boot time; ircd-wide message counter}. We can specify some examples, but I don't think the format needs to be identical across all ircds – as far as the client is concerned, it's an opaque string.
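As one possible instance of the scheme above, the sketch below combines a server SID with a random UUID encoded as 22 base64 characters. The function name `make_msgid`, the `.` separator, and the example SID `42X` are assumptions for illustration; to a client the result would still be an opaque string.

```python
import base64
import uuid


def make_msgid(sid: str) -> str:
    """Build a network-unique message id from the server's SID and a random UUID."""
    raw = uuid.uuid4().bytes  # 128 random-ish bits
    # 16 bytes encode to 24 base64 characters, 2 of which are '=' padding.
    local = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return f"{sid}.{local}"


msgid = make_msgid("42X")
print(msgid, len(msgid.split(".", 1)[1]))
```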

janicez commented 5 years ago

I know the TS4 document (does anyone remember that?) proposed a partial resume function, way back in the dark days when TS3 was the law of the land and there was only EFnet. It applied in the case of technical /kills (nickname collisions), where the connection remained open and the client was told to enter a new nickname. (TS6 and InspIRCd servers replicate this function by changing a user's nickname to their UID, and not in the noisy way TS4's nicklost resume would.) A separate resume function used a cookie as proposed here, but instead of restoring the full session, it just gave the client ops in channels where they had been an operator within the last quarter hour, if they joined them after reconnecting with the cookie.

justjanne commented 5 years ago

This is basically implemented in the new draft/resume spec, and discussion should probably be moved to #306 and related specs #362 and #393

slingamn commented 5 years ago

There's also oragono.io/bnc (tl;dr the server acts as a bouncer, allowing multiple clients to opt in to sharing the same nickname, as long as they both authenticate with SASL to the associated account).
