Robust recovery when documents get out of sync

In some situations, if editors get out of sync with each other, Slate Collaborate does not recover gracefully, which results in lost edits.

Robust recovery will allow the editors to resync quickly. In the case of a sync failure, the collaborative editor will sync up to the last state on the collaboration server. Edit loss (if any) should be kept minimal, on the order of a few keystrokes.

Internet connection is offline: If the client is currently offline, the plugin will wait until the client is online before trying to establish a connection.

Internet connection disconnected: If the client managed to establish a connection but the client is no longer online, the client should wait until it is online to attempt to re-establish a connection.

Any outstanding changesets should be kept until the connection is re-established. No new changesets should be accepted while offline. Should the connection be restored, any outstanding changesets should be re-sent to the server.
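Here is a rough sketch of the buffering rule in the paragraph above: keep what is outstanding, refuse new changesets while offline, and re-send on reconnect. The class name `OutstandingChangesets` and its methods are made up for illustration and are not part of Slate Collaborate.

```javascript
// Hypothetical buffer for changesets that the server has not acknowledged yet.
class OutstandingChangesets {
  constructor() {
    this.pending = [];     // changesets waiting for acknowledgement
    this.connected = true; // assumption: we start out connected
  }

  // Returns false when the changeset is refused because we are offline.
  add(changeset) {
    if (!this.connected) return false; // no new changesets while offline
    this.pending.push(changeset);
    return true;
  }

  // Losing the connection keeps everything already in `pending`.
  disconnect() {
    this.connected = false;
  }

  // `send` is whatever function transmits one changeset to the server.
  reconnect(send) {
    this.connected = true;
    for (const changeset of this.pending) send(changeset);
  }
}
```

Whether the editor should be made read-only while `add` is refusing changesets is a separate UI decision that this sketch does not cover.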

Backend connection failed to connect: If the client is online but the connection to the backend is never established, the client should attempt to connect immediately, again after 5 seconds, and then every 30 seconds thereafter.
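The retry schedule above could be expressed as a small function that maps an attempt number to a delay. This is only a sketch: `retryDelayMs` and its option names are hypothetical, and it assumes the delays are measured between consecutive attempts.

```javascript
// Hypothetical helper: how long to wait (in ms) before a given retry.
// Attempt 0 is the first retry after the connection fails.
function retryDelayMs(attempt, { secondDelayMs = 5000, steadyDelayMs = 30000 } = {}) {
  if (attempt === 0) return 0;             // retry immediately
  if (attempt === 1) return secondDelayMs; // then wait 5 seconds
  return steadyDelayMs;                    // then a fixed interval thereafter
}
```

Keeping the steady interval as an option means the same function covers a shorter interval if one is chosen later, for example `retryDelayMs(2, { steadyDelayMs: 10000 })`.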

Backend connection break: If the client is still online but the connection to the backend is lost, the client should retry immediately, again after 5 seconds, and then every 30 seconds thereafter until the connection is re-established.

Any outstanding changesets should be kept until the connection is re-established. No new changesets should be accepted while disconnected. Should the connection be restored, any outstanding changesets should be re-sent to the server.

Changeset failed to apply: If the client receives a changeset and the client’s outstanding changeset can no longer be processed, the client should discard all outstanding changesets.

Hi Wayne,

Thanks for putting this together!

I'd like to propose a few changes so this will work with Issue #3.

Note: This proposal turned out to be more complex than I first thought so my apologies. I feel like it is better if we get it right the first time though rather than having to rewrite it later.

The Problem

We don't want to keep session data around indefinitely because we want the app's database to be the source of truth for the document, not the session data in Redis.

If a user can reconnect at any time, we can run across this issue:

User A: loads a document from the app's DB. It is at version 1. But User A can't connect for some reason. User A's kid just dropped ketchup on the floor and so User A closes their laptop lid.
User B: loads the same document from the app's DB at around the same time. It is also at version 1. He makes a number of edits; let's say he even works on the document for 30 minutes. Then he saves it as version 2.
User A: They open their laptop an hour later and now they have a connection. User B's session has ended (and has been cleared from Redis). User A's editing session starts, but with data from 1 hour ago. When User A saves at version 3, it overwrites everything that User B did in version 2.

Proposal

We allow the developer to set a timeout on the editing session. This would be set on the plugin and on the collaboration server. The one on the client should be shorter than the time on the server. Let's call it CLIENT_SESSION_TIMEOUT and SERVER_SESSION_TIMEOUT where CLIENT_SESSION_TIMEOUT < SERVER_SESSION_TIMEOUT. For example, we will say it is 4 minutes for CLIENT_SESSION_TIMEOUT and 5 minutes for SERVER_SESSION_TIMEOUT.
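As a sketch of the constraint above, a startup check could reject a configuration where the client timeout is not strictly shorter than the server timeout. The function name and the millisecond units are assumptions for illustration; only the two constant names come from this proposal.

```javascript
const MINUTE_MS = 60 * 1000;

// Hypothetical check: enforces CLIENT_SESSION_TIMEOUT < SERVER_SESSION_TIMEOUT.
function checkSessionTimeouts(clientSessionTimeoutMs, serverSessionTimeoutMs) {
  if (!(clientSessionTimeoutMs > 0) || !(serverSessionTimeoutMs > 0)) {
    throw new RangeError("session timeouts must be positive numbers");
  }
  if (clientSessionTimeoutMs >= serverSessionTimeoutMs) {
    throw new RangeError("CLIENT_SESSION_TIMEOUT must be less than SERVER_SESSION_TIMEOUT");
  }
  return { clientSessionTimeoutMs, serverSessionTimeoutMs };
}

// The example values from the text: 4 minutes on the client, 5 on the server.
const timeouts = checkSessionTimeouts(4 * MINUTE_MS, 5 * MINUTE_MS);
```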

The purpose of the keep-alive is to allow the user to reconnect after a dropped connection for a limited time (perhaps 5 minutes by default).

One other thing we add to the proposal is a third state. Currently, we have ready and !ready. I propose we change this to connecting (which can also mean reconnecting), ready and no-session (which can mean session ended, session timed out, or no connection to the Internet). If a user gets a no-session, the correct course of action is to refresh the browser which can be done by the user or programmatically by the developer to make it seamless.
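The three states could be modelled as a tiny transition function. This is a sketch, not the plugin's actual API: the event names are hypothetical, and it assumes no-session is terminal until the page is refreshed, as described above.

```javascript
// Hypothetical transition function for the three proposed states:
// "connecting", "ready" and "no-session".
function nextState(state, event) {
  // Once there is no session, only a browser refresh gets us out.
  if (state === "no-session") return "no-session";
  switch (event) {
    case "connected":
      return "ready";
    case "disconnected":
      return "connecting"; // "connecting" also covers reconnecting
    case "session-ended":
    case "session-timed-out":
      return "no-session";
    default:
      return state; // unknown events leave the state unchanged
  }
}
```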

Overview

The general idea is that within the server session timeout, the session is preserved on the collaboration server, and we can confidently connect to the collaboration server within the client session timeout (which is less than the server session timeout) without having to refresh the document data from the application's database.

In a case where we can't make an initial connection (whether because we are offline or because the backend connection is failing), we are okay to try within the client session timeout. After the client session timeout, we want to indicate that there could have been an editing session by another user that started and then ended; this would mean that our starting document data would be wrong. We indicate this by setting the state to no-session.
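The initial-connection rule above boils down to a deadline check. A minimal sketch, assuming timestamps in milliseconds and a hypothetical function name:

```javascript
// Hypothetical: decide the state after a failed initial connection attempt.
// `startedAtMs` is when the document was loaded and connecting began.
function stateAfterFailedAttempt(startedAtMs, nowMs, clientSessionTimeoutMs) {
  const elapsedMs = nowMs - startedAtMs;
  // Past the client session timeout, another session may have started and
  // ended, so our starting document can no longer be trusted.
  return elapsedMs >= clientSessionTimeoutMs ? "no-session" : "connecting";
}
```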

In a case where we were connected but then got disconnected, we try to reconnect. If there is no session existing, we discard data because we don't know if there have been any edits from other users while we were disconnected. If there is an existing session, we can update it with the changesets. If there is an existing session but it's a different session, we also discard the changesets.
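The three reconnect outcomes above can be written as one decision function. This sketch assumes the server can report an identifier for the current session (or `null` when none exists); `sessionId` and the function name are hypothetical.

```javascript
// Hypothetical: decide what to do with outstanding changesets on reconnect.
function onReconnect(previousSessionId, serverSessionId, outstanding) {
  if (serverSessionId == null) {
    // No session on the server: other users may have edited and saved.
    return { action: "discard", changesets: [] };
  }
  if (serverSessionId !== previousSessionId) {
    // A session exists, but it is not the one we were editing in.
    return { action: "discard", changesets: [] };
  }
  // Same session: safe to update it with our outstanding changesets.
  return { action: "resend", changesets: outstanding };
}
```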

Details

With this in mind, let's review the scenarios. Words in italics are the same as in the original proposal:

Internet connection is offline: If the client is currently offline, the plugin will wait until the client is online before trying to establish a connection. Wait up to CLIENT_SESSION_TIMEOUT (4 minutes) and then we end up in no-session. If the user connects within the 4 minutes, we are safe because the Redis session data was saved for 5 minutes. No opportunity for stale document data to be introduced.
Internet connection disconnected: If the client managed to establish a connection but the client is no longer online, the client should wait until it is online to attempt to re-establish a connection.

Any outstanding changesets should be kept until the connection is re-established. No new changesets should be accepted while offline. Should the connection be restored, any outstanding changesets should be re-sent to the server.

There are two situations where the changesets should be rejected: (1) there is no session data in Redis, in other words the session timed out by reaching the SERVER_SESSION_TIMEOUT; (2) there is session data but it is from a different session than the one we were previously on.
Backend connection failed to connect
If the client is online but the connection to the backend is never established, the client should attempt to connect immediately, again after 5 seconds, and then every 10 (not 30) seconds thereafter. Wait up to CLIENT_SESSION_TIMEOUT (4 minutes) and then we end up in no-session. If the user connects within the 4 minutes, we are safe because the Redis session data was saved for 5 minutes. No opportunity for stale document data to be introduced.
Backend connection break
If the client is still online but the connection to the backend is lost, the client should retry immediately, again after 5 seconds, and then every 10 (not 30) seconds thereafter until the connection is re-established.

Any outstanding changesets should be kept until the connection is re-established. No new changesets should be accepted while disconnected. Should the connection be restored, any outstanding changesets should be re-sent to the server.

There are two situations where the changesets should be rejected: (1) there is no session data in Redis, in other words the session timed out by reaching the SERVER_SESSION_TIMEOUT; (2) there is session data but it is from a different session than the one we were previously on.
Changeset failed to apply
If the client receives a changeset and the client’s outstanding changeset can no longer be processed, the client should discard all outstanding changesets.

If the session on the server exists, resync to the session that exists on the server. If there is no session on the server, set no-session.
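For the last scenario, the recovery step could look roughly like this. It is a sketch under the assumption that the client can ask whether a server session still exists; the function and field names are hypothetical.

```javascript
// Hypothetical: what to do when an incoming changeset means our outstanding
// changesets can no longer be processed.
function onChangesetApplyFailure(serverSessionExists) {
  return {
    outstanding: [], // always discard all outstanding changesets
    // Resync to the server's session if there is one, otherwise give up.
    next: serverSessionExists ? "resync" : "no-session",
  };
}
```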

Note: I think in the long run, there should be a way to do this:

Less opinionated and more dynamic, like the authorization/authentication callbacks
Completely on the server so we don't need to add a client timeout

I'm trying to see if I can come up with a design that works, but I feel like I need time to think through the scenarios. The above design should work though IMO.

DRAFT:

IGNORE THIS FOR NOW. JUST THINKING OUT LOUD AND KEPT FOR REFERENCE/IMPROVEMENT.

This may be a crazy idea, but it might solve the source-of-truth problem so that the developer can choose:

Whether or not to allow for online only or offline mode
Whether to keep changeset history or just snapshots of documents for history or no history
How long to keep changeset history for

Overview

The basic idea is that it works like this.

When a user connects or reconnects, on the callback to some method which I'll call initSessionState, we return a Slate document and an array of changesets like { document: { ... }, changeSets: [ ... ] }. If, when the user connects, there is already a session state, the initSessionState method doesn't get called and we are good to go. One of the options that gets passed in is staleDuration which is how long the document has been sitting (i.e. stale) since it got loaded into memory last.

Here's a sample of an initSessionState:

function initSessionState({params, staleDuration}) {
  if (staleDuration > 5 * MINUTES) {
    throw new NoSessionError("")
  }
  return {
    document: params.document, // client passes the document through
    changeSets: [], // no changesets
  }
}

Another thing that could happen is that the client passes in a database id

async function initSessionState({articleId}) {
  /* In this scenario, there is no need to test for staleDuration because we grab the data fresh from the db */
  const article = await db.table('articles').getById(articleId)
  return {
    document: article.body,
    changeSets: [], // no changesets
  }
}

But let's say the developer wanted to support an offline mode. They can then save the changesets to their database and repopulate them into Redis.

async function initSessionState({articleId}) {
  /* In this scenario, we grab the data fresh and we also populate with 24 hours of changesets */
  const article = await db.table('articles').getById(articleId)
  const changeSets = await db.table('changesets').query({articleId, insertedAt: AFTER_24_HOURS_AGO })
  return {
    document: article.body,
    changeSets, // the last 24 hours of changesets loaded above
  }
}

When a session ends, we can return a retention time.

// forgot the actual name of the method
async function allSessionsInDocumentEnd({params, documentId, document, changeSets}) {
  // store the final document in the same `body` column that initSessionState reads
  await db.table('articles').set({body: document}).where({id: params.articleId})
  // this query is not real, but you get the idea. The changeSets may have ids and
  // some may already be in db. Just assume this does the right thing.
  await db.table('changeSets').upsert(changeSets)

  // keep the session alive for 5 minutes after the last user leaves
  return {
    sessionKeepAlive: 60 * 5 // how long to keep session data alive for in Redis
  }
}

The general idea here is that developers can choose how long to keep sessions around in Redis. When getting started, developers can just set a sessionKeepAlive of 5 minutes or so. In initSessionState, the developer can just make sure that the current document isn't more than 5 minutes stale.

Later, the developer can choose to save part of the changesets. This would allow for offline editing for as long as the developer wants to keep them. It would also solve the source-of-truth problem. The source of truth would still be the developer's database, but let's say that they did a programmatic update of a document. They would have a lot of freedom to do what they want. They could:

Update the article.body and delete the changeSets for that article.
They could connect to the collaboration server, send the operation through, and that would take care of the changeSets.
They could update the article.body and insert the corresponding changeSets manually.

It becomes a very flexible and powerful system that can grow with the developer's needs. It's also interesting because the developer can prune the changeSets as they desire or keep them forever if they want to. And it makes sense not to fill Redis up with all the history from old editing sessions.

This change does not include the changes RE: CLIENT_SESSION_TIMEOUT and SERVER_SESSION_TIMEOUT as these are concerns that will be covered in #3.

The changes this includes are the following:

If the internet connection is lost, the connection is re-established when possible and any pending changesets are re-sent.
If the client changesets cannot be transformed, any outstanding changesets will be reset.
If the server changesets cannot be applied, the document will be reset.
