openhab / openhab-cloud

Cloud companion for openHAB instances
Eclipse Public License 2.0
314 stars 162 forks source link

Redis connection lock #413

Closed digitaldan closed 1 year ago

digitaldan commented 1 year ago

We have an issue where race conditions between reconnects can cause issues. This implements a simple locking method with a sane timeout value to ensure only one connection can try connecting at a time. A client socket.io heartbeat (pings) keeps the lock alive. It expires after missing 2 heartbeats + 10 seconds or when the user disconnects.

Using redis in this way acutally has many more advantages we could use for routing and notifications in future commits.

digitaldan commented 1 year ago

@ssalonen ping here, i think this is good, i have tested on my system as much as i can, i inclined to give this a try in production soon, which i'm hoping will eliminate our split brain connections woes.

digitaldan commented 1 year ago

EDIT2: theoretical sequence of events leading race condition

Yes, that is a possibility. We could move the lock delete code into the mongo callback, that might better protect against this, but there lies the possibility that the mongodb result takes longer then the TTL of the key, and so the same problem.

I mentioned in the other PR my plan was to upgrade both mongodb and redis, which i have already started. Hopefully i can get this done by mid Feb. One of the features in upgrading redis is the GETEX call, which both retrieves the key and resets its TTL in one atomic call. This could be useful to at least extend the time of the key when we retrieve it and give a larger window for mongo to update . If it takes longer then a min for mongo to do that, we probably have bigger issues :-)

Thanks for taking a look, appreciate the second set of eyes.

digitaldan commented 1 year ago

Sorry, reading your post again you are talking about deleting the wrong lock. There is a small chance that between getting the lock and sending the del command, we could delete a different lock for sure. And yes, that might be a good case to use a small lua script to perform the action atomically. Its probably worth a look. I think i initially dismissed using lua as I assumed it was potentially expensive, but reading this makes be believe it may not be that bad https://www.coreforge.com/blog/2022/02/redis-lua-benchmark/

What i thought you were talking about was the disconnect code clobbering another connections record after releasing the lock, but i just checked that and the disconnect code only updates the DB if the connectionId of the calling code matches the connectionId in the mongo record, this is done atomically in mongo:

OpenhabSchema.statics.setOffline = function(connectionId, callback) {
    this.model('Openhab').findOneAndUpdate(
        { connectionId: connectionId },
        { $set: { status: 'offline', last_online: new Date()}},
        callback );
}

I recently added this, and then totally forgot about it 🤦

ssalonen commented 1 year ago

Yeah I think we are on the same page now.

Already with this PR, I would expect the probability of race conditions to be much lower!