swarthy / redis-semaphore

Distributed mutex and semaphore based on Redis
MIT License

Losing locks in distributed system #205

Closed · elliott-with-the-longest-name-on-github closed this issue 6 months ago

elliott-with-the-longest-name-on-github commented 6 months ago

I have about 32 workers running the following process:

const sema = new Semaphore(
  redisClient,
  'my-key',
  100, // allow up to 100 concurrent holders
  {
    acquireTimeout: 30 * 60 * 1000, // give up acquiring after 30 minutes
    lockTimeout: 10 * 60 * 1000,    // each lease expires after 10 minutes
    refreshInterval: 1 * 60 * 1000, // refresh the lease every minute
  },
);

const transferItem = async (key: string): Promise<void> => {
  try {
    await sema.acquire();
    return await asyncWorkTakingApprox100ms(key);
  } finally {
    // fire-and-forget: the release promise is deliberately not awaited
    void sema.release();
  }
};

// later, inside my SQS event consumer
const chunks = chunkIdentifiers(itemIdentifiers);
for await (const chunk of chunks) {
  await Promise.all(chunk.map((item) => transferItem(item)));
}
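
(chunkIdentifiers is just a plain batching helper; a minimal sketch of what it does, assuming simple fixed-size batching, with an illustrative batch size of 100:)

// Sketch of chunkIdentifiers: split the ID list into fixed-size batches.
// The batch size here is illustrative, not the real value.
const chunkIdentifiers = (ids: string[], size = 100): string[][] => {
  const batches: string[][] = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
};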

Whenever contention for the semaphore gets high, all of the workers crash simultaneously with LostLockError. I can't for the life of me figure out what's going on. I increased the lock timeout a ton in case a transfer operation was somehow outliving its lease, but with a 10 minute lockTimeout refreshed every minute a holder would have to miss ten consecutive refreshes before expiring, and these crashes happen within seconds of a load spike, so there's no way that's it. I'm happy to provide more information, but I'm not sure what would be useful.
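
One thing I'm not sure about: release() is fire-and-forget in my code. Here's a more defensive variant I could switch to, awaiting the release and only releasing after a successful acquire (the acquired flag is mine, purely for illustration):

const transferItemDefensive = async (key: string): Promise<void> => {
  let acquired = false;
  try {
    await sema.acquire();
    acquired = true;
    return await asyncWorkTakingApprox100ms(key);
  } finally {
    // Only release a permit we actually hold, and await the release so
    // any error surfaces here instead of as an unhandled rejection.
    if (acquired) {
      await sema.release();
    }
  }
};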

elliott-with-the-longest-name-on-github commented 6 months ago

I can reliably reproduce with this script:

import { Semaphore } from 'redis-semaphore';
import Redis from 'ioredis';

const client = new Redis();
const sema = new Semaphore(client, 'asdf', 25, {
  acquireTimeout: 30 * 60 * 1000,
  lockTimeout: 2_000,
  refreshInterval: 100,
});

const transferItem = async () => {
  try {
    await sema.acquire();
    console.log('processed');
    return await new Promise((resolve) => setTimeout(resolve, 1_000));
  } finally {
    void sema.release();
  }
};

await Promise.all(Array.from({ length: 500 }).map(() => transferItem()));
console.log('done');
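
And a variant that counts lost leases instead of crashing, using the onLockLost option (going by the README; treat the exact signature as my assumption), in case the numbers help:

// Same repro, but route lost leases through onLockLost so the process
// survives and we can tally failures instead of crashing on the first one.
let lost = 0;
const countingSema = new Semaphore(client, 'asdf-counting', 25, {
  acquireTimeout: 30 * 60 * 1000,
  lockTimeout: 2_000,
  refreshInterval: 100,
  onLockLost: () => {
    lost += 1;
  },
});

const countedTransfer = async () => {
  try {
    await countingSema.acquire();
    return await new Promise((resolve) => setTimeout(resolve, 1_000));
  } finally {
    void countingSema.release();
  }
};

await Promise.all(Array.from({ length: 500 }).map(() => countedTransfer()));
console.log(`done, lost ${lost} leases`);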