swarthy / redis-semaphore

Distributed mutex and semaphore based on Redis
MIT License

Losing locks in distributed system #205

Closed · elliott-with-the-longest-name-on-github closed this issue 6 months ago

elliott-with-the-longest-name-on-github commented 6 months ago

I have about 32 workers running the following process:

const sema = new Semaphore(
  redisClient,
  'my-key',
  100, // allow up to 100 concurrent holders
  {
    acquireTimeout: 30 * 60 * 1000, // give up acquiring after 30 minutes
    lockTimeout: 10 * 60 * 1000,    // each lease expires after 10 minutes
    refreshInterval: 1 * 60 * 1000, // refresh the lease every minute
  },
);

const transferItem = async (key: string): Promise<void> => {
  try {
    await sema.acquire();
    return await asyncWorkTakingApprox100ms(key);
  } finally {
    // fire-and-forget: the release promise is deliberately not awaited
    void sema.release();
  }
};

// later, inside my SQS event consumer
const chunks = chunkIdentifiers(itemIdentifiers);
for await (const chunk of chunks) {
  await Promise.all(chunk.map((item) => transferItem(item)));
}
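
(chunkIdentifiers is just a plain batching helper; a minimal sketch of what it does, assuming simple fixed-size batching, with an illustrative batch size of 100:)

// Sketch of chunkIdentifiers: split the ID list into fixed-size batches.
// The batch size here is illustrative, not the real value.
const chunkIdentifiers = (ids: string[], size = 100): string[][] => {
  const batches: string[][] = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
};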

Whenever contention for the semaphore gets high, all of the workers crash simultaneously with LostLockError. I can't for the life of me figure out what's going on. I increased the lock timeout a ton in case a transfer operation was somehow outliving its lease, but with a 10 minute lockTimeout refreshed every minute a holder would have to miss ten consecutive refreshes before expiring, and these crashes happen within seconds of a load spike, so there's no way that's it. I'm happy to provide more information, but I'm not sure what would be useful.
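
One thing I'm not sure about: release() is fire-and-forget in my code. Here's a more defensive variant I could switch to, awaiting the release and only releasing after a successful acquire (the acquired flag is mine, purely for illustration):

const transferItemDefensive = async (key: string): Promise<void> => {
  let acquired = false;
  try {
    await sema.acquire();
    acquired = true;
    return await asyncWorkTakingApprox100ms(key);
  } finally {
    // Only release a permit we actually hold, and await the release so
    // any error surfaces here instead of as an unhandled rejection.
    if (acquired) {
      await sema.release();
    }
  }
};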

elliott-with-the-longest-name-on-github commented 6 months ago

I can reliably reproduce with this script:

import { Semaphore } from 'redis-semaphore';
import Redis from 'ioredis';

const client = new Redis();
const sema = new Semaphore(client, 'asdf', 25, {
  acquireTimeout: 30 * 60 * 1000,
  lockTimeout: 2_000,
  refreshInterval: 100,
});

const transferItem = async () => {
  try {
    await sema.acquire();
    console.log('processed');
    return await new Promise((resolve) => setTimeout(resolve, 1_000));
  } finally {
    void sema.release();
  }
};

await Promise.all(Array.from({ length: 500 }).map(() => transferItem()));
console.log('done');
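
And a variant that counts lost leases instead of crashing, using the onLockLost option (going by the README; treat the exact signature as my assumption), in case the numbers help:

// Same repro, but route lost leases through onLockLost so the process
// survives and we can tally failures instead of crashing on the first one.
let lost = 0;
const countingSema = new Semaphore(client, 'asdf-counting', 25, {
  acquireTimeout: 30 * 60 * 1000,
  lockTimeout: 2_000,
  refreshInterval: 100,
  onLockLost: () => {
    lost += 1;
  },
});

const countedTransfer = async () => {
  try {
    await countingSema.acquire();
    return await new Promise((resolve) => setTimeout(resolve, 1_000));
  } finally {
    void countingSema.release();
  }
};

await Promise.all(Array.from({ length: 500 }).map(() => countedTransfer()));
console.log(`done, lost ${lost} leases`);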