npm / lockfile

A very polite lock file utility, which endeavors to not litter, and to wait patiently for others.

Library does not seem to fulfill contract #27

Closed ORESoftware closed 7 years ago

ORESoftware commented 7 years ago

I was using this library for a bit, but it seems to cause problems when I make a lot of concurrent requests to get a lock on the same file. I get a lot of "file is already open" errors. Maybe the user (me) should handle those errors and do manual retries, but I was hoping that lockfile would do that tricky work for me.

So in order to get a more robust solution I started working on a mutex system with websockets. I ran a performance test, and lockfile is apparently WAY faster than websockets. It's so fast that I am starting to doubt that it is actually working properly. lockfile can process 300 serial lock/unlock requests in less than 30ms, which seems way too fast to be possible. These are in series! That means it goes through a lock/unlock cycle in 30ms / 300 locks, which is 1/10 of a ms! Seems crazy. When Live-Mutex does the same, it takes ~500ms; that's 500ms / 300 cycles, or about 1.7ms per cycle, which seems much more normal.

Despite being fast, lockfile totally chokes immediately when these 300 serial requests are turned into concurrent/parallel requests. So I am very confused - lockfile seems impossibly fast, but also clearly chokes in totally normal circumstances, circumstances that require locking (aka non-serial access to a resource)!

I created the (speed) tests in this project:

https://github.com/ORESoftware/locking-lib-comparison

If you are curious, you can check it out:

git clone git@github.com:ORESoftware/locking-lib-comparison.git
cd locking-lib-comparison
npm install
./test.sh

The speed tests are pretty rudimentary, but basically I am bemused about why lockfile runs so fast when running through locks serially, and why it barfs so quickly when trying to gain access to a lock concurrently.

The whole point of locking is to manage concurrent access to a resource, which means the lock itself will experience concurrent requests!

Live-Mutex is slower, but apparently more robust?

isaacs commented 7 years ago

This module is performing correctly.

You're not providing a numeric wait option, so effectively you're saying "Acquire this lock, and fail if you cannot do that immediately".

Since the file system moves very fast when operating on a single file in rapid succession, Node's fs thread pool is limited, and open and unlink are both atomic actions, it actually unlinks fast enough for the next lock to be acquired when you do it in series, and there are no unnecessary waits. What you're testing is how fast your disk can create and remove entries in the file name table. On my Macbook Pro with an SSD, that's about 300 in ~275ms, which is reasonable since the fs will cache it after the first few passes.
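For a sense of what that serial test actually measures, here is a minimal synchronous sketch (mine, not from this thread) that times the raw fs calls a file-based lock reduces to; the 'wx' flag is the atomic O_CREAT|O_EXCL open:

const fs = require('fs');

const file = '/tmp/speed-test.lock';   // hypothetical path
const start = Date.now();

for (let i = 0; i < 300; i++) {
  // 'wx' fails if the file already exists, which is the atomic "acquire";
  // unlink is the "release"
  const fd = fs.openSync(file, 'wx');
  fs.closeSync(fd);
  fs.unlinkSync(file);
}

console.log(' => ' + (Date.now() - start) + 'ms for 300 open/unlink cycles');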

However, if you try to open 300 locks at once, then what's happening is that the lockfile code says, "Welp, I couldn't acquire a lock, and you haven't told me how long to wait, so I guess you needed it right now, and error."

If you make this change:

-    lf.lock(file, function (err) {
+    lf.lock(file, { wait: Infinity }, function (err) {

then you'll see that it takes much longer. This is because of the thundering herd problem. On the first pass, the system does 300 open(2) calls, 299 of which fail and do a setTimeout to retry in 100ms. 100ms later, 299 open attempts, 298 of which fail, and wait another 100ms. It might be a good idea to provide a randomized wait time, just to reduce the thunder a little bit, but it's probably unlikely that in any real world use case you'll try to acquire 300 locks all in the same tick of the event loop, and again, node's limited fs pool serializes it a little bit anyway.
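As an aside, that randomized wait might look something like this (a hypothetical sketch; lockfile itself polls on a fixed interval and does not offer a jitter option):

const lf = require('lockfile');

// Retry after base/2..base*1.5 of random delay so the 299 losers don't all
// wake up and call open(2) again in the same tick. Retries on any error
// and has no retry cap, purely for brevity.
function lockWithJitter(file, base, cb) {
  lf.lock(file, {}, function (err) {
    if (!err) return cb(null);
    const delay = base / 2 + Math.random() * base;
    setTimeout(function () { lockWithJitter(file, base, cb); }, delay);
  });
}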

So, there's roughly 300 * 100ms of time spent just sitting around, and it ends up taking about 30s total. (Give or take, because the exact length of a setTimeout is somewhat fuzzy, and it's possible that if the unlink comes right before an open it won't have to wait exactly that number of times. On my system, I see it take up to 35s, or as low as 25s.) If we also add pollPeriod: 1 to the options, then it gets done in about 2s instead of 30.
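That is, the combination being described:

    lf.lock(file, { wait: Infinity, pollPeriod: 1 }, function (err) {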

Note that the 2s in the pollPeriod: 1 case represents a tremendous amount of waste! If anything, this is really slow. The underlying fs operations take on the order of tens of milliseconds total, and we should be introducing only 300ms of artificial wait time, so that extra second or so is just wasted time.

If you really need this thing to be fast, I'd recommend rewriting it in rust or c.

Also note that wait: Infinity is a really bad idea, especially without a stale value. If a process crashes or something, you don't want to be waiting around in a hung state forever while the file never gets removed.
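A bounded configuration along these lines avoids that hang (the numbers here are illustrative only):

    lf.lock(file, { wait: 10000, stale: 30000 }, function (err) {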

isaacs commented 7 years ago

Here's another approach that's faster than using a wait value:

    lf.lock(file, { retries: 1000 }, function (err) {

You can throw a NODE_DEBUG=lockfile into the environment to have it log all the things it does.
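For example (the script name is a placeholder):

$ NODE_DEBUG=lockfile node your-test.js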

ORESoftware commented 7 years ago

ok, I assume a combination of upping the wait value and the retries value would be ideal?

ORESoftware commented 7 years ago

@isaacs I can get it to work like so:


const path = require('path');
const async = require('async');
const lf = require('lockfile');

const a = Array.apply(null, {length: 300});    // 300 concurrent lock requests
const file = path.resolve(process.env.HOME + '/speed-test.lock');

const start = Date.now();

async.each(a, function (val, cb) {

    lf.lock(file, {wait: 3000, retries: 5, stale: 50}, function (err) {
        if (err) {
            cb(err);
        }
        else {
            console.log('unlocking...');
            lf.unlock(file, cb);
        }
    });

}, function complete(err) {

    if (err) {
        throw err;
    }

    console.log(' => Time required for lockfile => ', Date.now() - start);
    process.exit(0);

});

Pulling no punches: the amount of fine-tuning needed to get it to work without errors means that a failure can happen in any real dynamic system (aka real life).

Furthermore, with 300 requests happening in parallel, lockfile starts taking upwards of 8 seconds to complete (just run the code above). If you can add parameters/options to the above to make it perform better, I'd be really interested in what those are.

Frankly, I don't care about performance; I really want something robust. I assume NPM uses this library, so I am willing to admit that I might just be using it wrong or misusing it for the wrong purpose, but this simple test is pretty vanilla and should work just fine, and yet it's either slow or error-prone or both.

I assumed that the library provided "sensible defaults" for wait etc.; I guess pollPeriod has a default, but wait does not.

At the very least, doesn't it make sense to have like 10ms or 20ms as a sensible default for "wait"?

ORESoftware commented 7 years ago

@isaacs, you're right, if I reduce wait to 0 and use plenty of retries, lockfile can match the performance of live-mutex for lower numbers of concurrent requests.

I officially released live-mutex yesterday:

https://medium.com/@the1mills/a-better-mutex-for-node-js-4b4897fd9f11

NPM lockfile performs better than some other locking libraries I have tried when it comes to high-concurrency lock requests.

But live-mutex is designed to work much better than anything else I have found, and doesn't use any polling as part of the implementation. I recommend checking it out.

ORESoftware commented 7 years ago

But you will notice that the performance of lockfile degrades as you go from 100 concurrent requests to 1000. This is expected, since filesystem locking forces it to poll; I am not saying the implementation is poor, it's just the nature of the library.


const path = require('path');
const async = require('async');
const lf = require('lockfile');

const a = Array.apply(null, {length: 1000});    // 1000 concurrent requests
const file = path.resolve(process.env.HOME + '/speed-test.lock');

const start = Date.now();

let i = 0;

async.each(a, function (val, cb) {

    lf.lock(file, {wait: 0, retries: 5000, stale: 50}, function (err) {
        if (err) {
            cb(err);
        }
        else {
            console.log('unlocking...' + i++);
            lf.unlock(file, cb);
        }
    });

}, function complete(err) {

    if (err) {
        throw err;
    }

    console.log(' => Time required for lockfile => ', Date.now() - start);
    process.exit(0);

});

compare that with live-mutex:

'use strict';

const async = require('async');

const lmUtils = require('live-mutex/utils');
const Client = require('live-mutex/client');

const conf = Object.freeze({port: 7003});

lmUtils.launchBrokerInChildProcess(conf, function () {

    const client = new Client(conf);

    client.ensure().then(function () {

        const a = Array.apply(null, {length: 1000});   // 1000 concurrent requests
        const start = Date.now();

        var i = 0;
        async.each(a, function (val, cb) {

            client.lock('foo', function (err, unlock) {
                if (err) {
                    cb(err);
                }
                else {
                    // console.log('unlocking...' + i++);
                    unlock(cb);
                }
            });

        }, function complete(err) {

            if (err) {
                throw err;
            }

            console.log(' => Time required for live-mutex => ', Date.now() - start);
            process.exit(0);
        });

    });

});

~5000ms for lockfile, ~500ms for live-mutex.

isaacs commented 7 years ago

Neat idea. Having an explicitly set-up server won't work for some of the applications where lockfile has to operate. I'd toyed with the idea of having a lockfile actually be a server listening on a socket, so the first thing that acquires the lock does so as a server, and then subsequent attempts get in a queue and get notified when it becomes available. I never got around to adding tests and error handling; I should finish it up.
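To make that concrete, here is a rough sketch of the idea (mine, not slocket's actual code): whoever binds the unix socket first owns the lock, and everyone else connects and waits in the owner's queue. Stale sockets, crashes, and races are deliberately ignored here.

const net = require('net');
const fs = require('fs');

function acquire(sockPath, cb) {
  const server = net.createServer();

  server.on('error', function (err) {
    if (err.code !== 'EADDRINUSE') return cb(err);
    // Someone else owns the lock: connect and wait for a byte, which
    // signals that it is our turn. Ending the connection releases it.
    const conn = net.connect(sockPath);
    conn.once('data', function () {
      cb(null, function release() { conn.end(); });
    });
    conn.once('error', cb);
  });

  server.listen(sockPath, function () {
    // We bound the socket first, so we hold the lock immediately.
    const waiters = [];

    function grantNext() {
      if (waiters.length === 0) {
        server.close();
        try { fs.unlinkSync(sockPath); } catch (e) {}   // remove the "lockfile"
        return;
      }
      const next = waiters.shift();
      next.write('1');                   // hand the lock to this waiter...
      next.once('close', grantNext);     // ...and pass it on when they finish
    }

    server.on('connection', function (conn) { waiters.push(conn); });

    cb(null, function release() { grantNext(); });
  });
}

Usage would be acquire('/tmp/foo.lock.sock', function (err, release) { ... release(); }), with the socket path doubling as the lock name.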

Filesystem locks are nice for a variety of reasons. Very straightforward and effective. The problem with this library is that it's overly configurable, imo.

isaacs commented 7 years ago

Found it! It had a clever name, even. I just never got around to writing tests; it mostly already worked.

Running the pound and many-lock-unlock tests on my machine, I get this:

$ bash pound/t.sh

parallel
real    0m8.566s
user    0m10.100s
sys 0m3.073s

serial
real    0m9.005s
user    0m9.780s
sys 0m3.051s

lf-parallel
real    0m38.027s
user    1m9.661s
sys 1m5.398s

lf-serial
real    0m39.566s
user    1m12.766s
sys 1m8.448s

$ node pound/many-lock-unlock.js
parallel 1000/498ms => 2008 q/s, 0.498 ms/q
serial 1000/251ms => 3984 q/s, 0.251 ms/q
lf parallel 1000/16241ms => 62 q/s, 16.241 ms/q
lfs 1000/349ms => 2865 q/s, 0.349 ms/q

So, yeah, retries on fs-based lockfiles do not play nicely with concurrent lock contention! I'll probably write some tests for slocket soon. Similar idea to your module, but without the setup step. Of course, that means you lose some speed in some cases because lock contention at the server setup phase is going to be a bit slower. And there's a weird thing that happens where a server dies, and then the connection that had the lock becomes the new master (or a new round of contention begins).

isaacs commented 7 years ago

Er, the link might help: https://github.com/isaacs/slocket http://npm.im/slocket

ORESoftware commented 7 years ago

Cool, I will check it out. I am actually surprised that I can't get more performance out of my current setup. I can only run about 2 lock/unlock cycles per millisecond, which is not that fast; not sure why yet. Moving from require('ws') to require('uws') sped things up about 3x, although I am told that certain ws settings bring it up to par with uws.

isaacs commented 7 years ago

Why use ws at all? Why not just vanilla tcp?

ORESoftware commented 7 years ago

Maybe when I started working on it I thought it could be used in the browser too, but yeah, you're probably right! Vanilla TCP would probably be much faster.