simonlast / node-persist

Super-easy persistent data structures in Node.js
MIT License

Concurrency issues #31

Open joepie91 opened 9 years ago

joepie91 commented 9 years ago

When doing heavy concurrent writes/persists that progressively decrease in size, files can sometimes become corrupted as newer data only partially overwrites older data (as there is no locking mechanism in place).

I can write a testcase later if needed, as I'm currently rather busy, but I figured I'd file an issue before I would forget.

I did write a very hacky monkeypatch (in CoffeeScript) that at least ensures writes are queued:

writeQueue = []
currentlyWriting = false
_setItem = persist.setItem

persist.setItem = (key, value) ->
    new Promise (resolve, reject) ->
        _setItem.call(persist, key, value)
        addItemToQueue key, value, resolve, reject
        triggerWrite()

addItemToQueue = (key, value, resolveFunc, rejectFunc) ->
    writeQueue.push [key, value, resolveFunc, rejectFunc]

triggerWrite = ->
    if not currentlyWriting and writeQueue.length > 0
        currentlyWriting = true
        [key, value, resolveFunc, rejectFunc] = writeQueue.shift()

        Promise.resolve(persist.persistKey(key))
            .then (result) -> resolveFunc(result)
            .catch (err) -> rejectFunc(err)
            .finally ->
                currentlyWriting = false
                triggerWrite()

It's very inefficient - it doesn't attempt to 'deduplicate' queued writes and it always persists the last known version at the time of persisting, so under heavy concurrency it'll likely end up writing identical data many times... but at least it doesn't corrupt the files on disk :)

EDIT: For clarification, these monkeypatched methods I wrote are what caused me to discover the issue:

persist.addListItem = (key, item) ->
    newList = [item].concat (persist.getItem(key) ? [])

    persist.setItem key, newList

persist.removeListItem = (key, item) ->
    newList = (persist.getItem(key) ? [])
        .filter (existingItem) ->
            # keep everything that is NOT the item being removed
            return (item != existingItem)

    persist.setItem key, newList

persist.removeListItemByFilter = (key, filter) ->
    newList = (persist.getItem(key) ? [])
        .filter (item) ->
            return !filter(item)

    persist.setItem key, newList

Rapidly calling removeListItem several times (e.g. as would happen in a queue) would result in the aforementioned writes that progressively decrease in size.
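
A rough sketch of such a testcase (illustrative only; it uses the promise-based API, an arbitrary './repro-storage' dir and key name, and assumes node-persist keeps one JSON file per value in that dir):

import * as fs from 'fs';
import * as path from 'path';
import storage from 'node-persist';

const DIR = './repro-storage'; // illustrative storage dir

async function main() {
  await storage.init({ dir: DIR });

  // Fire off many overlapping writes whose payloads shrink each time,
  // mimicking removeListItem() being called rapidly on a long list.
  const writes: Promise<unknown>[] = [];
  for (let size = 500; size > 0; size--) {
    writes.push(storage.setItem('queue', new Array(size).fill('x')));
  }
  await Promise.all(writes);

  // A partially overwritten value file is no longer valid JSON.
  for (const file of fs.readdirSync(DIR)) {
    const raw = fs.readFileSync(path.join(DIR, file), 'utf8');
    try {
      JSON.parse(raw);
    } catch (err) {
      console.error(`corrupted file ${file}, trailing bytes: ${raw.slice(-60)}`);
    }
  }
}

main().catch(console.error);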

akhoury commented 9 years ago

setItem already returns a promise, why can't you use that? Wait for each promise to resolve, then call the next one. Or you could use setItemSync(), no wait, you wanted concurrency.

I'd be interested to see a test case when you have a chance (maybe paste a version of it in JavaScript :) as some of us can't easily read coffee).

joepie91 commented 9 years ago

Waiting on the returned promise is not practical - the setItem calls originate from different parts of the code (e.g. an Express application processing many requests at the same time), and it's unreasonable to expect every part of the code to be aware of the state of every other part of the code. That defeats the entire concept of modularity, which is why ensuring write safety should be a concern of the node-persist module.

I'll post a testcase (in JS) later - I mostly just dumped my Coffeescript code here because that's what my own codebase is written in, and it's better than nothing.

akhoury commented 9 years ago

@joepie91 no offense, but a major point of Node.js is for code execution to be asynchronous or non-blocking, since, by design, Node.js is single-threaded. So using the term "concurrent" is not very accurate for a single thread (unless you're talking about clusters on a multi-core system).

However, I think you indirectly bring up a good point: "node-persist" does not support "atomic transactions" - which is totally expected, since node-persist runs in the same thread as the application using it (whereas any other DB system has its own background service, e.g. mysql, redis, postgres, mongo, etc.).

That does not mean it's not possible; I think we could implement a node service to do that, or maybe just a queue system (which would be less efficient), but it's possible.

matthew-dean commented 9 years ago

Waiting on the returned promise is not practical

@joepie91 It sounds like you may want to build a queue system, and after adding to the queue, return a promise. Fulfill the promise when that item in the queue is processed. Of course, something like that could also be built into the core of node-persist, no?

matthew-dean commented 9 years ago

In other words, the calls to setItem could be asynchronous, but the database writes could be synchronous.
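
A rough sketch of that shape (illustrative only, not node-persist's API; queuedSetItem is a made-up wrapper name around the promise-based storage.setItem): callers get a promise back immediately, while the underlying writes are chained strictly one after another.

import storage from 'node-persist';

// Tail of the chain of pending writes; every new write waits for it.
let pending: Promise<unknown> = Promise.resolve();

// Asynchronous for the caller, but the actual disk writes run one at a time.
export function queuedSetItem(key: string, value: unknown): Promise<void> {
  const write = pending
    .catch(() => undefined) // an earlier failed write must not block later ones
    .then(() => storage.setItem(key, value))
    .then(() => undefined);
  pending = write;
  return write;
}

Each caller only awaits its own promise, so different parts of the app don't need to know about each other.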

matthew-dean commented 9 years ago

Oh wait, a queue system is what @akhoury said. Oops.

joepie91 commented 9 years ago

no offense, but a major point of Node.js is for code execution to be asynchronous or non-blocking, since, by design, Node.js is single-threaded. So using the term "concurrent" is not very accurate for a single thread (unless you're talking about clusters on a multi-core system).

I am referring to concurrent writes, which is a thing that Node.js absolutely does.

However, I think you indirectly bring up a good point: "node-persist" does not support "atomic transactions" - which is totally expected, since node-persist runs in the same thread as the application using it (whereas any other DB system has its own background service, e.g. mysql, redis, postgres, mongo, etc.).

This is a separate concern - regardless of whether it supports atomic transactions, it should ensure that it does not corrupt the file (and currently, it corrupts the file on concurrent writes).

That does not mean it's not possible; I think we could implement a node service to do that, or maybe just a queue system (which would be less efficient), but it's possible.

A batched queue could possibly be a solution; i.e. on every set, it schedules a batch write for the next tick (but not to execute before a possible previous write has completed). That way, you can still handle all sets that occur within the same tick in a single write, which mostly resolves the efficiency issue.
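
For the record, a minimal sketch of that batching idea (illustrative only, not node-persist internals; batchedSetItem and the module-level state are made up):

import storage from 'node-persist';

// Latest value queued per key since the last flush; later sets in the same
// batch simply overwrite earlier ones, so each key is written at most once.
const dirty = new Map<string, unknown>();
let scheduled: Promise<void> | null = null;
let lastFlush: Promise<void> = Promise.resolve();

function flush(): Promise<void> {
  // Run after the previous flush has finished, then write each dirty key once.
  lastFlush = lastFlush.then(async () => {
    scheduled = null;
    const batch = [...dirty.entries()];
    dirty.clear();
    for (const [key, value] of batch) {
      await storage.setItem(key, value);
    }
  });
  return lastFlush;
}

export function batchedSetItem(key: string, value: unknown): Promise<void> {
  dirty.set(key, value);
  if (!scheduled) {
    // Defer to the next turn of the event loop so that all sets from the
    // current tick end up in the same batch.
    scheduled = new Promise<void>((resolve) => setImmediate(() => resolve())).then(flush);
  }
  return scheduled;
}

Every set made in the same tick collapses into one write per key, and a new batch never starts before the previous one has finished.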

akhoury commented 9 years ago

(and currently, it corrupts the file on concurrent writes)

when it shouldn't, fair enough.

@joepie91 I like the batched-queue idea. I am currently on vacation for another month; if no one beats me to it, I might take a stab at it when I get back.

millette commented 8 years ago

As a general rule, files are hard to deal with.

creativefctr commented 7 years ago

This is frustrating! Using getItemSync is the dirty way, which blocks the entire application! I think the least that could be done is to create an internal queue inside node-persist for non-synced write operations, guaranteeing that write operations on a particular key are sequential. That would handle this situation internally, and calls from all contexts would be safe.
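
Something along these lines, perhaps (a sketch only, not actual node-persist code; setItemSafe is a made-up name): keep one promise chain per key, so writes to the same key can never interleave while writes to different keys still run concurrently.

import storage from 'node-persist';

// One pending-write chain per key.
const chains = new Map<string, Promise<unknown>>();

export function setItemSafe(key: string, value: unknown): Promise<void> {
  const prev = chains.get(key) ?? Promise.resolve();
  const next = prev
    .catch(() => undefined) // don't let one failed write wedge the chain
    .then(() => storage.setItem(key, value))
    .then(() => undefined);
  chains.set(key, next);
  return next;
}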

akhoury commented 7 years ago

At this point, I am not so sure a queue is enough; we might need some job-processing mechanism that also spawns child processes, writes a batch, and then closes them? But that seems like overkill. Also, batch writing means that if the app crashes, some write requests may get lost.

Also, that doesn't solve the issue with different apps using the same storage dir.

The best solution is to run as a separate service with the same Node API to talk to, but then we're implementing a database engine, which I really don't want to do.

Really, node-persist is not designed for this; it's more like the browser's localStorage. If you want a database, I think you should use a database.

But going back to the original issue filed, a smart queue might solve the OP's issue ONLY.

creativefctr commented 7 years ago

@akhoury I don't think we need anything more than what this issue asks for. Everyone knows what node-persist is, and I don't think anybody expects it to be write-safe with respect to external writes. But being write-safe within the same app, given Node's asynchronous nature, looks like a necessary feature. For my own simple API server, I was forced to use the sync methods to prevent a disaster! In fact, when writing the code I assumed node-persist would handle this scenario internally; I only found this issue by accident while looking up something else, and had to change some lines to the sync calls (I expected this to be handled as a basic internal feature).

musicin3d commented 7 years ago

I'm also having a problem with this. I chose node-persist for a small project, thinking it would be the fastest way to get up and running. Ran into a corrupted file while running tests.

If we're undecided on how to address it, perhaps we could at least add a warning to the docs?

Edit: It's important to stay on task here. All other concerns aside (e.g. collisions with other apps), node-persist should be able to ensure that a basic application doesn't step on its own toes when it calls setItem().

mgttt commented 7 years ago

I wrote a solution FYI at #93

zdila commented 6 years ago

I've implemented serializing of get and set operations per key. It is in TypeScript and requires Node 10 because it uses asynchronous generators, but you can transpile it for older Node versions.

import storage from 'node-persist';

export function createStorage<T>(name: string) {

  interface ISetOperation {
    type: 'set';
    data: T;
  }

  interface IGetOperation {
    type: 'get';
  }

  type Union = ISetOperation | IGetOperation;

  const executor = createExecutor();
  // Prime the generator so it is parked at the first `yield`, ready to receive operations.
  executor.next();

  async function get() {
    return (await executor.next({ type: 'get' })).value as T;
  }

  async function set(data: T) {
    await executor.next({ type: 'set', data });
  }

  async function* createExecutor(): AsyncIterableIterator<T | undefined> {
    let result;

    for (;;) {
      const item: Union = yield result;

      switch (item.type) {
        case 'get':
          result = await storage.getItem(name);
          break;
        case 'set':
          await storage.setItem(name, item.data);
          result = undefined;
          break;
      }
    }
  }

  return { get, set };
}

Example of use:

import { IHttpLog } from '~/types';
import { createStorage } from '~/storage';

const httpLogsStorage = createStorage<IHttpLog[]>('httpLogs');

export async function loadHttpLogs() {
  return await httpLogsStorage.get();
}

export async function saveHttpLogs(logs: IHttpLog[]) {
  await httpLogsStorage.set(logs);
}

rasgo-cc commented 4 years ago

Hitting the same issue. Maybe node-persist could make use of bottleneck internally? https://www.npmjs.com/package/bottleneck

thelaughingwolf commented 4 years ago

Also encountered this issue. I was also using node-persist as a sort of queuing mechanism, and I got a corrupted JSON file: {"key":"downstream.updates.zendesk-ticket-fields.hubspot-company-changed","value":{"##########":{"zip":"*****"}}}*****"}}} Looks like a previous update had changed the address, a later update changed the zip, and node-persist didn't fully replace the file.

rasgo-cc commented 4 years ago

I don't have time right now to create a PR, but in my case I apparently fixed the issue by using Bottleneck. All you have to do is wrap the functions with a limiter. Something like this:

import Bottleneck from 'bottleneck';
import storage from 'node-persist';

// CacheOptions is this codebase's own options shape for setItem (not shown here).
type CacheOptions = Record<string, any>;

const limiter = new Bottleneck({
  minTime: 1,
  maxConcurrent: 1, // at most one storage operation runs at a time
});

export async function createCache(cachePath: string) {
  await storage.init({
    dir: cachePath,
    ttl: 1000 * 60 * 60 * 24, // 24hrs
  });
}

export const put = limiter.wrap(
  async (key: string, value: any, opts: CacheOptions = {}) => {
    return await storage.setItem(key, value, opts as any);
  }
);

export const get = limiter.wrap(async (key: string) => {
  const value = await storage.getItem(key);
  return value;
});