tangledbytes / nodejs-snowflake

Generate time sortable 64 bits unique ids for distributed systems (inspired from twitter snowflake)
https://www.npmjs.com/package/nodejs-snowflake
Apache License 2.0
176 stars 15 forks source link

Concurrent bulk generation of Snowflake IDs from same machine is failing #22

Open anubhav-pandey1 opened 7 months ago

anubhav-pandey1 commented 7 months ago

If we try to bulk generate multiple snowflake IDs concurrently from the same machine, all the IDs are same:

const generateSnowflakeId = (machineId = null) => {
  const config = {
    instance_id: machineId,
    custom_epoch: new Date("2001-01-01").getTime(),
  };
  const uid = new Snowflake(config);
  const snowflakeId = uid.getUniqueID();
  return snowflakeId;
};

const COUNT = 10;
const CURRENT_MACHINE_ID = 1; // Assumed
const snowflakeIds = [];
for (let i = 0; i < COUNT; i++) {
  const snowflakeId = generateSnowflakeId(CURRENT_MACHINE_ID);
  snowflakeIds.push(objectId);
}

All the IDs are the same. Am I doing something incorrectly?

Node.js version: 16.17 and 18.17 Package version: 2.0.1

anubhav-pandey1 commented 7 months ago

It seems that the sequence ID is not being updated after creating the first snowflake. I'm making this hypothesis because deliberately blocking the thread for 1 ms using sleep(1) seems to do the trick for generating unique snowflakes.

This would make it a critical issue as the library would essentially stop performing well in highly-concurrent horizontally scaled environments.

anubhav-pandey1 commented 7 months ago

I've raised a PR that seems to fix the issue. Please feel free to use the code and mould it to suit your coding standards and style.

tangledbytes commented 7 months ago

Hi @anubhav-pandey1, I am not sure if I understand the issue here. One Snowflake instance is supposed to be used per machine. Each instance should get a unique machine ID.

In the above example, it seems that 10 instances of Snowflake objects are being created with the same machine ID.

Am I missing something here?

tangledbytes commented 7 months ago

I also have a test in the CI which tests exactly this to make sure we don't generate duplicate IDs: https://github.com/tangledbytes/nodejs-snowflake/blob/master/tests/snowflake_core.rs#L144

anubhav-pandey1 commented 7 months ago

If a single machine decides to bulk generate 10 or 1000 objects (for instance for a bulk insert SQL query for payments), they should have different identifiers - this is a fundamental requirement for any unique identification system. The Snowflake ID system is supposed to support this by having an atomically incremented (thread-safe) sequence number so that 1000s of Snowflake IDs can be concurrently bulk generated by the same machine.

If all the IDs generated by the same machine within the same millisecond are the same, then I think it is not a unique identifier (UID) and also not Snowflake.

anubhav-pandey1 commented 7 months ago

Also, your test seems to be running on a sub-second granularity in a single-threaded environment. This could be why it missed this issue.

High-throughput systems with high-concurrency would work on a sub-millisecond granularity ie. they can generate thousands of objects within the same millisecond. Such systems are usually multi-threaded too and might run into race conditions with the current version of Snowflake ID generation.

I would recommend using my shared Node.js code to attempt bulk generating many IDs concurrently and the issue will be replicated.

I've also raised a PR (#23) to attempt fixing this issue.

anubhav-pandey1 commented 7 months ago

Hey @tangledbytes, a gentle request to review the PR #23 and merge the fix if it is acceptable.

tangledbytes commented 7 months ago

Hi @anubhav-pandey1, how do you plan on adding multi threads to nodejs? If worker threads or other things are supposed to be used then each thread should gets its own instance. There is absolutely no reason to have this thread-safe because there is no reason to share states across different threads. Just let each thread have their own instance (different machine ids though). As for your code, from my perception, it is not the right way to use, the correct code would look like:

const CURRENT_MACHINE_ID = 1; // Assumed
const config = {
    instance_id: CURRENT_MACHINE_ID,
    custom_epoch: new Date("2001-01-01").getTime(),
};

const uid = new Snowflake(config);

const generateSnowflakeId = () => {
  const snowflakeId = uid.getUniqueID();
  return snowflakeId;
};

const COUNT = 10;

const snowflakeIds = [];
for (let i = 0; i < COUNT; i++) {
  const snowflakeId = generateSnowflakeId();
  snowflakeIds.push(objectId);
}

Also, I would like you to clarify your use case here.

Benjamin-Willard commented 5 months ago

Yeah I ran into this exact problem when designing our system. I would run into multiple ids of the same millisecond when creating records on the same database instance using AWS Lambda. This is why my database contains a server that gets created when a new instance is spun up or a new lambda source is being generated. I have there a specific server id that gets passed to the global Snowflake configuration of the instance id. After doing this I no longer had problems.

So what @tangledbytes suggested is correct. If you are using like nodejs cluster or something of that sort. Each "worker thread" should have it's own unique id that is generated per configuration instance_id.

Based on your setup, you would probably want to do something like a snowflake cache of a single instance per instance_id.

An example of what I have done is this. Using Sequelize as the database ORM.

let __SnowflakeInstance: Snowflake;
const SnowflakeGenerator = () => {
    if( !__SnowflakeInstance ) {
        __SnowflakeInstance = new Snowflake({
            custom_epoch: 1324512000000,
            instance_id: Reflect.get(global, '__SNOWFLAKE_INSTANCE_ID__') || 1
        })
    }
    return __SnowflakeInstance;
};

Then I have something like

bindInstanceId(instanceId: number) {
        Reflect.set(global, '__SNOWFLAKE_INSTANCE_ID__', instanceId);
    }

In my main DbSchema class object. I call this right after I initialize the database and load the server's configuration. In this case I then bind that instance.

Example of an server database entry that I have that uses two different instance ids.

Screenshot 2024-04-19 at 1 03 23 AM