paralleldrive / cuid2

Next generation guids. Secure, collision-resistant ids optimized for horizontal scaling and performance.
MIT License
2.64k stars 54 forks source link

Help with a port to ColdFusion #24

Closed bennadel closed 1 year ago

bennadel commented 1 year ago

@ericelliott I'm working on a port of cuid2 to ColdFusion and I wanted to ask about the length of the generated value. Given a shared instance of a Cuid class, should all of the cuid tokens generated from said class be of the same length? In my implementation, they vary by 1; and I'm not sure if that's the way it's supposed to be; or, if I have a bug in my implementation.

I saw that the length can be parameterized; but, I wasn't sure of the expectation for a shared instance of a Cuid class.

I know I have bugs because my values won't fit inside varchar(24). Just not sure how many bugs I have at this time 😆

geommr commented 1 year ago

I think that's the intended usage, see this example (all 3 have length of 24):

import { createId } from '@paralleldrive/cuid2';

const ids = [
  createId(), // 'tz4a98xxat96iws9zmbrgj3a'
  createId(), // 'pfh0haxfpzowht3oi213cqos'
  createId(), // 'nc6bzmkmd014706rfda898to'
];
bennadel commented 1 year ago

Ok, thanks! Time to start debugging (rolls up sleeves).

bennadel commented 1 year ago

Oh, I see it, I missed the .slice( 1, length ) that he has in the generated value.

bennadel commented 1 year ago

@ericelliott what is the significance of using a Prime number in the entropy generation? Just curious to add a comment in my ported component.

bennadel commented 1 year ago

I've got something working for my CFML port -- https://github.com/bennadel/CUID2-For-ColdFusion

xaevik commented 1 year ago

@ericelliott what is the significance of using a Prime number in the entropy generation? Just curious to add a comment in my ported component.

@bennadel this should provide context as I asked the same question when I built the .NET port: https://github.com/xaevik/cuid.net/issues/10#issuecomment-1368265133

bennadel commented 1 year ago

@xaevik Thank you for pointing me to that -- in reading your comment, I realized that I missed some of the logic in the entropy generation - I didn't realize there was a multiplication at play. I'm not sure that I follow all of the logic, or if part of that was only necessary for the JavaScript version; but, I will go back and re-read with closer detail.

bennadel commented 1 year ago

@xaevik oh wait, after a closer look at my own code, I think I am doing that with a randRange() call. Instead of using the Math.floor(random() * prime), I do randRange( 0, prime ), which I think should accomplish the same thing.

bennadel commented 1 year ago

The other thing I'm not quite clear on it why the "generate entropy" is base36-encoding its value. Every time the entropy part is called, the result is eventually used as part of another base36-encoding. As such, we essentially end-up encoding the value multiple times. Perhaps this is to turn the limited number-space (10 digits) into an alpha-numeric space (36 characters) prior to hashing?

I'm also thinking that since I'm already in the Java space, maybe I don't have to jump through as many hoops. Can't I just get random bytes using Java cryptographically secure generation without having to worry about prime numbers? I wish I understood security better.

xaevik commented 1 year ago

@bennadel

The other thing I'm not quite clear on it why the "generate entropy" is base36-encoding its value. Every time the entropy part is called, the result is eventually used as part of another base36-encoding. As such, we essentially end-up encoding the value multiple times. Perhaps this is to turn the limited number-space (10 digits) into an alpha-numeric space (36 characters) prior to hashing?

You do not necessarily have to go this route. If you look at my .NET implementation, I return a byte array every time a request for Entropy is called. I also do something similar with my PHP implementation.

I also store each piece of the CUID to their nearest primitive during instantiation. This makes the creation of the SHA-3 hash easier since as methods from the BouncyCastle library accept byte arrays (.NET does not support SHA-2 natively so a third party library was necessary).

I'm also thinking that since I'm already in the Java space, maybe I don't have to jump through as many hoops. Can't I just get random bytes using Java cryptographically secure generation without having to worry about prime numbers? I wish I understood security better.

If you read further into the comments between Eric and I in that issue I linked, he explains his reasonings for the prime numbers (at the time) and you'll see I chose the RandomNumberGenerator.GetBytes) method for generation of entropy rather than prime numbers. Under the covers, .NET will select the strongest cryptographic source (CSPRNG or /dev/urandom). If you have such an option available to you in CF, it is better to use it.

bennadel commented 1 year ago

@xaevik thank you for the feedback - not just about the random bytes, but also about hashing bytes. ColdFusion can do both - it's hash() function can accept either strings or binary values. I will see what it looks like to update this approach!

bennadel commented 1 year ago

@xaevik I'm looking at the PHP implementation here:

https://github.com/xaevik/php-cuid2/blob/eacc8eed61d772069a4040413ea79888cedef2f6/src/Cuid2.php#L84-L97

... and I see that both salt and random are each a random collection of 32-bytes. This begs two questions:

  1. Why not just have a single value that is 64-bytes long?
  2. Does the order of inputs into the message-digest / hash actually matter? If they are, in aggregate, all just sources of entropy that are being hashed together, then the order in which the parts are concatenated (prior to hashing) shouldn't really matter, right?

I'm trying to re-implement using Java' MessageDigest class, and I think this might be the same:

private string function generateHashBlock() {

    var inputs = MessageDigestClass.getInstance( "sha-256" );

    // Cast numeric inputs to strings, then get underlying bytes.
    inputs.update( charsetDecode( getTickCount(), "utf-8" ) ); // TIME.
    inputs.update( charsetDecode( counter.getAndIncrement(), "utf-8" ) ); // COUNTER.
    inputs.update( charsetDecode( processFingerprint, "utf-8" ) ); // FINGERPRINT.

    // Random bytes generated by "SecureRandom" class.
    inputs.update( generateBytes( cuidLength * 2 ) ); // Salt + Random??????

    var result = BigIntegerClass
        .init( inputs.digest() ) // PERFORM THE ACTUAL HASH OF ALL THE INPUTS.
        .toString( 36 )
    ;

    return( result.right( - 2 ) );
}

I think I should be able to change the order in which I call the various inputs.update(...) executions, and the hash should still be unique???

bennadel commented 1 year ago

@xaevik Sorry, when I say that the "order doesn't matter", I did not mean to imply that the outputted value would be the same in both cases—I only meant that both values would be "equivalently unique". I'm thinking about it like a Coin flip. Imagine that I had to flip a coin 5 times and could get either Head (H) or Tails (T). Maybe this is a poor analogy, but if I flip it 5 times and get:

H T H H T

That has some percentage likelihood of happening. But, if I just arbitrarily swap the orders and take the same values and rearrange them as:

H H H T T

... that's a different output, but it' still the same likelihood as the other output since each part has the same randomness (50/50). As such, the first is no more or less random than the second.

Extending that analogy to the hashing function, I would assume (I don't know much of anything about how hashing actually works) that the order of the inputs may produce a different value, but not a value with different uniqueness. But, again, I don't know that much about hashing or security.

xaevik commented 1 year ago

@bennadel I removed my answer to your original query as it made some erroneous assumptions. To put it a bit more succinctly, I followed the order as implemented here to align properly with the specification. My knowledge in this sector is basic so I usually try not to deviate from established patterns.

bennadel commented 1 year ago

@xaevik ha ha, I know so little about this stuff 😆 I too was trying to follow @ericelliott 's implementation. But, going back to your other conversation with him, he mentioned that he needed something that worked in JavaScript + Node. Since we're using more robust (ie, older) platforms, we might not have the same constraints that he had.

xaevik commented 1 year ago

@bennadel Correct, we have access to stronger sources of entropy, however I am watching another issue for the gentlemen developing the rust implementation of cuid2 as he and I both have the same concerns with about the fingerprinting given the scope of platforms both our languages support (.NET).

bennadel commented 1 year ago

I've posed the question to my security team at work to see if they have any insight. Here's what I asked:

This is a random question. I’m working on a ColdFusion version of the CUID2 library to generate Globally Unique IDs, and part of the algorithm uses secure hashing of various sources of entropy. For the sake of discussion, let’s assume that a CUID ID is generated via:

toBase32( sha256( A, B, C, D ) )

Where A, B, C, and D are all sources of entropy (ex, counter, process fingerprint, random bytes). My question is thus: Does the order of the hash inputs matter? Meaning is:

sha256( A, B, C, D ) … just as unique as sha256( D, C, B, A ) … which is just as unique a sha256( B, A, D, C )

Or, is there something I’m fundamentally misunderstanding about how hashing works?

To be clear, I’m not suggesting that the various hashes produce the same value - I understand that all the values will be different. But, I’m questioning whether or not they will all have the same uniqueness characteristics?

bennadel commented 1 year ago

From one of the people on my security team (answering to my question on order):

I don't think the order of the sources matter. The only limitations on the uniqueness characteristics should depend on the quality of the entropy sources, i.e. real random vs pseudo-random.

bennadel commented 1 year ago

I tried to explore this idea visually, graphing hashes with different input-orders. At least visually, I can't see any different in terms of order -- https://www.bennadel.com/blog/4394-does-the-order-of-hash-inputs-matter-in-terms-of-uniqueness-and-distribution.htm

ericelliott commented 1 year ago

The order of hash inputs does not matter, but if you want your port to qualify as a port of CUID2, you should use SHA-3, not SHA-256. They are a completely different class of hash algorithms with different security properties. If you're having trouble finding a port, SHA-3 is also known as keccak, and you may find it in an Ethereum library because it is the hashing algorithm used to protect the security of the Ethereum network.

bennadel commented 1 year ago

Alright, at this point, I'll bow out and chalk this up a a fun learning experience. With ColdFusion, when we get into the territory of having to load 3rd-party JAR files, it's simply not a good developer experience since we never really had a good package manager like npm or maven; so, if SHA-256 isn't sufficient, then the barrier to entry becomes non-trivial.

I might keep trying to clean up the code (for my own benefit); and, if people ever what to try it out in Java 9+, they can just edit the line of code that chooses the algorithm since SHA3 was introduced natively in Java 9.

Anyway, thanks for the conversation! This has given me lots to think about.

ericelliott commented 1 year ago

It's crazy to me that SHA-3 doesn't exist in ColdFusion. It has been the NIST standard recommendation since 2015. 🙈

Given that using the correct hashing algorithm is not just a matter of importing an existing package, I'll reluctantly make an exception and accept SHA-256.

bennadel commented 1 year ago

@ericelliott it's all good, don't worry about it. ColdFusion is built on top of Java; and it seems that there is a Java version of the CUID2, so ColdFusion devs can always just use that one. I was only hoping to create something more native. And, to be fair, if ColdFusion is running on top of Java 9+ , SHA3 is available - I just know that a lot of people still use Java 8. Though, maybe that's an assumption that I need to check as well - I never quite know how fast things get adopted.

bennadel commented 1 year ago

One thing I could do is try to use the sha3-256 and if it throws an error, I can fallback to sha-256. Of, I suppose I could just make the algorithm selection an argument to the component constructor. Let me try the latter and see how it looks.

ericelliott commented 1 year ago

The fallback sounds like a great plan. Can you do it once and keep it selected instead of making it a parameter?

bennadel commented 1 year ago

Totally, I wouldn't want to try it over and over since it should be static in the given environment.

bennadel commented 1 year ago

Ok, I ended up just making algorithm an argument you could pass-in during initialization. It defaults to sha3-256, but will also accept sha-256 (any other algorithm will throw a validation error).

https://github.com/bennadel/CUID2-For-ColdFusion/commit/d5d1092baade1b69b651c6fa9fdcec5793785e88#diff-6c7d2837d9685ccf7a053fcc42b9a06fade9f2bccd1d8bb6666907b56fbd7725R51

Since I'm in Java and have access to the SecureRandom, my entropy algorithm is somewhat simplified. I generate 64-bytes or random data, then add the time bytes, the counter bytes, and the fingerprint bytes. I don't end up using any of the Prime number stuff or the salt stuff since I assume that is subsumed by the 64-bytes of random data.

https://github.com/bennadel/CUID2-For-ColdFusion/blob/master/lib/Cuid2.cfc#L121-L131

ericelliott commented 1 year ago

Let's get this port added to the docs. I noticed my name is misspelled in the code comment at the top of the file.

bennadel commented 1 year ago

@ericelliott will get that fixed, sorry!

ericelliott commented 1 year ago

Are we ready to add this port to the docs?

bennadel commented 1 year ago

@ericelliott yes, how can I help get this done? Should I fork and create a PR?

ericelliott commented 1 year ago

Done.