jenssegers / optimus

🤖 Id obfuscation based on Knuth's multiplicative hashing method for PHP.
https://jenssegers.com
MIT License
1.27k stars 74 forks source link

Possible for attacker to determine prime, inverse, and random number? #52

Open nathan-io opened 4 years ago

nathan-io commented 4 years ago

Hi, thanks for this package!

I'm looking at this as an alternative to Hashids, which I'm avoiding due in part to:

Clearly if you have access to the salt, it is trivial to calculate the function in either direction so the basis for proving or disproving property 2 lies in how easy it is for an attacker to discover the secret salt.

and

" anyone using this library should assume that id's encoded by this library are fully reversible and as such it offers no security over using the raw integer ids.

source: https://carnage.github.io/2015/08/cryptanalysis-of-hashids:


With Knuth's integer hash, it seems like it would be impossible (or at least a few orders of magnitude more difficult) for an attacker to determine the prime number, inverse, and random number and defeat the obfuscation.

Am I correct in assuming this?

I couldn't find anything discussing this online. Are you aware of any research or discussion on this topic?

Obviously key obfuscation of any kind is no guarantee of security, and I have to develop the application such that it ultimately doesn't matter if an attacker gets a real id. That said, it would make me feel better to have a sense of just how hard this obfuscation would be to break, especially relative to Hashids.

Thank you for any info you can provide!

chris-morgan commented 2 years ago

I’m investigating this space as part of developing a new scheme, and figured I might as well chip in here, since I’ve analysed this in some detail (for a cryptography layman).

Optimus is very weak. If you care about valid IDs being unguessable, do not use it. I’m no cryptographer, but I can read and calculate enough to know that cryptanalysis will be very unkind to its function; I’m fairly sure that it’s a good deal worse than Hashids.

I suspect (again, as a layman) that with the default parameter sizes (31 bits), a cryptographic attack could determine the secret keys in a matter of minutes at the most, provided no more than a handful of sequential IDs. (I do not have the skill to do so myself, or the interest.)

This is the encryption algorithm:

encrypted-id = ((id × prime-key) mod 2bits) xor xor-key

… where bits defaults to 31, prime-key and xor-key are unsigned bits-bit numbers, and prime-key is a prime number.

Skipping the XOR part, which adds no cryptographic complexity, this is the algorithm of a Linear Congruential Generator of the form m a power of 2, c = 0, stepping the generator forwards one place to encrypt, and backwards one place to decrypt.

LCGs in general have some known weaknesses (read the rest of that Wikipedia article), but that specific class of LCGs is particularly bad for its weaknesses. The most catastrophic one here is that they work terribly with even seeds, meaning that half the IDs you encrypt are barely encrypted at all. I believe that with fewer than twenty sequential or randomly-spaced IDs, you may be able to calculate at least the last few bits of xor-key, and several bits of prime-key’s entropy. And it won’t take all that much more to chip away at the rest of the keys.

To demonstrate the badness of this class of LCGs, try bits = 5 and prime-key = 29, which corresponds to lcg(a=29, c=0, m=32): the cycles of this LCG are [[0], [1, 29, 9, 5, 17, 13, 25, 12], [2, 26, 18, 10], [3, 23, 27, 15, 19, 7, 11, 31], [4, 20], [6, 14, 22, 30], [8], [12, 28], [16], [24]]. 0, 8, 16 and 24 encrypt to themselves. 4 and 20 and 12 and 28 to each other. &c., and you can clearly see how much worse the even numbers are than odd. The only redeeming feature of the parameters chosen is the use of a prime a, which is more likely to produce results as good as you can get given c = 0 and m a power of 2 (though try lcg(a=31, c=0, m=32) if you dare); but compare it even to what that page describes about an LCG with well-chosen Hull-Dobell Theorem parameters, with its single cycle and better distribution, and… yeah, this one’s very, very bad.

I myself started with the LCG approach in mind, but the description of the difficulty of finding good parameters has put me off it—it’s clearly possible to do a fairly good job with LCGs (with known but entirely acceptable weaknesses, for this application), but frightfully easy to make a mess of it. I’m now going with block ciphers—somewhat more computationally expensive, but more reliable. My current prototype is Rust code and using RC5 for encryption, followed by Base58 string encoding. (I’m using RC5 to support both u32 and u64 IDs, including also a variable-width mode where the first 2³² IDs get encrypted with a 32-bit block size, and larger IDs with a 64-bit block size, thereby disclosing an ID epoch in return for getting much shorter IDs for the first four billion entities (4–6 characters, rather than 8–11). I’m also toying with other ciphers like Speck and Hasty Pudding for their particular properties.) Given that this Optimus repository is PHP, I will note that you’d have to be careful about signedness in applying any of this. (This is the sort of code that is way easier in Rust than any scripting language.)

On the bright side, it should be remembered that this particular application of encryption (IDs chosen by the server) is largely immune from chosen plaintext and chosen ciphertext attacks, so a lot of the vulnerabilities in encryption schemes become impractical. The hashids scheme is cryptographically broken, yes, but in practice the attacks won’t be viable on most systems. The Optimus scheme is I think even more broken, and probably more susceptible to known attacks, but in practice still probably sufficient. I’d suggest skipping all even IDs to avoid the main problem (I suspect that could make it stronger than hashids—gut feeling again, I’m no expert and haven’t even tried objective measurement), but… eh, if you’re just using it as a low-grade scrambler, fine, and if you do care, choose a better technique rather than trying to patch this one up, because it’s broken.

chris-morgan commented 2 years ago

I take back what I said about this being worse than Hashids. I had mostly glanced over Hashids and hadn’t actually stopped to deeply understand what it was doing, and now that I have I am quite horrified at just how atrocious it is. As a demonstration: in its default configuration, 44 sequential IDs is enough to break it completely. The attacks on Hashids are extremely viable on a large fraction of production systems.

Optimus is still very bad, but Hashids is just on a level of its own in how bad it is.

nathan-io commented 2 years ago

@chris-morgan thank you so much for this. I'm thrilled to hear that you've been exploring this area and developing a new scheme.

I'd suggest skipping all even IDs to avoid the main problem

I'd love to implement this somehow, but am wondering about the practicality.

In most applications, the RDMS is determining the primary key ID when the record is inserted. So we'd need to configure our RDMS to only use odd PK IDs. Assuming that's even possible, it feels like a messy solution.

I suppose you could instead generate some public/user-facing alternate identifier and store it in its own table column, and ensure that the number is unique and always odd. That does introduce potential for confusion, however. You'd have the PK ID, the unencoded user-facing ID, and the encoded user-facing ID.

On the bright side, it should be remembered that this particular application of encryption (IDs chosen by the server) is largely immune from chosen plaintext and chosen ciphertext attacks

This is comforting. To my understanding however, it seems there's still the potential for chosen plaintext attacks in some cases. For example, a multi-tenant SaaS application where a malicious tenant starts with their own database, and can assume a "real" primary key value of 1 on the first resource, and count up from there.

Of course, in that scenario, you may configure different prime, inverse, and random numbers for each tenant to isolate risk.

And as a safety measure in single-tenant systems, it seems to make sense to configure separate channels for each Eloquent model with IDs you want to obfuscate.