jetify-com / typeid

Type-safe, K-sortable, globally unique identifier inspired by Stripe IDs
Apache License 2.0
2.92k stars 38 forks source link

On the possible ambiguity when decoding. #20

Closed fxlae closed 1 year ago

fxlae commented 1 year ago

Hi,

First of all, cool project! After implementing it myself, I wanted to share some thoughts.

Since TypeIDs have a fixed length with known padding, they can be encoded and decoded in a straightforward manner. However, this does not resolve a certain ambiguity that arises when decoding the suffix, depending on the leftmost character. This is likely already known, but I believe its implications could be made more explicit.

Imagine the first three bits of a UUID to be 100. With padding, that would be 00100. Now, encoding is simple:

encode(00100) = '4'

And so is decoding:

decode('4') = 00100

Then we strip the padding and get back our initial three bits: 100. However, decode('c'), decode('m'), and decode('w') lead to this exact same result, as their binary representation is XX100. After discarding the first two bits, 100 remains in all cases. In short, this implies that if two TypeIDs are identical except for their leftmost suffix characters, and both characters map to the same binary representation after stripping the first two padding bits, the resulting UUID is the same. 32 TypeId suffixes that only differ in the first character map to only 8 unique UUIDs.

Yes... strictly speaking, no TypeID suffix that was encoded as described in the formal specification can ever start with another character than '0'-'7', as these are the only characters with a binary representation beginning with 00..., which is exactly the padding. But the specification does not explicitly restrict a TypeID suffix not to begin with '8'-'z', syntactically, those are still valid TypeIDs.

I'm not suggesting this is a problem. The specification is not incorrect. It just does not (in mathematical terms) describe a bijective function, and I'm concerned that end users of TypeID libraries may intuitively expect the encoding and decoding process to be bijective.

An illustration


This behavior can be observed with the current implementation of the command-line tool from this repository.

First, let's decode and re-encode a TypeID suffix starting with a character from between '0' and '7':

$ typeid decode prefix_01h2xcejqtf2nbrexx3vqjhp41
type: prefix
uuid: 0188bac7-4afa-78aa-bc3b-bd1eef28d881

$ typeid encode prefix 0188bac7-4afa-78aa-bc3b-bd1eef28d881
prefix_01h2xcejqtf2nbrexx3vqjhp41

As expected, the encoded result is equal to the original TypeID.

Now, let's take the same TypeID, but replace the leftmost character of the suffix with something between '8' and 'z', which still constitutes a syntactically correct TypeID:

$ typeid decode prefix_81h2xcejqtf2nbrexx3vqjhp41
type: prefix
uuid: 0188bac7-4afa-78aa-bc3b-bd1eef28d881 # same as above

$ typeid encode prefix 0188bac7-4afa-78aa-bc3b-bd1eef28d881
prefix_01h2xcejqtf2nbrexx3vqjhp41

But now: prefix_81h2xcejqtf2nbrexx3vqjhp41 != prefix_01h2xcejqtf2nbrexx3vqjhp41

As mentioned above, if we try this for all 32 characters, the command-line tool decodes 32 different TypeIDs to only 8 unique UUIDs:

[0,8,g,r]1h2xcejqtf2nbrexx3vqjhp41 -> 0188bac7-4afa-78aa-bc3b-bd1eef28d881
[1,9,h,s]1h2xcejqtf2nbrexx3vqjhp41 -> 2188bac7-4afa-78aa-bc3b-bd1eef28d881
[2,a,j,t]1h2xcejqtf2nbrexx3vqjhp41 -> 4188bac7-4afa-78aa-bc3b-bd1eef28d881
[3,b,k,v]1h2xcejqtf2nbrexx3vqjhp41 -> 6188bac7-4afa-78aa-bc3b-bd1eef28d881
[4,c,m,w]1h2xcejqtf2nbrexx3vqjhp41 -> 8188bac7-4afa-78aa-bc3b-bd1eef28d881
[5,d,n,x]1h2xcejqtf2nbrexx3vqjhp41 -> a188bac7-4afa-78aa-bc3b-bd1eef28d881
[6,e,p,y]1h2xcejqtf2nbrexx3vqjhp41 -> c188bac7-4afa-78aa-bc3b-bd1eef28d881
[7,f,q,z]1h2xcejqtf2nbrexx3vqjhp41 -> e188bac7-4afa-78aa-bc3b-bd1eef28d881

My thoughts:

I hope this feedback is in some way helpful.

loreto commented 1 year ago

@fxlae Thanks for reporting. I agree with you: we want a bijective function, the following should always hold: typeid == encode(decode(typeid)) for all possible typeids. I'm going to update the spec so that the first character has to be 7 or less, and I'll add test cases to the spec for it.

@TenCoKaciStromy, @cbuctok, @sloanelybutsurely, @faustbrian, @akhundMurad, @broothie, @conradludgate, @Frizlab and @ongteckwu this will affect your implementations. I'll update this issue once I have the updated test cases, and that way you can all update your implementations to handle that edge case?

conradludgate commented 1 year ago

Makes sense to me 👍

loreto commented 1 year ago

PR under review: https://github.com/jetpack-io/opensource/pull/75

loreto commented 1 year ago

Closing. Spec now addresses this case.

sloanelybutsurely commented 1 year ago

typed_elixir 0.2.0 published to accommodate this change to the spec: https://diff.hex.pm/diff/typeid_elixir/0.1.0..0.2.0

Frizlab commented 1 year ago

swift-typeid 0.3.0 released, with overflow check and tests update.

cbuctok commented 1 year ago

Same for .Net Standard 2.0 pull request and NuGet.

faustbrian commented 1 year ago

https://github.com/BombenProdukt/typeid/commit/38a6710ab1bfa4aa586ef378afbf2a7ac07c59e4

conradludgate commented 1 year ago

Rust: https://github.com/conradludgate/type-safe-id/commit/9a4d3bfb10ccad6bdbbd1dec9cc5c990b51295b3

loreto commented 1 year ago

Amazing, thank you all for the quick response!