Closed fxlae closed 1 year ago
@fxlae Thanks for reporting. I agree with you: we want a bijective function, the following should always hold:
typeid == encode(decode(typeid))
for all possible typeids. I'm going to update the spec so that the first character has to be 7 or less, and I'll add test cases to the spec for it.
@TenCoKaciStromy, @cbuctok, @sloanelybutsurely, @faustbrian, @akhundMurad, @broothie, @conradludgate, @Frizlab and @ongteckwu this will affect your implementations. I'll update this issue once I have the updated test cases, and that way you can all update your implementations to handle that edge case?
Makes sense to me 👍
PR under review: https://github.com/jetpack-io/opensource/pull/75
Closing. Spec now addresses this case.
typed_elixir 0.2.0
published to accommodate this change to the spec: https://diff.hex.pm/diff/typeid_elixir/0.1.0..0.2.0
swift-typeid 0.3.0 released, with overflow check and tests update.
Same for .Net Standard 2.0 pull request and NuGet.
Amazing, thank you all for the quick response!
Hi,
First of all, cool project! After implementing it myself, I wanted to share some thoughts.
Since TypeIDs have a fixed length with known padding, they can be encoded and decoded in a straightforward manner. However, this does not resolve a certain ambiguity that arises when decoding the suffix, depending on the leftmost character. This is likely already known, but I believe its implications could be made more explicit.
Imagine the first three bits of a UUID to be
100
. With padding, that would be00100
. Now, encoding is simple:encode(00100) = '4'
And so is decoding:
decode('4') = 00100
Then we strip the padding and get back our initial three bits:
100
. However,decode('c')
,decode('m')
, anddecode('w')
lead to this exact same result, as their binary representation isXX100
. After discarding the first two bits,100
remains in all cases. In short, this implies that if two TypeIDs are identical except for their leftmost suffix characters, and both characters map to the same binary representation after stripping the first two padding bits, the resulting UUID is the same. 32 TypeId suffixes that only differ in the first character map to only 8 unique UUIDs.Yes... strictly speaking, no TypeID suffix that was encoded as described in the formal specification can ever start with another character than
'0'-'7'
, as these are the only characters with a binary representation beginning with00...
, which is exactly the padding. But the specification does not explicitly restrict a TypeID suffix not to begin with'8'-'z'
, syntactically, those are still valid TypeIDs.I'm not suggesting this is a problem. The specification is not incorrect. It just does not (in mathematical terms) describe a bijective function, and I'm concerned that end users of TypeID libraries may intuitively expect the encoding and decoding process to be bijective.
An illustration
This behavior can be observed with the current implementation of the command-line tool from this repository.
First, let's decode and re-encode a TypeID suffix starting with a character from between '0' and '7':
As expected, the encoded result is equal to the original TypeID.
Now, let's take the same TypeID, but replace the leftmost character of the suffix with something between '8' and 'z', which still constitutes a syntactically correct TypeID:
But now:
prefix_81h2xcejqtf2nbrexx3vqjhp41 != prefix_01h2xcejqtf2nbrexx3vqjhp41
As mentioned above, if we try this for all 32 characters, the command-line tool decodes 32 different TypeIDs to only 8 unique UUIDs:
My thoughts:
'0'-'7'
. That's true, but the problem arises not during encoding, but during decoding. Input strings from external sources (users, clients, etc.) are not inherently trustworthy. Even syntactically correct TypeIDs lead to this ambiguity (as demonstrated above).'8'-'z'
as the leftmost characters, as no properly generated suffix should ever begin with those characters. This is what I did in my Java implementation that I submitted yesterday, because I initially assumed it was not permitted. Only later I found out that it isn't explicitly specified.I hope this feedback is in some way helpful.