ShortUUID should not sort the alphabet

snstanton commented 2 years ago

I am attempting to convert a shortuuid generated using node.js and the Flicker base 58 alphabet:

shortid = 'apECAPA6RdHB6HB51FwsdN'
FLICKR_BASE58 = '123456789abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ'
uuid = shortuuid.ShortUUID(alphabet=FLICKR_BASE58).decode(shortid)

Unfortunately this doesn't work:

Traceback (most recent call last):
  File "<ipython-input-54-88d3ac3dd885>", line 3, in <module>
    uuid = shortuuid.ShortUUID(alphabet=FLICKR_BASE58).decode(shortid)
  File "/Users/scotts/.pyenv/versions/python38/lib/python3.8/site-packages/shortuuid/main.py", line 79, in decode
    return _uu.UUID(int=string_to_int(string, self._alphabet))
  File "/Users/scotts/.pyenv/versions/3.8.7/lib/python3.8/uuid.py", line 205, in __init__
    raise ValueError('int is out of range (need a 128-bit value)')
ValueError: int is out of range (need a 128-bit value)

It turns out that ShortUUID is sorting the alphabet before using it. This is incorrect and leads to this failure. It is possible to correctly decode the id like this:

import uuid as _uu
uuid = _uu.UUID(int=string_to_int(shortid, FLICKR_BASE58))

This is a backwards compatibility issue, so switching to an unsorted implementation will probably require a compatibility flag similar to legacy.

skorokithakis commented 2 years ago

Hmm, this issue is that the ShortUUID is incompatible with a node.js library, but why is the issue filed here? What makes the node.js library "correct" and ShortUUID "incorrect"?

snstanton commented 2 years ago

That is the symptom. The issue is that the ShortUUID library is changing the alphabet that it was passed in rather than using it as is. Rather than enforcing an ASCII sorting order on the alphabet, it should encode using the alphabet as presented, or at least provide that as an option.

skorokithakis commented 2 years ago

Sure, it also removes duplicates. Why shouldn't it?

snstanton commented 2 years ago

How about raising an exception for duplicates instead of mutating the data.

skorokithakis commented 2 years ago

It could do that, but why? You pass in an alphabet, you get output in that alphabet. You encode, it works, you decode, it works. Why does the order of the alphabet matter to you?

snstanton commented 2 years ago

Because not all short UUIDs are generated by this library. I would like to use the library to decode UUIDs generated by external sources, not just round trips from this library. The underlying code is perfectly capable of doing this. The only issue is the outermost ShortUIUD class that modifies the given alphabet rather than using it as given. The library would be more useful if it could handle both scenarios directly.

skorokithakis commented 2 years ago

Right, but the first issue with that is that this library doesn't have interoperability with other libraries as a goal, and the second issue is, why is this library wrong? Why shouldn't the external sources just sort their alphabets?

snstanton commented 2 years ago

I don't have any control over what other developers are doing with their code. If there is an encoded uuid in data I am trying to process, I need a way to decode it that works with the existing format. Just because this library's primary goal is to encode and decode uuids generated by one program does not mean that's the only thing it could be used for. I'm not saying what it it is doing now is "wrong". I'm saying it would be useful to have the option to not sort the alphabet so I can use the library for an adjacent use case. The beauty of a good library is that it can be used for things the original author did not envision.

skorokithakis commented 2 years ago

Certainly, but that would break backwards compatibility, so I have to weigh the usefulness of the new feature against the hassle of breaking everything for other users.

snstanton commented 2 years ago

That's why I suggested making it an optional flag.

skorokithakis commented 2 years ago

This is a very niche use case, and I'm not sure it's worth the extra interface/documentation/effort. Can you not just set _alphabet directly to the unsorted alphabet?

snstanton commented 2 years ago

I've already worked around the issue with the following code using the lower level string_to_int function:

def decode_shortuuid(value):
    # We have to use the lower level interface because the Flickr alphabet isn't in ASCII order
    # and ShortUUID sorts the alphabet before using it.
    return uuid.UUID(int=string_to_int(value, FLICKR_BASE58))

I don't need the change. The only reason I filed the ticket is because someone else might need similar functionality. Clearly you disagree, so feel free to close the ticket.

tino commented 2 years ago

Sorting isn't necessary to prevent duplication, right?

skorokithakis commented 2 years ago

It's not, but if you convert the list to a set to deduplicate easily you lose order guarantees, so you sort to go back to a repeatable order.

troyswanson commented 2 years ago

It would be nice for an official "Short UUID" specification so that the community could (potentially) guarantee interoperability between implementations across languages. No idea how to accomplish that, but it would be cool to see.

skorokithakis commented 2 years ago

That would be a great idea, there would be the problem of having everyone agree and obsoleting a ton of data when the libraries changed, though.

skorokithakis commented 2 years ago

Closing this for now.

skorokithakis / shortuuid

ShortUUID should not sort the alphabet #68