Why UUIDs? - Githubissues

janosh commented 9 months ago

I was looking at a wall of job IDs like this earlier and wondering if jobflow actually requires full 36-character UUIDs?

PhononUUIDs(optimization_run_uuid='cc10cd4d-944d-43b1-9651-2e168765c02e', displacements_uuids=['c29abb65-a8e5-47ff-a6d6-6776cd8e4095', 'e2e07788-daed-4201-921b-ec2e57c802c4', 'bf443b3d-6334-4b04-8f8a-29fabc08c37e', '23c36a18-4066-4047-a553-aa9fe2afa714', '81134dc7-a88f-486a-af3b-ddd9c1ba3eb1', '1009897b-a97c-4d4a-9cac-e952766b9608', '15015ddf-7967-410d-8fef-13642d7bd9e5', '1899803f-76c1-4768-bf23-c4d61fb51691', 'd39ce806-7403-495e-89a4-a4de961590bf', '35bb27e8-7f31-44ec-b684-0f9391b4228e', 'c6f188fb-2d26-47ff-b0f8-8438fae8ff58', '06b2249c-e7c6-4cf8-91b0-f572fccc209c', '18bbc0be-be4c-4131-8ef6-f6742c899acd', '8303eb32-162a-4a65-b75f-d8586aa23636', '392282e2-cbe6-447e-b7a9-e20a02e57cbe', '835b35f9-413c-479c-944d-14d9d4649b71', 'c59f17e4-1f9b-4e65-a9f2-dc9fc7314283', '95d9e094-802b-45c5-8630-3b0e8e6c3b64', 'dea7a13c-6921-42f5-adf5-7a4670319a7b', '6faa0b95-0058-4476-bd55-109d3746a334', '6a6fbce4-41b3-4b7a-8d69-0b6ffab7716c', '0ef5726d-3b62-4464-a5e9-f33237f59d83', '3de16018-90ad-4da7-9885-86a783d30c1c', '5c779412-0c29-423b-ae7b-c6eade5124bc', 'adc6ef3c-fc70-43cf-a819-741b45d0ec7f', '5b47a7ff-bb6d-45f4-b442-1183db8a58df', 'e01a19d7-3685-465a-a93b-bfa3e876081c', '79addbaa-32fb-4c5c-82c2-23c01dd9cd89', '6232ed4f-8b57-40a9-9fdf-0956913a84f6', '772a84ea-9f22-4167-8887-dc8ea46b2ad7', 'e74100ac-f17f-4fdc-81b9-18b2e3c69064', '7d0ba392-0eef-4da6-8d31-6eb84223dcc3', '98c688d5-5aa2-49cb-ac77-847b04abd145', '3e4a0fd0-d383-4b1a-b20c-56794fd9a096', 'ed466545-1988-4d52-8dec-835075a6049d', 'bb829a61-6e91-481f-b90e-ffa1baa0e876', '6e49c916-2197-4ff8-a582-52e751d20837', '6a5a7a71-3721-44e3-bd93-08c8128ef52c', '6dcba672-d54e-4900-b807-737f6adfc6e3', '6f13d708-a1cf-4884-b748-dcc932e753b9', '5721c7af-932b-4250-81f9-ab88f219a98f', '6f754a3f-81f6-46be-9160-95ffd0af6f9b', '723b9f8c-11ea-4261-801f-f02e1b4aa518', '90889945-059e-4777-9b94-a473ddacb185', '27066b5b-5c80-408b-a8aa-cae9b0304a26', '3389b0c1-e39d-4fba-b175-986b12c2baeb', '8185facc-f20b-4261-b8d2-cf584f50c151', '23251773-7ea8-42d5-a791-303240af896c', '5605e378-5648-44ab-b2dd-1caf44267c00', '6df6ccb8-d26e-4bbf-8285-18fdfe1ae64e', '2af8cf20-c1ee-48ae-bc52-dacfb60d9bf2', '9725444a-b59f-499b-9293-51641a65f689', '9c262ca0-15d3-4a78-8e53-a8643c0e46da', '39ebce16-b68c-4edc-8e2b-00d4cf677931', '267b87e0-ff6d-4832-a703-6d3cbc127e24', '186e0ac2-57a7-424f-8e95-09ddf6e25a86', '16a2f8cc-73ab-46ea-bc33-b6304250376e', '8fb0d733-8c8c-4df3-97f9-68937ea92b73', '788bc502-840a-4283-995b-6edf6df57458', 'f390865a-da66-4241-9bee-763279e050f9', 'c433c10b-64e1-4ace-aa07-13af868fbbc7', '23384eb4-21e3-4819-86d7-1f60a579d839', 'aa346ded-5d7a-4aff-a16a-413250513b40', 'b5a64bbd-ae68-49e2-8b60-f04e3e80158d', '5d392703-1fcd-42e1-a87f-8da4c2be82df', '6b3c4b43-2968-4a95-b32b-bf9be92aea6b'], static_run_uuid='4afd1757-09f3-44c4-ad15-93d099e989d0', born_run_uuid=None))

YouTube for example uses 11-character IDs?

https://youtu.be/R3N6BocbknM

It's case-sensitive alpha-numeric plus hyphen and underscore giving $64^{11} \approx 7.4 \cdot 10^{19}$ combinations. I think there would be readability and "pastibility" improvements.

utf commented 9 months ago

I'm not against using shorter IDs, especially since they're currently encoded as string not bson uuids. Is there a way to generate those short IDs without an additional dependency?

janosh commented 9 months ago

Yes, we could do

import random
import string

def get_random_id():
    """Return a random 11-character YouTube-like ID."""
    chars = random.choices(string.ascii_letters + string.digits + "-_", k=11)
    return "".join(chars)

Maybe pastibility would be better if we drop the dash and underscore.

def get_random_id():
    """Return a random 11-character alpha-numeric ID."""
    chars = random.choices(string.ascii_letters + string.digits, k=11)
    return "".join(chars)

gpetretto commented 9 months ago

I think that making the uuid shorter and easier for copy/paste would indeed be beneficial, but I have a few notes that may be considered before proceeding:

in jobflow-remote, in the CLI, I am using the structure of the UUID to distinguish if a user passes a uuid or a db_id. This is not a blocking issue, but I thought it was worth mentioning it. Maybe the new id structure could be made so to be clearly identifiable?
in the databases and in the code the UUIDs are always referred as uuid. Changing that would require some refactoring and maybe backward incompatible changes, not changing means that a uuid attribute contains something that is not a proper uuid

janosh commented 9 months ago

@gpetretto Great points both!

We could prefix the IDs with jf- which would allow identifying them via id.startwith('jf-') and would even allow future changes of the ID format without breaking these checks.
If we go ahead with this change, I vote to rename from uuid to id (or jf_id) since keeping uuid would be misleading.

gpetretto commented 9 months ago

Thanks for considering these points. I have a few more comments.

There is an error in the estimate of the number of possible combinations. It is not 11^64 but 64^11 ~ 7.37E19 combinations, which is considerably smaller than the number of different UUIDs (5.3E36). I suppose the possibility of collision is still rather low, but if this option is chosen, it may be convenient to increase the size of the string.
If changing the id format is an option, it may also be convenient to consider other algorithms and their features. For example something sortable. Here are a few examples I have come across:
- ULID
- Snowflake_id
- NanoID
- CUID
python implementations are typically available and they probably won't require further dependencies, but I suppose it would also be easy to add a simple reimplementation inside jobflow.

janosh commented 9 months ago

I was looking at NanoID when opening this issue but ULID looks even better. Only thing is 26 vs 36 is not that much shorter.

It is not 11^64 but 64^11 ~ 7.37E19

Oops, rookie mistake 😅

Andrew-S-Rosen commented 9 months ago

Maybe I'm not thinking hard enough about it, but is there any reason not to have the unique ID simply be the datetime + a small identifier?

time_now = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S-%f")
f"{time_now}-{SMALL_ID_GOES_HERE}"

Then it would be interpretable but also with low probability of clashing even if SMALL_ID_GOES_HERE is not particularly diverse.

I guess mine isn't shorter though, so I take it back. 😅

jmmshn commented 8 months ago

The sortable point from @gpetretto is pretty important. Some of the more complex workflows in emmet require the IDs to be sorted (ideally based on creating time). So if we move away from UUID4 (which I think is random), I think this is actually a requirement.

Otherwise, we run into problems like you see here: https://github.com/materialsproject/atomate2/pull/655#issuecomment-1891260166

utf commented 8 months ago

Agreed that given the emmet limitations, having a sortable id is essential.

Of the sortable options mentioned

UUID1 is 36 characters
ULID is 26 characters
Snowflake-id is 19 characters but all numbers and quite difficult to tell the difference between ids.
NanoID and CUID are not sortable as far as I can tell.

Are there any other options/suggestions on what to use?

Andrew-S-Rosen commented 8 months ago

Just chiming in to say that, while it doesn't help now, looks like uuidv7 might address some of these needs in the future (e.g. for other projects). https://buildkite.com/blog/goodbye-integers-hello-uuids

janosh commented 8 months ago

Cool! uuidv7 support in std lib is tracked in https://github.com/python/cpython/issues/89083 with a planned release in Python 3.13.

We could install uuid7 in the meantime or copy-paste this function into jobflow

Andrew-S-Rosen commented 8 months ago

@janosh just a note: https://github.com/stevesimmons/uuid7/issues/1#issuecomment-1663402643

jmmshn commented 8 months ago

Maybe just use UUID1 for now, and when we add the support for UUIDS in emmet we can write it with both UUID1 and UUID7 in mind.

Andrew-S-Rosen commented 8 months ago

I think that makes the most sense of the available options.

utf commented 8 months ago

Ok, happy to go with that. @jmmshn would you be able to submit a PR? I will then release a new version of jobflow.

jmmshn commented 8 months ago

OK!

gpetretto commented 8 months ago

I think that also the initial points raised by @janosh in the first message are worth considering, i.e. the ability to quickly copy/paste and having a shorter ID. uuid1 and uuid7 address the sortability, but not the other two issues. ULID would address all of them (10 characters shorter, no hyphens, sortable). Why not considering that instead?

utf commented 8 months ago

I would also be happy with ULID. The only drawback is that we would need to either:

Add ULID as a dependency
Copy the code into jobflow.

What I'm not clear about is whether we'd also need to do the same for emmet in order to use the sorting features.

jmmshn commented 8 months ago

Ditto @janosh and @gpetretto on the size thing. since the UUID bit of code in jobflow is so small can't we just move it all to emmet core?

utf commented 8 months ago

Jobflow doesn't depend on emmet though (only atomate2 does).

From some quick reading, it seems like the sorting should be trivial without needing ULID as a dependency.

davidwaroquiers commented 8 months ago

Maybe in monty? Both emmet and joblow depend on monty.

jmmshn commented 8 months ago

I think we can just geta PR going with UUID1 for now and add others later.

gpetretto commented 8 months ago

I would like to point out two other smaller issues with uuid1: 1) it is not really sortable. The order needs to be reconstructed at a later time. Which is definitely a downside as it could be useful in DB queries. 2) For a single user the number of machines used to generate a new uuid are likely limited. New workflows are typically generated on the same machine and then added to the DB. So for a single user uuid1 can be seen as a very long id that just contains the timestamp.

As a general comment, I would avoid changing multiple times the algorithm for the id generation, as this may be confusing or leading to unexpected issues. Even changing it once may be a reason for concern. According to the website linked above, the definition of the uuid7 standard should be in its final stage. So I am wondering if it would not be better to wait for its release in order to minimize the number of changes.

davidwaroquiers commented 8 months ago

I would like to point out two other smaller issues with uuid1:

it is not really sortable. The order needs to be reconstructed at a later time. Which is definitely a downside as it could be useful in DB queries.

For a single user the number of machines used to generate a new uuid are likely limited. New workflows are typically generated on the same machine and then added to the DB. So for a single user uuid1 can be seen as a very long id that just contains the timestamp.

As a general comment, I would avoid changing multiple times the algorithm for the id generation, as this may be confusing or leading to unexpected issues. Even changing it once may be a reason for concern. According to the website linked above, the definition of the uuid7 standard should be in its final stage. So I am wondering if it would not be better to wait for its release in order to minimize the number of changes.

I also think it might be better to not change things too many times as jobflow (and atomate2) are already used in production by many people. One thing that was discussed above was also the renaming of the uuid attribute (to id or jf_id). This might also be something to keep in mind (as @gpetretto mentioned, having a uuid attribute which is not a "proper" uuid could make things confusing). Such a change could be done earlier with a deprecation. Not sure I would call it id nor jf_id though. Maybe uid (for unique id) ?

jmmshn commented 8 months ago

I agree with the naming issue, and a change to uid is probably needed. I think we can enable support for multiple id types, keep it at uuid4 as default for now, and swap it over to ulid or uuid7 at a later time. This will give me some kind of sortable id immediately so I can proceed with my stuff (just have a hack in a comparator for the uuid1's) then we can wait a while to swap it over to ulid as default after we have all used it in testing for a while.

Also, if someone is already committed to using uuid4 they can just tweak their jobflow.yaml and not disrupt anything even after the switch.

mkhorton commented 8 months ago

One important factor here is that MongoDB does have a native UUID datatype, which reduces the number of bytes to store it compared to a string representation. Since ULID is bit-compatible with UUID, it’s probably possible to switch away from using a string too.

mkhorton commented 8 months ago

That said, I’m not convinced about the argument against uuid7 if it’s a standard and implementations are imminent. If the size on disk is the same, I’m not sure it matters if the string representation has hyphens or not — and indeed, the hyphens help with readability.

dpldgr commented 8 months ago

Maybe just use UUID1 for now, and when we add the support for UUIDS in emmet we can write it with both UUID1 and UUID7 in mind.

A few comments on UUIDv1 and UUIDv7:

UUIDv1 is not easily sortable in the same way that UUIDv7 is because of it's bit layout, so I would recommend against it's use if performant sorting by timestamp is important.
UUIDv7 is close to being made a standard but has not been finalised yet. It's close enough that you could with high probability rely on the bit layout not to change: https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/14/
A 3rd option would be to use UUIDv8 which has a much more flexible layout: just about any bit layout is valid as long as the version and variant fields are set correctly.

mkhorton commented 8 months ago

ULID was already merged into emmet, but I agree with your points @dpldgr. I’d much rather we try to stick with a standard, it seems safe to adopt UUIDv7 given how close it is to being finalized.

jmmshn commented 8 months ago

So things on both emmet and jobflow are kept flexible for the time being and the default behavior is basically "nothing changes" but if you want to use ULID or UUIDv1 for any reason right now you can.

It might make sense to have a couple of full builds of MP with atomate2 data first before we fully commit to a convention change. But one of the MP builders will have to chime in and participate on that.

Andrew-S-Rosen commented 8 months ago

@jmmshn --- out of curiosity and somewhat independent of integration in jobflow, do you have a recommended library to use UUIDv7 today (acknowledging it's possible, but unlikely, it'll substantially change in the future)? https://github.com/oittaa/uuid6-python?

mkhorton commented 8 months ago

This library looks good:

https://github.com/aminalaee/uuid-utils

I would avoid the “uuid7” library since apparently it’s non-compliant.

Discussion on uuid7 making it into the stdlib here:

https://discuss.python.org/t/add-uuid7-in-uuid-module-in-standard-library/44390/5

dpldgr commented 8 months ago

So things on both emmet and jobflow are kept flexible for the time being and the default behavior is basically "nothing changes" but if you want to use ULID or UUIDv1 for any reason right now you can.

From what I can tell, ULID is very very similiar to UUIDv7 in bit layout. It uses the same bit layout for the timestamp, but omits the variant and version fields that all UUIDs have (they have random values instead). So you could trivially make a ULID into a UUIDv7 by setting the variant field (bits 64-65) to binary 10 and the version field (bits 48-51) to binary 0111 . That would make for a much easier potential migration path than converting from UUIDv1 to UUIDv7.

I would avoid the “uuid7” library since apparently it’s non-compliant.

It is compliant, but with an old version of the draft. It should either be updated or removed as a package because there's obviously plenty of scope for confusion and unexpected results.

dpldgr commented 8 months ago

@jmmshn --- out of curiosity and somewhat independent of integration in jobflow, do you have a recommended library to use UUIDv7 today (acknowledging it's possible, but unlikely, it'll substantially change in the future)? https://github.com/oittaa/uuid6-python?

That one looks like a good choice to me. It's an active repo that has had plenty of releases, so you're much more likely to get support if it's needed. The 'uuid7' repo in comparison has been stagnant for quite some time.

materialsproject / jobflow

Why UUIDs? #519