Open janosh opened 9 months ago
I'm not against using shorter IDs, especially since they're currently encoded as string not bson uuids. Is there a way to generate those short IDs without an additional dependency?
Yes, we could do
import random
import string

def get_random_id():
    """Return a random 11-character YouTube-like ID."""
    chars = random.choices(string.ascii_letters + string.digits + "-_", k=11)
    return "".join(chars)
Maybe pastibility would be better if we drop the dash and underscore.
def get_random_id():
    """Return a random 11-character alpha-numeric ID."""
    chars = random.choices(string.ascii_letters + string.digits, k=11)
    return "".join(chars)
I think that making the uuid shorter and easier for copy/paste would indeed be beneficial, but I have a few notes that may be considered before proceeding:
uuid. Changing that would require some refactoring and possibly backward-incompatible changes; not changing it means that a uuid attribute contains something that is not a proper uuid.
@gpetretto Great points both!
jf- which would allow identifying them via id.startswith('jf-') and would even allow future changes of the ID format without breaking these checks.
uuid to id (or jf_id) since keeping uuid would be misleading.
Thanks for considering these points. I have a few more comments.
If changing the id format is an option, it may also be convenient to consider other algorithms and their features. For example something sortable. Here are a few examples I have come across:
Python implementations are typically available and they probably won't require further dependencies, but I suppose it would also be easy to add a simple reimplementation inside jobflow.
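For reference, a dependency-free ULID-style generator could look roughly like the sketch below. It assumes the usual ULID layout (a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters). `new_ulid` and `CROCKFORD` are hypothetical names, not existing jobflow functions, and the sketch does not guarantee ordering for IDs created within the same millisecond.

```python
import os
import time
from typing import Optional

# Crockford base32 alphabet (no I, L, O, U). It is in ASCII order,
# so sorting the encoded strings also sorts by the encoded number.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms: Optional[int] = None) -> str:
    """Return a 26-character ULID-style string (illustrative sketch only)."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if not 0 <= timestamp_ms < 1 << 48:
        raise ValueError("timestamp must fit in 48 bits")
    # 48 timestamp bits in the high part, 80 random bits in the low part.
    value = (timestamp_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits, so the top 2 bits are padding
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

Because the timestamp occupies the leading characters, a plain string sort orders such IDs by creation time at millisecond resolution.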
I was looking at NanoID when opening this issue but ULID looks even better. Only thing is 26 vs 36 is not that much shorter.
It is not 11^64 but 64^11 ~ 7.37E19
Oops, rookie mistake 😅
Maybe I'm not thinking hard enough about it, but is there any reason not to have the unique ID simply be the datetime + a small identifier?
from datetime import datetime, timezone
time_now = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S-%f")
f"{time_now}-{SMALL_ID_GOES_HERE}"
Then it would be interpretable but also with low probability of clashing even if SMALL_ID_GOES_HERE is not particularly diverse.
I guess mine isn't shorter though, so I take it back. 😅
The sortable point from @gpetretto is pretty important. Some of the more complex workflows in emmet require the IDs to be sorted (ideally based on creation time). So if we move away from UUID4 (which I think is random), I think this is actually a requirement.
Otherwise, we run into problems like you see here: https://github.com/materialsproject/atomate2/pull/655#issuecomment-1891260166
Agreed that given the emmet limitations, having a sortable id is essential.
Of the sortable options mentioned
Are there any other options/suggestions on what to use?
Just chiming in to say that, while it doesn't help now, looks like uuidv7 might address some of these needs in the future (e.g. for other projects). https://buildkite.com/blog/goodbye-integers-hello-uuids
Cool! uuidv7 support in std lib is tracked in https://github.com/python/cpython/issues/89083 with a planned release in Python 3.13.
We could install uuid7 in the meantime or copy-paste this function into jobflow
@janosh just a note: https://github.com/stevesimmons/uuid7/issues/1#issuecomment-1663402643
Maybe just use UUID1 for now, and when we add support for UUIDs in emmet we can write it with both UUID1 and UUID7 in mind.
I think that makes the most sense of the available options.
Ok, happy to go with that. @jmmshn would you be able to submit a PR? I will then release a new version of jobflow.
OK!
I think the initial points raised by @janosh in the first message are also worth considering, i.e. the ability to quickly copy/paste and having a shorter ID. uuid1 and uuid7 address the sortability, but not the other two issues. ULID would address all of them (10 characters shorter, no hyphens, sortable). Why not consider that instead?
I would also be happy with ULID. The only drawback is that we would need to either:
What I'm not clear about is whether we'd also need to do the same for emmet in order to use the sorting features.
Ditto @janosh and @gpetretto on the size thing. Since the UUID bit of code in jobflow is so small, can't we just move it all to emmet core?
Jobflow doesn't depend on emmet though (only atomate2 does).
From some quick reading, it seems like the sorting should be trivial without needing ULID as a dependency.
Maybe in monty? Both emmet and jobflow depend on monty.
I think we can just get a PR going with UUID1 for now and add others later.
I would like to point out two other smaller issues with uuid1:
- It is not really sortable: the order needs to be reconstructed at a later time, which is definitely a downside since direct sorting could be useful in DB queries.
- For a single user, the number of machines used to generate new uuids is likely limited. New workflows are typically generated on the same machine and then added to the DB, so for a single user uuid1 can be seen as a very long id that just contains the timestamp.
As a general comment, I would avoid changing the id generation algorithm multiple times, as this may be confusing or lead to unexpected issues. Even changing it once may be a reason for concern. According to the website linked above, the definition of the uuid7 standard should be in its final stage, so I am wondering whether it would be better to wait for its release in order to minimize the number of changes.
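To illustrate the first uuid1 point: the string form of a UUIDv1 starts with the low 32 bits of its timestamp, so plain string order wraps around roughly every 429 seconds (2^32 ticks of 100 ns). The order can be reconstructed from the `time` attribute of Python's `uuid.UUID`, as in this minimal sketch. `sort_by_creation` is a hypothetical helper name, and the sketch assumes all IDs are valid UUIDv1 strings.

```python
import uuid
from typing import Iterable, List


def sort_by_creation(ids: Iterable[str]) -> List[str]:
    """Sort UUIDv1 strings by their embedded 60-bit timestamp.

    uuid.UUID.time reassembles the timestamp from the time_low,
    time_mid and time_hi fields, which are stored out of order
    in the string representation.
    """
    return sorted(ids, key=lambda s: uuid.UUID(s).time)
```

This works in Python, but a database query sorting on the stored string would not give the same order, which is the downside raised above.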
I also think it might be better to not change things too many times as jobflow (and atomate2) are already used in production by many people. One thing that was discussed above was also the renaming of the uuid attribute (to id or jf_id). This might also be something to keep in mind (as @gpetretto mentioned, having a uuid attribute which is not a "proper" uuid could make things confusing). Such a change could be done earlier with a deprecation. Not sure I would call it id or jf_id though. Maybe uid (for unique id)?
I agree with the naming issue, and a change to uid is probably needed.
I think we can enable support for multiple id types, keep it at uuid4 as default for now, and swap it over to ulid or uuid7 at a later time.
This will give me some kind of sortable id immediately so I can proceed with my stuff (just have a hack in a comparator for the uuid1's), then we can wait a while to swap it over to ulid as default after we have all used it in testing for a while.
Also, if someone is already committed to using uuid4 they can just tweak their jobflow.yaml and not disrupt anything even after the switch.
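One way to picture the "multiple id types with a configurable default" idea is a small registry keyed by a setting value. This is only a sketch under assumptions: `ID_GENERATORS` and `make_id` are hypothetical names, not jobflow's actual API, and how the value would be read from jobflow.yaml is left out.

```python
import uuid
from typing import Callable, Dict

# Hypothetical registry mapping a config value to an ID factory.
# Further entries (e.g. "ulid", "uuid7") could be added later.
ID_GENERATORS: Dict[str, Callable[[], str]] = {
    "uuid1": lambda: str(uuid.uuid1()),
    "uuid4": lambda: str(uuid.uuid4()),
}


def make_id(kind: str = "uuid4") -> str:
    """Return a new ID string of the requested kind (default unchanged: uuid4)."""
    try:
        generator = ID_GENERATORS[kind]
    except KeyError:
        known = ", ".join(sorted(ID_GENERATORS))
        raise ValueError(f"unknown id type {kind!r}; expected one of: {known}") from None
    return generator()
```

With this shape, changing the default later is a one-line change, and users who pin "uuid4" in their settings would see no difference.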
One important factor here is that MongoDB does have a native UUID datatype, which reduces the number of bytes to store it compared to a string representation. Since ULID is bit-compatible with UUID, it’s probably possible to switch away from using a string too.
That said, I’m not convinced about the argument against uuid7 if it’s a standard and implementations are imminent. If the size on disk is the same, I’m not sure it matters if the string representation has hyphens or not — and indeed, the hyphens help with readability.
> Maybe just use UUID1 for now, and when we add support for UUIDs in emmet we can write it with both UUID1 and UUID7 in mind.
A few comments on UUIDv1 and UUIDv7:
ULID was already merged into emmet, but I agree with your points @dpldgr. I’d much rather we try to stick with a standard; it seems safe to adopt UUIDv7 given how close it is to being finalized.
So things on both emmet and jobflow are kept flexible for the time being and the default behavior is basically "nothing changes" but if you want to use ULID or UUIDv1 for any reason right now you can.
It might make sense to have a couple of full builds of MP with atomate2 data first before we fully commit to a convention change. But one of the MP builders will have to chime in and participate on that.
@jmmshn --- out of curiosity and somewhat independent of integration in jobflow, do you have a recommended library to use UUIDv7 today (acknowledging it's possible, but unlikely, it'll substantially change in the future)? https://github.com/oittaa/uuid6-python?
This library looks good:
https://github.com/aminalaee/uuid-utils
I would avoid the “uuid7” library since apparently it’s non-compliant.
Discussion on uuid7 making it into the stdlib here:
https://discuss.python.org/t/add-uuid7-in-uuid-module-in-standard-library/44390/5
> So things on both emmet and jobflow are kept flexible for the time being and the default behavior is basically "nothing changes" but if you want to use ULID or UUIDv1 for any reason right now you can.
From what I can tell, ULID is very similar to UUIDv7 in bit layout. It uses the same bit layout for the timestamp, but omits the variant and version fields that all UUIDs have (they have random values instead). So you could trivially make a ULID into a UUIDv7 by setting the variant field (bits 64-65) to binary 10 and the version field (bits 48-51) to binary 0111. That would make for a much easier potential migration path than converting from UUIDv1 to UUIDv7.
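The bit manipulation described above can be sketched as follows. It assumes the ULID is available as its raw 16 bytes in big-endian order (timestamp first). `ulid_bytes_to_uuid7` is a hypothetical helper name. Note that the conversion overwrites 6 of the ULID's random bits, so it is one-way.

```python
import uuid


def ulid_bytes_to_uuid7(raw: bytes) -> uuid.UUID:
    """Stamp UUIDv7 version and variant bits onto a 128-bit ULID value."""
    if len(raw) != 16:
        raise ValueError("a ULID is exactly 16 bytes")
    value = int.from_bytes(raw, "big")
    # Version field: bits 48-51 counted from the most significant bit,
    # i.e. a 4-bit field whose lowest bit sits at shift 128 - 52 = 76.
    value &= ~(0xF << 76)
    value |= 0x7 << 76
    # Variant field: bits 64-65, i.e. a 2-bit field at shift 128 - 66 = 62.
    value &= ~(0x3 << 62)
    value |= 0b10 << 62
    return uuid.UUID(int=value)
```

The leading 48 timestamp bits are untouched, so the relative order of IDs from different milliseconds is preserved after conversion.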
> I would avoid the “uuid7” library since apparently it’s non-compliant.
It is compliant, but with an old version of the draft. It should either be updated or removed as a package because there's obviously plenty of scope for confusion and unexpected results.
> @jmmshn --- out of curiosity and somewhat independent of integration in jobflow, do you have a recommended library to use UUIDv7 today (acknowledging it's possible, but unlikely, it'll substantially change in the future)? https://github.com/oittaa/uuid6-python?
That one looks like a good choice to me. It's an active repo that has had plenty of releases, so you're much more likely to get support if it's needed. The 'uuid7' repo in comparison has been stagnant for quite some time.
I was looking at a wall of job IDs like this earlier and wondering if jobflow actually requires full 36-character UUIDs? YouTube for example uses 11-character IDs.
It's case-sensitive alpha-numeric plus hyphen and underscore giving $64^{11} \approx 7.4 \cdot 10^{19}$ combinations. I think there would be readability and "pastibility" improvements.
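As a rough sanity check on that number, the sketch below computes the size of the ID space and estimates the chance of at least one collision using the standard birthday approximation. It assumes IDs are drawn uniformly and independently at random. `collision_probability` is a hypothetical helper, not part of jobflow.

```python
import math


def collision_probability(n_ids: int, alphabet_size: int = 64, length: int = 11) -> float:
    """Approximate P(at least one collision) among n_ids random IDs.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2N)),
    where N = alphabet_size ** length is the number of possible IDs.
    """
    space = alphabet_size ** length
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


# 64**11 equals 2**66, about 7.4e19 possible IDs.
print(64 ** 11)
# Estimated collision chance for one million and one billion random IDs.
print(collision_probability(10 ** 6))
print(collision_probability(10 ** 9))
```

Under this approximation the collision chance is negligible for a million IDs, but the estimate grows quadratically with the number of IDs.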