python / cpython

The Python programming language
https://www.python.org
Other
62.29k stars 29.93k forks source link

Windows uuid.uuid1 is unecessarily unsafe for concurrent use #118993

Open schlenk opened 4 months ago

schlenk commented 4 months ago

Bug report

Bug description:

Running uuid.uuid1 concurrently in multiple parallel python processes on Windows creates colliding UUIDs, due to the timestamps colliding. Can be seen with 10-20 concurent processes usually.

This is unnecessary, as python already wraps the UuidCreateSequential() API https://github.com/python/cpython/blob/9c1520244151f36e010c1b04bedf14747a28517d/Modules/_uuidmodule.c#L63 , but then discards the time part and falls back to some collision prone time code in https://github.com/python/cpython/blob/b4ca389281849e849fb58fecf9b31e2e2f5a39c1/Lib/uuid.py#L668 due to a claim that UuidCreate() does not follow RFC 4122.

But as far as i can tell, Windows UuidCreateSequential() does follow RFC 4122 by now, tested on Win10, so the comment looks wrong or badly aged. It is true, that the function returns the bytes in the wrong order, Little Endian, instead of the RFC prescribed Big Endian, but the rest looks sane.

This works just fine and creates a UUID thats pretty similar to the ones created by the current uuid.uuid1 implementation.

>>> import _uuid
>>> import uuid
>>> win_uuid = uuid.UUID(bytes_le=_uuid.UuidCreate())
>>> win_uuid.version
1
>>> win_uuid.variant
'specified in RFC 4122'
>>> win_uuid.time
139348915287723566
>>> win_uuid.node
185263057677337
>>> py_uuid = uuid.uuid1()
>>> py_uuid.version
1
>>> py_uuid.variant
'specified in RFC 4122'
>>> py_uuid.time
139348915744502628
>>> py_uuid.node
185263057677337

I would propose to remove the unsafe time fallback for Windows and use the results of UuidCreate() to make it safe.

CPython versions tested on:

3.11

Operating systems tested on:

Windows

zooba commented 1 month ago

Apologies for leaving this one for so long.

To clarify my own understanding, the reason this only affects Windows is because POSIX is always going to provide _generate_time_safe and so won't go through the pure-Python calculation? (And so a system that doesn't provide that function will also be affected, even if it's not Windows.)

It seems like UuidCreateSequential is the right alternative, though I'm a little surprised that we got to this point at all? That API has been around for a long time (the rewrite to use _uuid rather than ctypes is more recent, but that was a straight conversion and didn't involve any consideration for whether the results are RFC compliant or not).

It is true, that the function returns the bytes in the wrong order, Little Endian, instead of the RFC prescribed Big Endian, but the rest looks sane.

You just mean we need to construct with bytes_le= rather than bytes=, correct? It doesn't change the values compared to the current implementation? (It's compliant enough with the spec to do this, in that the RFC only specifies network order "in the absence" of another specification - Windows specifies that it uses native endian for each field, so it wins here.)

If so, then the entire proposal is to simply add if _UuidCreate is not None: return UUID(bytes_le=_UuidCreate()) plus overwriting the two additional parameters if specified. Have I understood the situation well enough?

schlenk commented 1 month ago

To clarify my own understanding, the reason this only affects Windows is because POSIX is always going to provide _generate_time_safe and so won't go through the pure-Python calculation? (And so a system that doesn't provide that function will also be affected, even if it's not Windows.)

Yes. If there is no HAVE_UUID_GENERATE_TIME_SAFE define, which seems to test some function from uuid.h, i think it would also affect other systems. https://github.com/python/cpython/blob/9c1520244151f36e010c1b04bedf14747a28517d/Modules/_uuidmodule.c#L101

But the point is mostly that you have no alternatives on POSIX systems without that libuuid function, while you have a perfectly fine alternative on Win32 that is just not used.

It seems like UuidCreateSequential is the right alternative, though I'm a little surprised that we got to this point at all?

This was already broken since forever, i had a local patch against some ancient Py 2.6 version, but assumed it got fixed in Py3, without checking at that time. And got surprised to find it still happening in Py3, when finally getting off the ancient codebase.

It doesn't change the values compared to the current implementation?

I saw a small time offset, see console listing above, which might be caused by different clocks used internally. So the pure python version creates slightly different timestamps than the UuidCreateSequential API. I would assume this does not cause an issue.

If so, then the entire proposal is to simply add if _UuidCreate is not None: return UUID(bytes_le=_UuidCreate()) plus overwriting the two additional parameters if specified. Have I understood the situation well enough?

Yes. This should also set the is_safe parameter to SafeUUID.safe, as the UUIDs created by the UuidCreateSequential() are as safe as the ones by libuuid.