python / cpython

The Python programming language
https://www.python.org
Other
63.49k stars 30.41k forks source link

Strategy for Iterators in Free Threading #124397

Open rhettinger opened 1 month ago

rhettinger commented 1 month ago

This is an umbrella issue to capture planning and strategy discussions at the sprints.

Our draft plan has three points:

1) Add a new itertool, serialize(), which will take a non-threadsafe iterator as input and produce a new iterator that is thread-safe. Multiple threads can access the new iterator which is guaranteed to make serial (one-at-a-time) calls to the upstream iterators. This will be implemented with locks that block __next__ calls while others are pending. The implementation will not buffer calls; instead, it implements blocking to achieve serialization. If applicable, send() and throw() method calls will be forwarded as well.

2) The itertools.tee() code will have guaranteed semantics. It can take a non-threadsafe iterator from one thread as an input and provide tee objects for other threads to get reliable independent copies of the data stream. The new iterators are only thread-safe if consumed within a single thread. Internally, it buffers data to fulfill this contract.

3) Other iterators implemented in C will get only the minimal changes necessary to cause them to not crash in a free-threaded build. The edits should be made in a way that does not impact existing semantics or performance (i.e. do not damage the standard GIL build). Concurrent access is allowed to return duplicate values, skip values, or raise an exception.

eendebakpt commented 1 month ago

@rhettinger Thanks for the update. Will there be more updates (more sprints)? I have two questions about the strategy above:

i) Is the part about not impacting performance only about the GIL build, or also about single-threaded iteration in the free-threading builds?

ii) What about the following cases:

Should we guarantee "correct" iteration for these case?

serhiy-storchaka commented 1 month ago

+1 for serialize(). I've been wanting to implement this for a while, I just wasn't sure about the name and what module it should be in (itertools or threading).

Note that generator objects are not thread-safe. You cannot use the same generator objects concurrently in different threads -- you will get a RuntimeError if other thread already executes the generator code. So such wrapper was needed long before free threading.

rhettinger commented 1 month ago

[pieter]

i) Is the part about not impacting performance only about the GIL build, or also about single-threaded iteration in the free-threading builds?

Just the first one. There isn't much we can do for the second one because some anti-race logic needs to replace the current reliance on the GIL.

[pieter]

ii) What about the following cases:

The only requirement is to not crash if an application makes concurrent __next__ calls.

Iterators aren't limited to one thread. They can be created in one, used in another, and later used in another. If needed, a user can (and should) manage contention by adding their own locks or some higher level threading API just like they would with any other shared resource.

[serhiy]

I just wasn't sure about the name and what module it should be in (itertools or threading).

Either module would be a reasonable choice. Are you happy with the name, serialize()?

serhiy-storchaka commented 1 month ago

At that time there was only C API for non-reentrant lock, and I was not sure that this is enough. Reentrant lock would make itertools depending on threading, and this does not look good to me. But I see that there is more private C API for locking, so this may be not needed.

As for the name, serialize does not tell anything about its behavior to me. I think you have reasons for this name. I only though about something ugly like thread_safe_iter.

The implementation is trivial:

class serialize(Iterator):
    def __init__(self, it):
        self._it = it
        self._lock = Lock()  # or RLock()?
    def __next__(self):
        with self._lock:
            return next(self._it)