python / cpython

The Python programming language
https://www.python.org
Other
63.55k stars 30.45k forks source link

Add API to io objects for cache-only reads/writes #76742

Open njsmith opened 6 years ago

njsmith commented 6 years ago
BPO 32561
Nosy @pitrou, @vstinner, @giampaolo, @benjaminp, @njsmith, @jab, @vadmium, @YoSTEALTH, @xgid
Dependencies
  • bpo-32475: Add ability to query number of buffered bytes available on buffered I/O
  • Files
  • poc_aio.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.8', 'expert-IO'] title = 'Add API to io objects for cache-only reads/writes' updated_at = user = 'https://github.com/njsmith' ``` bugs.python.org fields: ```python activity = actor = 'jab' assignee = 'none' closed = False closed_date = None closer = None components = ['IO'] creation = creator = 'njs' dependencies = ['32475'] files = ['48656'] hgrepos = [] issue_num = 32561 keywords = [] message_count = 19.0 messages = ['310032', '310036', '310041', '310042', '310048', '310050', '310067', '310069', '310088', '319206', '319214', '319215', '354343', '354347', '354349', '354362', '354412', '354445', '354446'] nosy_count = 10.0 nosy_names = ['pitrou', 'vstinner', 'giampaolo.rodola', 'benjamin.peterson', 'stutzbach', 'njs', 'jab', 'martin.panter', 'YoSTEALTH', 'xgdomingo'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = None url = 'https://bugs.python.org/issue32561' versions = ['Python 3.8'] ```

    njsmith commented 6 years ago

    Background: Doing I/O to files on disk has a hugely bimodal latency. If the I/O happens to be in or going to cache (either user-space cache, like in io.BufferedIOBase, or the OS's page cache), then the operation returns instantly (~1 µs) without blocking. OTOH if the I/O isn't cached (for reads) or cacheable (for writes), then the operation may block for 10 ms or more.

    This creates a problem for async programs that want to do disk I/O. You have to use a thread pool for reads/writes, because sometimes they block for a long time, and you want to let your event loop keep doing other useful work while it's waiting. But dispatching to a thread pool adds a lot of overhead (~100 µs), so you'd really rather not do it for operations that can be serviced directly through cache. For uncached operations a thread gives a 100x speedup, but for cached operations it's a 100x slowdown, and -- this is the kicker -- there's no way to predict which ahead of time.

    But, io.BufferedIOBase at least knows when it can satisfy a request directly from its buffer without issuing any syscalls. And in Linux 4.14, it's even possible to issue a non-blocking read to the kernel that will only succeed if the data is immediately available in page cache (bpo-31368).

    So, it would be very nice if there were some way to ask a Python file object to do a "nonblocking read/write", which either succeeds immediately or else raises an error. The intended usage pattern would be:

    async def read(self, *args):
        try:
            self._fileobj.read(*args, nonblock=True)
        except BlockingIOError: # maybe?
            return await run_in_worker_thread(self._fileobj.read, *args)

    It would *really* help for this to be in the Python core, because right now the convenient way to do non-blocking disk I/O is to re-use the existing Python I/O stack, with worker threads. (This is how both aiofiles and trio's async file support work. I think maybe curio's too.) But to implement this feature ourselves, we'd have to first reimplement the whole I/O stack, because the important caching information, and choice of what syscall to use, are hidden inside.

    vadmium commented 6 years ago

    BufferedIOBase is an abstract class and, despite the name, doesn’t necessitate a buffer or cache. Adding methods and properties might break compatibility with third-party implementations, or get ugly with optional methods and multiple versions of the API. It seems like it would be better to extend the concrete APIs: io.BufferedReader, BufferedWriter and/or FileIO.

    In bpo-32475 there is a proposal to add a “getbuffn” method returning the amount of unread pending data in a reader object. Perhaps that would be enough for reading.

    I would support an similar API for BufferedWriter etc. Perhaps a property called “available_space”. You could check that and decide whether to do a direct non-blocking “write”, or launch a blocking “write” in the background.

    njsmith commented 6 years ago

    BufferedIOBase is an abstract class and, despite the name, doesn’t necessitate a buffer or cache

    Right, sorry, skimmed too fast.

    In bpo-32475 there is a proposal to add a “getbuffn” method returning the amount of unread pending data in a reader object. Perhaps that would be enough for reading.

    Ideally we would be able to do buffer-only reads through all the of the different read methods (read, readline, readinto, ...), and ideally we would be able to do it given objects at different points in the IO stack – so a buffer-only read on TextWrapper wrapped around a BufferedRandom wrapped around a FileIO, should propagate the buffer-only-ness all the way down the stack. I don't think getbuffn is enough to solve that? Or at least I don't see how in any simple way.

    Also, the immediate thing that spurred me to file this issue was learning that Linux has just added a non-blocking file read syscall. It would be pretty neat if we could expose that. If we had a way to propagate this down, then it could just be FileIO's implementation of the buffer-only flag.

    But yeah, actually doing that is complicated given the need to continue supporting existing implementations of these interfaces.

    Here's a straw man proposal: add a unbuffered_supported flag to the abstract IO interfaces. If missing or false, you can't do unbuffered reads/writes. If present and True, then you can pass a new unbuffered=True kw-only argument to their read/write calls.

    When (for example) TextWrapper.read needs to call its wrapped object's .read, it does a check like:

      if unbuffered:
          # This call is unbuffered, so we're only allowed to call
          # unbuffered methods.
          if not getattr(self.wrapped, "unbuffered_supported", False):
              # lower level doesn't support this, can't be done
              raise ...
          else:
              self.wrapped.read(..., unbuffered=True)
      else:
          # We're a regular call, so we can make a regular call
          self.wrapped.read(...)

    (I'm intentionally using "unbuffered" here to distinguish from regular POSIX "non-blocking", which is an API that's conceptually very similar but totally distinct in implementation. Especially since it's also possible to use the io stack with sockets/pipes in non-blocking mode.)

    pitrou commented 6 years ago

    Ideally we would be able to do buffer-only reads through all the of the different read methods (read, readline, readinto, ...),

    Hmm... We already have non-blocking support in BufferedIOReader, except it *doesn't work*. The problem is, the semantics mandated by readline() and even buffered read() don't work very well with non-blocking IO (see bpo-13322).

    So my opinion here is that only raw IO objects (FileIO) should have this functionality. People can build their own functionality on top of that (such as Tornado or asyncio do with their streams).

    njsmith commented 6 years ago

    Hmm, why did I use "unbuffered" as my term above? That's a very odd name. It's like, exactly the opposite of what we actually want. Clearly I did not think this through properly. Please pretend I said "buffer-only" instead, thanks.

    So my opinion here is that only raw IO objects (FileIO) should have this functionality. People can build their own functionality on top of that (such as Tornado or asyncio do with their streams).

    I guess I don't object to such functionality, but it would be useless to me personally. FileIO doesn't solve any problems I have with stream processing; the reason I care about this is for providing an async file I/O API. And all the async file IO APIs I know [1][2][3] have the public API of "just like regular files, but the blocking methods are async". 99% of the time, that means TextWrapper and BufferedStream. So I just don't see any way to get the advantages of this feature without either (a) adding buffer-only support to those layers, or (b) forking those layers into a 3rd-party library, and then adding buffer-only support.

    OTOH, it would be ok if in an initial implementation some methods like readline() simply always failed when called in buffer-only mode, since this would be a best-effort thing. (This is also a different from the non-blocking semantics discussion in bpo-13322, which is kind of scary. I don't want to deal with partial writes and reads and carrying crucial data in exceptions! I just want to know if the operation can trivially be done without blocking, and if not then I'll retry it in blocking mode.)

    [1] https://github.com/Tinche/aiofiles [2] https://trio.readthedocs.io/en/latest/reference-io.html#asynchronous-filesystem-i-o [3] https://curio.readthedocs.io/en/latest/reference.html#module-curio.file

    pitrou commented 6 years ago

    And all the async file IO APIs I know [1][2][3] have the public API of "just like regular files, but the blocking methods are async". 99% of the time, that means TextWrapper and BufferedStream. So I just don't see any way to get the advantages of this feature without either (a) adding buffer-only support to those layers, or (b) forking those layers into a 3rd-party library, and then adding buffer-only support.

    Yeah... The concrete problem is that there's already a poorly thought-out "non-blocking mode" that only partly works, and suddenly the code (which includes a lot of delicate, performance-critical C code... to give you an idea, even recently a consistency bug in truncate() was discovered) will have to be massaged to support another non-blocking mode of operation.

    So that's why I'm very cautious about integrating this into BufferedReader and friends.

    njsmith commented 6 years ago

    That's a reasonable concern. Do you think we can deprecate the existing broken non-blocking mode?

    pitrou commented 6 years ago

    Do you think we can deprecate the existing broken non-blocking mode?

    I would suggest asking on python-dev. I wouldn't mind it, but perhaps there are people using it.

    b9e3e125-c758-471e-8446-bb5afa99e9cd commented 6 years ago

    There will be lot of confusion using "buffered" & "unbuffered" terminology, since python already has BufferedIOBase (as mentioned by Martin). It would be more appropriate to create io.CachedIOBase and add non-blocking argument to open(blocking=False) to enable this feature.

    giampaolo commented 6 years ago

    os.preadv() and os.pwritev() are great but to my understanding one essential piece is still missing in order to effectively do non-blocking file IO and avoid using a thread pool: being notified when the file fd is readable/writable. select() and epoll() on Linux are not able to do that (according to them regular fds are always "ready"). As such one would repeatedly get EAGAIN and hog CPU resources. Am I missing something?

    njsmith commented 6 years ago

    The idea here is *not* to avoid using a thread pool in general. When the data is on disk, using a thread pool is (a) unavoidable, because of how operating system kernels are written, and (b) basically fine anyway, because the overhead added by threads is swamped by the cost of disk access. So for the foreseeable future, we're always going to be using a thread pool for actual disk access.

    But, if the data *is already in memory, so the read can succeed without hitting the disk, then using a thread pool is *not fine. Fetching data out of memory is super super cheap, so if that's all we're doing then using a thread pool adds massive overhead, in relative terms. We'd like to skip using the thread pool *specifically in this case*.

    So the idea would be: first, attempt a "buffer-only" read. If it succeeds, then great we're done and it was really cheap. Otherwise, if it fails, then we know we're in the data-on-disk case, so we dispatch the operation to the thread pool.

    giampaolo commented 6 years ago

    Gotcha. Thanks for clarifying.

    vstinner commented 5 years ago

    Background: Doing I/O to files on disk has a hugely bimodal latency. If the I/O happens to be in or going to cache (either user-space cache, like in io.BufferedIOBase, or the OS's page cache), then the operation returns instantly (~1 µs) without blocking. OTOH if the I/O isn't cached (for reads) or cacheable (for writes), then the operation may block for 10 ms or more.

    On Linux 4.14 and newer, Python 3.8 now provides os.preadv(os.RWF_NOWAIT):

    "Do not wait for data which is not immediately available. If this flag is specified, the system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read. If no bytes were read, it will return -1 and set errno to errno.EAGAIN."

    At least on recent Linux, it became possible to write a different code path for uncached data.

    vstinner commented 5 years ago

    The Linux kernel 5.1 also got a new "io_uring" for asynchronous I/O:

    "Ringing in a new asynchronous I/O API" https://lwn.net/Articles/776703/

    Linux 5.2: "The io_uring mechanism has a new operation, IORING_OP_SYNC_FILE_RANGE, which performs the equivalent of a sync_file_range() system call. It is also now possible to register an eventfd with an io_uring and get notifications when operations complete."

    Linux 5.3: "The io_uring mechanism has gained support for asynchronous sendmsg() and recvmsg() operations."

    vstinner commented 5 years ago

    I suggest to close the issue and move the discussion to a place to discuss asynchronous ideas.

    I'm sorry, but I don't understand what it is proposed here. I understand that Nathaniel wants to add something like a new "asynchronous" mode in the io module which would make FileIO, BufferedReader and TextIOWrapper behave differently.

    IMHO it's a bad idea. The io module is designed for blocking I/O syscalls. Not only the implementation, but also the API.

    Non-blocking I/O requires a platform specific implementation for best performances, but that requires something like an event loop, and so unusual programming style like asyncio "await ...".

    I dislike the idea of having a single module for synchronous (blocking) and asynchronous (non-blocking) operations. IMHO asynchronous programming is so complex that it requires to develop a whole new module.

    Maybe new module could reuse io code. Like implement an asynchronous using io.TextIOWrapper, but its underlying buffer would be feeded and controlled by asynchronous code.

    The Python bug tracker is usually used for bugs or to implement a concrete proposal. Here I understand that it's more an idea at the design stage. I don't think that it's the best place to discuss it. I suggest to open a discussion on python-ideas list or a list about asynchronous programming (I looked for "async-sig", but it seems like the list is gone?).

    njsmith commented 5 years ago

    The proposal is to be able to run io module operations in two modes: the regular one, and one where performing actual I/O is forbidden – so if they go down the stack and can fulfill the operation from some in-memory buffer, great, they do that, and if not, they raise an error.

    It turns out that this is actually the right primitive to enable async disk access. That's the only target use case, and there's no IO loop involved. If you wanted to keep async disk access separate from the io module, then what we'd have to do is to create a fork of all the code in the io module, and add this feature to it. Which doesn't seem like a good design :-).

    njsmith commented 5 years ago

    If you wanted to keep async disk access separate from the io module, then what we'd have to do is to create a fork of all the code in the io module, and add this feature to it.

    Thinking about this again today, I realized there *might* be another option.

    The tricky thing about supporting async file I/O is that users want the whole io module interface, and we don't want to have to reimplement all the functionality in TextIOWrapper, BufferedReader, BufferedWriter, etc. And we still need the blocking functionality too, for when we fall back to threads.

    But, here's a possible hack. We could implement our own version of 'FileIO' that wraps around a real FileIO. Every operation just delegates to the underlying FileIO – but with a twist. Something like:

    def wrapped_op(self, *args):
        if self._cached_op.key == (op, args):
            return self._cached_op.result
        if MAGIC_THREAD_LOCAL.io_is_forbidden:
            def cache_filler():
                MAGIC_THREAD_LOCAL.io_is_forbidden = False
                self._cached_op = self._real_file.op(*args)
            raise IOForbiddenError(cache_filler)
        return self._real_file.op(*args)

    And then in order to implement an async operation, we do something like:

    async def op(self, *args):
        while True:
            try:
                # First try fulfilling the operation from cache
                MAGIC_THREAD_LOCAL.io_is_forbidden = True
                return self._io_obj.op(*args)
            except IOForbiddenError as exc:
                # We have to actually hit the disk
                # Run the real IO operation in a thread, then try again
                await in_thread(cache_filler)
            finally:
                del MAGIC_THREAD_LOCAL.io_is_forbidden

    This is pretty convoluted: we keep trying the operation on the outer "buffered" object, seeing which low-level I/O operation it gets stuck on, doing that I/O operation, and trying again. There's all kinds of tricky non-local state here; like for example, there isn't any formal guarantee that the next time we try the "outer" I/O operation it will end up making exactly the same request to the "inner" RawIO object. If you try performing I/O operations on the same file from multiple tasks concurrently then you'll get all kinds of havoc. But if it works, then it does have two advantages:

    First, it doesn't require changes to the io module, which is at least nice for experimentation.

    And second, it's potentially compatible with the io_uring style of async disk I/O API. I don't actually know if this matters; if you look at the io_uring docs, the only reason they say they're more efficient than a thread pool is that they can do the equivalent of preadv(RWF_NOWAIT), and it's a lot easier to add preadv(RWF_NOWAIT) to a thread pool than it is to implement io_uring. But still, this approach is potentially more flexible than my original idea.

    We'd still have to reimplement open() in order to set up our weird custom IO stacks, but hopefully that's not *too* bad, since it's mostly just a bunch of if statements to decide which wrappers to stick around the raw IO object.

    My big concern is, that I'm not actually sure if this works :-).

    The thing is, for this to work, we need TextIOWrapper/BufferedReader/BufferedWriter to be very well-behaved when the underlying operation raises an exception. In particular, if they're doing a complex operation that requires multiple calls to the underlying object, and the second call raises an exception, they need to keep the first call's results in their buffer so that next time they can pick up where they left off. And I have no idea if that's true.

    I guess if you squint this is kind of like the non-blocking support in the io module – IOForbiddenError is like NonBlockingError. The big difference is that here, we don't have any "partial success" state at the low-level; either we do the operation immediately, or we punt and do the operation in a thread. Either way it completes as a single indivisible unit. So that might simplify things? From a quick skim of bpo-13322 it sounds like a lot of the problems with the current "non-blocking" mode come from these partial success states, but I haven't read it in detail.

    vstinner commented 5 years ago

    Here a proof-of-concept of an asynchronous io module reusing the existing blocking io module: it implements AsyncBufferedReader.readline() using existing _pyio.BufferedReader.readline().

    The approach seems to work, but only if bpo-13322 is fixed first: BufferedReader, TextIOWrapper & friends must return None if the underlying object ("raw" and "buffer" objects) return None.

    --

    My PoC uses 3 classes:

    At the first read, FileIOSandwich.read(n) call returns None, but it stores the request read size (n).

    If AsyncBufferedReader gets None, is calls FileIOSandwich._prepare_read() *asynchronously*/

    Then FileIOSandwich.read(n) is called again, and this time it no longer blocks, since data has been already read.

    --

    Since bpo-13322 is not fixed, my PoC uses _pyio since it's easier to fix. It needs the following fix for _pyio.BufferedReader.readline():

    diff --git a/Lib/_pyio.py b/Lib/_pyio.py
    index c1bdac7913..e90742ec43 100644
    --- a/Lib/_pyio.py
    +++ b/Lib/_pyio.py
    @@ -557,6 +557,8 @@ class IOBase(metaclass=abc.ABCMeta):
             res = bytearray()
             while size < 0 or len(res) < size:
                 b = self.read(nreadahead())
    +            if b is None and not res:
    +                return None
                 if not b:
                     break
                 res += b

    Example:

    $ ./python poc_aio.py poc_aio.py 
    data: b'import asyncio\n'

    Internally, _prepare_read() reads 8 KiB.

    vstinner commented 5 years ago

    I suggest to leave not attempt to put "async or "await" in the io module to keep it a "simple" as possible, but fix bpo-13322 (in io and _pyio modules).

    Instead, I suggest to write a new module only providing asynchronous methods, maybe even for open() and close(). But the new module can reuse the existing io module to avoid having to write complex algorithm like read-ahead, buffering, etc.

    Well, I'm not 100% sure that it's doable, since io is hiding many implementation details, there are complex issues like multithreading, locks, interlaced read and write operations, etc.

    Note: The io module doesn't fully suppored interlaced read and write :-) See bpo-12215 for example.