NIO for files - Githubissues

1st1 commented 9 years ago

I've seen a few discussions around lack of non-blocking file io support in asyncio. Normally, we assume that disk io is fast and doesn't block, but there are use cases where it can block for a long time:

HDDs are still very common and widely used. Huge ones can be quite slow at random access.
Network filesystems - sshfs, nfs, aws ebs etc.

A workaround is to use a threadpool for file io, but that's suboptimal since it's hard to interact with asyncio code.

libuv library (the one that NodeJS is built on) uses threadpools internally, to provide non-blocking file io.

I see three options for asyncio:

Keep the status quo. Advise users to either assume that disk io never blocks, or use threadpools.
Provide a function that will return a pair of (StreamReader, StreamWriter). Loop will use a threadpool to do the actual IO.
Provide a file object with a similar API to FileIO, but with coroutines for most of its methods. Loop will use a threadpool to do the actual IO.

If (2) or (3) is acceptable, I volunteer to draft the initial implementation.

1st1 commented 9 years ago

See also #213.

gvanrossum commented 9 years ago

I agree it would be good to address this. But I would like to see a sketch of an implementation before deciding on an API.

jettify commented 9 years ago

It would be very helpful! Right now in our project, we basically wrap all file APIs with: loop.run_in_executor(). But most anoying is not wrapping all calls but lack of support of third party libraries.

asvetlov commented 9 years ago

See also aiofiles library: https://github.com/Tinche/aiofiles

ludovic-gasc commented 9 years ago

We use aiofiles for now, however, we never benchmark to have an idea if the most efficient is the threadpool pattern used by aiofiles or classical sync API provided by Python.

As usual, certainly depends on the context: size, number of files, read or write, physical hard drive or virtualized environment where the hard drive is in fact most on the ton the network...

For filesystems like NFS, the answer should be more obvious.

If somebody does a benchmark, I'm interested in.

1st1 commented 9 years ago

@gvanrossum What do you think about https://github.com/Tinche/aiofiles? Its approach is very lean -- use singledispatch to wrap all kind of objects that open() returns, and wrap them with proxies that offload all blocking operations to the loop's executor.

The main advantage is that the API is essentially the same as regular Python io, you just have to use await or yield from for most calls. Do you think this is a right way to go?

gvanrossum commented 9 years ago

Is it any good? Have you benchmarked it? What platforms does it support? Is anyone using it? There don't seem to be any issues filed against the project, ever, which makes me wonder if it's been battle-tested. The API design (as far as I can glance from the README) looks fine. I'm not sure I like the singledispatch-based implementation.

On Mon, Oct 26, 2015 at 2:26 PM, Yury Selivanov notifications@github.com wrote:

@gvanrossum https://github.com/gvanrossum What do you think about https://github.com/Tinche/aiofiles? Its approach is very lean -- use singledispatch to wrap all kind of objects that open() returns, and wrap them with proxies that offload all blocking operations to the loop's executor.

The main advantage is that the API is essentially the same as regular Python io, you just have to use await or yield from for most calls. Do you think this is a right way to go?

— Reply to this email directly or view it on GitHub https://github.com/python/asyncio/issues/279#issuecomment-151289476.

--Guido van Rossum (python.org/~guido)

1st1 commented 9 years ago

Is it any good?

It's a fine little library, I view it more as a proof of concept.

Have you benchmarked it?

No. I doubt it will perform exceptionally, given that it runs in a pure Python threadpool. OTOH its main objective is to be non-blocking.

If implemented in C, the file IO thread can release the GIL, hence providing a bit better performance. Or someone might implement asyncio loop on top of libuv, reusing their low-level non-blocking file IO APIs.

What platforms does it support?

I think any platform that asyncio runs on, since it's just a thin wrapper of io module.

Is anyone using it?

I don't think it's popular. People don't usually think that file IO can block, hence most of the current software simply ignores the problem. The ecosystem around asyncio is evolving very fast, and I think that we should provide some built-in API for file IO so that everyone will use it.

There don't seem to be any issues filed against the project, ever, which makes me wonder if it's been battle-tested. The API design (as far as I can glance from the README) looks fine. I'm not sure I like the singledispatch-based implementation.

To be clear: I don't propose to include this library "as is". It's a relatively small library, we can implement it from scratch and design it specifically to fit in the asyncio core, this is now clear. Do you think it's a good idea to mimic the existing io module (and open() function) API?

vstinner commented 9 years ago

2015-10-27 12:44 GMT+09:00 Yury Selivanov notifications@github.com:

If implemented in C, the file IO thread can release the GIL, hence providing a bit better performance. Or someone might implement asyncio loop on top of libuv, reusing their low-level non-blocking file IO APIs.

I don't understand this point. If it's a thin wrapper calling Python io module in threads: the Python io modules does release the GIL for any I/O.

What platforms does it support?

I think any platform that asyncio runs on, since it's just a thin wrapper of io module.

aiofiles would benefit of specialized implementations especially for Windows. On Windows, thanks to the IOCP, we can completly avoid threads. I'm not sure that current aiofiles design is prepared to support specialized implementations.

Linux might really support O_NONBLOCK on regular files in the kernel. Current on Linux, there are already two implementations for async I/O on regular files: Linux Kernal "aio" API, POSIX "aio" API "simply" wrapping all calls in threads (probably a thread pool). Sadly, the Linux kernel "aio" API has a well bottleneck on is "select"-like call to wait for I/O completion. Sadly again, it doesn't work with regular select/poll/epoll.

To be clear: I don't propose to include this library "as is". It's a relatively small library, we can implement it from scratch and design it specifically to fit in the asyncio core, this is now clear. Do you think it's a good idea to mimic the existing io module (and open() function) API?

I would prefer to keep aiofiles on PyPI until it is mature again, and so contribute to aiofiles. You may keep this issue open if you want to keep a central point to collect information about async I/O on regular files.

Reminder: I also proposed to remove asyncio from CPython stdlib because I don't consider it mature enough :-) For practical reasons, it's not easy to update asyncio using "pip install asyncio" (you have to modify sys.path).

1st1 commented 9 years ago

I don't want to focus this particular discussion on performance questions -- I think that the main issue is to arrive at some API design that would make sense in asyncio context. Making it fast is another thing.

I would prefer to keep aiofiles on PyPI until it is mature again, and so contribute to aiofiles. You may keep this issue open if you want to keep a central point to collect information about async I/O on regular files.

The point is to have non-blocking file IO API as part of asyncio. I think that every other popular async framework has it (nodejs, gevent, etc). aiofiles isn't popular now and likely won't be popular ever simply because it doesn't have good exposure. People will design their programs caring about network IO (because asyncio has APIs for it), and ignoring potential issues with file IO (because asyncio never mentions it). Simple logging to a file shared over network (or just slow hdd) can halt your http server for long period of time.

vstinner commented 9 years ago

2015-10-27 13:46 GMT+09:00 Yury Selivanov notifications@github.com:

I don't want to focus this particular discussion on performance questions -- I think that the main issue is to arrive at some API design that would make sense in asyncio context. Making it fast is another thing.

Do you want to provide an API with a "basic" implementation and let others to implement faster implementations (like using IOCP on Windows)?

1st1 commented 9 years ago

Do you want to provide an API with a "basic" implementation and let others to implement faster implementations (like using IOCP on Windows)?

Sure, let's define some core API on event loops. By default those will be implemented with threadpools, but nothing will stop from fine-tuning and optimizing them for specific platforms.

I also want some high-level convenient API (like readers/writers or file objects similar to io module) so that people can start using it.

gvanrossum commented 9 years ago

I'm mostly with Victor -- it seems too soon to bless anything. I wonder if the only thing that people really want is sendfile(), which aiofiles supports only on Linux?

python / asyncio

NIO for files #279