dvyukov opened 10 years ago
I don't think there is an asynchronous stat syscall available, and IIRC most event-based web servers take great pains to work around the stat(2)-taking-up-a-thread problem (e.g. with dedicated stat(2) thread pools). Similarly for readdir: is there a pollable version available? I don't know whether readdir/stat is contributing to the godoc problem, but I think they could become a problem if the GOPATH is large enough.
This continually bites me. I have an interface that has both network and filesystem implementations. The network one works great (firing off a bounded number of goroutines: say, 1000), but the filesystem implementation of the same interface kills the OS, and my code has to limit itself manually, which feels like a layering violation. The runtime should do the right thing for me. runtime/debug.SetMaxThreads sets the maximum number of threads Go uses before it blows up. If we can't do async filesystem IO everywhere (and I don't think we can), then I think we should have something like runtime/debug.SetMaxFSThreads, which controls the size of the pool of threads doing filesystem operations but blocks instead of crashing. That means that for the read/write syscalls we'd have to know which fds are marked non-blocking (network stuff) and which aren't. Or we put all this capping logic in pkg/os, perhaps opt-in: pkg os, func SetMaxFSThreads(n int).
(This came up in a conversation today and I wanted to make sure people don't start off with incorrect assumptions.)
I want to be really clear on this: there is no such thing as regular file or dir I/O that wouldn't wait for disk on a cache miss. I am not talking about serial ports or such here, but files and directories. Regular files are always "ready" as far as select(2) and friends are concerned, so technically they never "block", and "non-blocking" is the wrong word. But they may wait for actual disk IO to happen, and there is realistically no way to avoid that in the general case (in POSIX/Linux).
The network poller / epoll has nothing to contribute here. There is no version of read(2) and friends where the syscall would return early, without waiting for disk, if nothing is cached. Go really has very little to do there.
People have been talking about extending Linux to implement non-waiting file IO (e.g. http://lwn.net/Articles/612483/ ) but that's not realistic today.
I don't see Go having much choice beyond threads doing syscalls. The real question in my mind is whether there is a way to limit syscall concurrency, to avoid swamping the CPU/OS with too many threads, while still avoiding deadlocks.
And just to minimize chances of confusion, file AIO ("Async I/O") is something very different, and not applicable to this conversation. It's a very restrictive API (actually, multiple APIs), bypasses useful features like caching, and doesn't necessarily perform well at all.
What is wrong with io_submit? http://man7.org/linux/man-pages/man2/io_submit.2.html
@dvyukov io_submit is the Linux AIO API (as opposed to POSIX AIO). It's a separate codepath, dependent on the filesystem doing the right thing, and the implementations have been problematic; using AIO introduces a whole bunch of risk:
- The original implementation assumed O_DIRECT, and that legacy remains; non-O_DIRECT operation is even more problematic.
- O_DIRECT is not safe to use for generic file operations, because others accessing the file will go through the buffer cache.
- Without O_DIRECT, e.g. the generic version of io_submit falls back to synchronous processing.
- Some filesystems don't handle unaligned accesses well.
- In some circumstances (e.g. journaling details, file space not preallocated, etc.), io_submit has to wait until the operation completes instead of just submitting an async request; this tends to be more typical without O_DIRECT.
- The default limit for pending requests is only 128; after that io_submit starts blocking.
- Finally, io_submit only helps with the basic read/write workload; it does nothing for open(2), stat(2), etc.
I'm not saying it won't work, but I also would not be surprised if a change moving file IO to io_submit got reverted within a few months.
OK, then everybody should switch to Windows :)
Why are we talking about POSIX AIO here? Go's syscall package needs to use raw syscalls and shouldn't depend on any interface that is partly implemented in user space (e.g. glibc's POSIX AIO is partly implemented in user space).
CL https://golang.org/cl/36799 mentions this issue.
CL https://golang.org/cl/36800 mentions this issue.
Just to add some numbers to the discussion, we're facing this problem as well.
Running fio on an Amazon EC2 i3.large instance, we're able to get 64K IOPS on a 4K block size, using 8 jobs, file size = 4G for random reads. (Other times, I've seen it go close to 100K IOPS)
We created a small Go program to do the exact same thing using goroutines, and it doesn't budge above 20K IOPS. In fact, the throughput won't increase any further once the number of goroutines reaches the number of cores. This strongly suggests that Go is paying the cost of context switching, because it does a blocking read in every iteration of the loop.
Full Go code here: https://github.com/dgraph-io/badger-bench/blob/master/randread/main.go
go build . && ./randread --dir /mnt/data/fio --preads 6500000 --jobs <num-cores>
It seems like async IO is the only way to achieve high IO throughput in Go. SSDs are able to push more and more throughput with every release, so there has to be a way in Go to take advantage of that.
@manishrjain, what fio command line are you comparing against?
Btw, your benchmark has a global mutex (don't use rand.Intn in goroutines if you want performance). That would show up if you looked at contention profiles.
This is the fio command on my computer, and the output:
$ fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=0 --size=4G --numjobs=4 --runtime=60 --group_reporting
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-2.19
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=159MiB/s,w=0KiB/s][r=40.8k,w=0 IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=19525: Wed May 17 09:47:41 2017
read: IOPS=43.8k, BW=171MiB/s (179MB/s)(10.3GiB/60001msec)
slat (usec): min=2, max=13539, avg=90.15, stdev=98.95
clat (usec): min=1, max=27856, avg=2829.90, stdev=708.48
lat (usec): min=6, max=27873, avg=2920.05, stdev=724.33
clat percentiles (usec):
| 1.00th=[ 1512], 5.00th=[ 1816], 10.00th=[ 1992], 20.00th=[ 2224],
| 30.00th=[ 2416], 40.00th=[ 2608], 50.00th=[ 2800], 60.00th=[ 2992],
| 70.00th=[ 3184], 80.00th=[ 3408], 90.00th=[ 3696], 95.00th=[ 3920],
| 99.00th=[ 4448], 99.50th=[ 4704], 99.90th=[ 7200], 99.95th=[ 8896],
| 99.99th=[15168]
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 250=0.01%
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=10.11%, 4=85.93%, 10=3.92%, 20=0.03%, 50=0.01%
cpu : usr=1.10%, sys=10.23%, ctx=1464338, majf=0, minf=153
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwt: total=2627651,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=171MiB/s (179MB/s), 171MiB/s-171MiB/s (179MB/s-179MB/s), io=10.3GiB (10.8GB), run=60001-60001msec
And the corresponding Go program:
go build . && ./randread --dir ~/diskfio --jobs 4 --num 1000000 --mode 1
I switched from using the global rand to a local rand, and it doesn't show up in the block profiler or CPU profiler. fio is getting 43.8K IOPS. My program in Go is giving me ~25K, checked via sar -d 1 -p (the Go program reports less than what I see via sar, so there must be a flaw in my code somewhere).
@tv42 @rsc Sorry for jumping late in the dead/old discussion.
> I'm not saying it won't work, but I also would not be surprised if a change moving file IO to io_submit got reverted within a few months.
Would it be acceptable to expose the file IO semantics (DIO/AIO) and let programmers decide? It is a hard decision for Go because the compiler/runtime can't know the speed of the underlying storage media. But programmers, especially those writing purpose-built storage components in Go, should know the target storage better. As a motivating example, this would make it possible to write a program similar to fio in Go, instead of letting the compiler/runtime decide everything about file IO.
@bergwolf In a sense the semantics are already exposed via the golang.org/x/sys/unix package, which lets you do anything the system supports.
I don't see how it would make sense to expose these semantics in the os package. That would add a lot of complexity for the benefit of very, very few users. I've got nothing against rewriting the os package to use a different underlying mechanism, such as io_submit, while retaining the same API, if that makes sense. But I would vote against complicating the API.
New development: io_uring provides two ring buffers used to pass file I/O requests and completions between user space and the kernel. It might be promising. Only filesystem files are supported right now; it can't be used on sockets, pipes, etc. at this time. https://lwn.net/Articles/776703/
@tv42 There is some discussion of io_uring on #31908.
It's probably dated, but there's a nice little paper at https://pdfs.semanticscholar.org/d4a6/852f0f4cda6cf0431e04b81771eea08f88e2.pdf:
"An Attempt at Reducing Costs of Disk I/O in Go" by Sean Wilson, Riccardo Mutschlechner