dvyukov opened 10 years ago
I don't think there is an asynchronous stat syscall available, and IIRC most event-based web servers take great pains to work around the stat(2)-taking-up-a-thread problem (e.g. with dedicated stat(2) thread pools). Similarly for readdir: is there a pollable version available? I don't know whether readdir/stat is contributing to the godoc problem, but I think they could become a problem if the GOPATH is large enough.
This continually bites me. I have an interface that has both network and filesystem implementations. The network one works great (firing off a bounded number of goroutines: say, 1000), but the filesystem implementation of the same interface kills the OS, and my code has to limit itself manually, which feels like a layering violation. The runtime should do the right thing for me. runtime/debug.SetMaxThreads sets the maximum number of threads Go uses before it blows up. If we can't do async filesystem IO everywhere (and I don't think we can), then I think we should have something like runtime/debug.SetMaxFSThreads, which controls the size of the pool of threads doing filesystem operations but blocks instead of crashing. That means that for the read/write syscalls we'd have to know which fds are marked non-blocking (network stuff) and which aren't. Or we put all this capping logic in pkg/os, perhaps opt-in: pkg os, func SetMaxFSThreads(n int).
(This came up in a conversation today and I wanted to make sure people don't start off with incorrect assumptions.)
I want to be really clear on this: there is no such thing as regular file or dir I/O that wouldn't wait for disk on a cache miss. I am not talking about serial ports or such here, but files and directories. Regular files are always "ready" as far as select(2) and friends are concerned, so technically they never "block", and "non-blocking" is the wrong word. But they may wait for actual disk IO to happen, and there is realistically no way to avoid that in the general case (in POSIX/Linux).
The network poller / epoll has nothing to contribute here. There is no version of read(2) and friends where the syscall would return early, without waiting for disk, if nothing is cached. Go really has very little to do there.
People have been talking about extending Linux to implement non-waiting file IO (e.g. http://lwn.net/Articles/612483/ ) but that's not realistic today.
I don't see Go having much choice beyond threads doing syscalls. The real question in my mind is whether there is a way to limit syscall concurrency, to avoid swamping the CPU/OS with too many threads, while still avoiding deadlocks.
And just to minimize chances of confusion, file AIO ("Async I/O") is something very different, and not applicable to this conversation. It's a very restrictive API (actually, multiple APIs), bypasses useful features like caching, and doesn't necessarily perform well at all.
What is wrong with io_submit? http://man7.org/linux/man-pages/man2/io_submit.2.html
@dvyukov io_submit is the Linux AIO API (as opposed to POSIX AIO). It's a separate codepath, dependent on the filesystem doing the right thing, and the implementations have been problematic; using AIO introduces a whole bunch of risk:
- The original implementation assumed O_DIRECT, and that legacy remains; non-O_DIRECT operation is even more problematic.
- O_DIRECT is not safe to use for generic file operations, because others accessing the file will go through the buffer cache.
- Without O_DIRECT, e.g. the generic version of io_submit falls back to synchronous processing.
- Some filesystems don't handle unaligned accesses well.
- In some circumstances (e.g. journaling details, file space not preallocated, etc.), io_submit has to wait until the operation completes instead of just submitting an async request; this tends to be more typical without O_DIRECT.
- The default limit for pending requests is only 128; after that io_submit starts blocking.
- Finally, io_submit only helps with the basic read/write workload; it does nothing for open(2), stat(2), etc.
I'm not saying it won't work, but I also would not be surprised if a change moving file IO to io_submit got reverted within a few months.
OK, then everybody should switch to Windows :)
Why are we talking about POSIX AIO here? Go's syscall package needs to use raw syscalls and shouldn't depend on any interface that is partly implemented in user space (e.g. glibc's POSIX AIO is partly implemented in user space).
CL https://golang.org/cl/36799 mentions this issue.
CL https://golang.org/cl/36800 mentions this issue.
Just to add some numbers to the discussion, we're facing this problem as well.
Running fio on an Amazon EC2 i3.large instance, we're able to get 64K IOPS on a 4K block size, using 8 jobs, file size = 4G for random reads. (Other times, I've seen it go close to 100K IOPS)
We created a small Go program to do the exact same thing using goroutines, and it doesn't budge above 20K IOPS. In fact, the throughput won't increase any further once the number of goroutines reaches the number of cores. This strongly suggests that Go is paying the cost of context switching, because it does a blocking read in every iteration of the loop.
Full Go code here: https://github.com/dgraph-io/badger-bench/blob/master/randread/main.go
go build . && ./randread --dir /mnt/data/fio --preads 6500000 --jobs <num-cores>
It seems like async IO is the only way to achieve high IO throughput in Go. SSDs are able to push more and more throughput with every release, so there has to be a way in Go to take advantage of that.
@manishrjain, what fio command line are you comparing against?
Btw, your benchmark has a global mutex (don't use rand.Intn in goroutines if you want performance). That would show up if you looked at contention profiles.
This is the fio command on my computer, and the output:
$ fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=0 --size=4G --numjobs=4 --runtime=60 --group_reporting
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-2.19
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=159MiB/s,w=0KiB/s][r=40.8k,w=0 IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=19525: Wed May 17 09:47:41 2017
read: IOPS=43.8k, BW=171MiB/s (179MB/s)(10.3GiB/60001msec)
slat (usec): min=2, max=13539, avg=90.15, stdev=98.95
clat (usec): min=1, max=27856, avg=2829.90, stdev=708.48
lat (usec): min=6, max=27873, avg=2920.05, stdev=724.33
clat percentiles (usec):
| 1.00th=[ 1512], 5.00th=[ 1816], 10.00th=[ 1992], 20.00th=[ 2224],
| 30.00th=[ 2416], 40.00th=[ 2608], 50.00th=[ 2800], 60.00th=[ 2992],
| 70.00th=[ 3184], 80.00th=[ 3408], 90.00th=[ 3696], 95.00th=[ 3920],
| 99.00th=[ 4448], 99.50th=[ 4704], 99.90th=[ 7200], 99.95th=[ 8896],
| 99.99th=[15168]
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 250=0.01%
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=10.11%, 4=85.93%, 10=3.92%, 20=0.03%, 50=0.01%
cpu : usr=1.10%, sys=10.23%, ctx=1464338, majf=0, minf=153
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwt: total=2627651,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=171MiB/s (179MB/s), 171MiB/s-171MiB/s (179MB/s-179MB/s), io=10.3GiB (10.8GB), run=60001-60001msec
And the corresponding Go program:
go build . && ./randread --dir ~/diskfio --jobs 4 --num 1000000 --mode 1
I switched from using the global rand to a local rand, and it doesn't show up in the block profiler or CPU profiler. fio is getting 43.8K IOPS. My program in Go is giving me ~25K, checked via sar -d 1 -p (the Go program reports less than what I see via sar, so there must be a flaw in my code somewhere).
@tv42 @rsc Sorry for jumping late in the dead/old discussion.
> I'm not saying it won't work, but I also would not be surprised if a change moving file IO to io_submit got reverted within a few months.
Would it be acceptable to expose the file IO semantics (DIO/AIO) and let programmers decide? It is a hard decision for Go because the compiler/runtime can't know the speed of the underlying storage media. But programmers, especially those writing purpose-built storage components in Go, should know the target storage better. As a motivating example, this would make it possible to write a program similar to fio in Go, instead of letting the compiler/runtime decide everything about file IO.
@bergwolf In a sense the semantics are already exposed via the golang.org/x/sys/unix package, which lets you do anything the system supports.
I don't see how it would make sense to expose these semantics in the os package. That would add a lot of complexity for the benefit of very, very few users. I've got nothing against rewriting the os package to use a different underlying mechanism, such as io_submit, while retaining the same API, if that makes sense. But I would vote against complicating the API.
New development: io_uring provides two ring buffers used to pass file I/O requests and completions between user space and the kernel. It might be promising. Only filesystem files are supported right now; it can't be used on sockets, pipes, etc. at this time. https://lwn.net/Articles/776703/
@tv42 There is some discussion of io_uring on #31908.
It's probably dated, but there's a nice little paper at https://pdfs.semanticscholar.org/d4a6/852f0f4cda6cf0431e04b81771eea08f88e2.pdf:
"An Attempt at Reducing Costs of Disk I/O in Go" by Sean Wilson, Riccardo Mutschlechner