Network and Disk I/O blocking and file handles : optimizations

markpapadakis commented 8 years ago

Streaming from broker

We are using sendfile() to stream data from segment logs to clients (or to brokers acting as followers). This works great, and it is what Kafka does, but maybe we can do better: sendfile() can block if the data is neither in the disk cache nor on fast SSD storage. Because of the current single-threaded design, that in turn affects other producers and consumers. Even if we do wind up using multiple threads on the server, that still won’t guarantee mostly block-free operation.

NGINX and Netflix contributed an excellent new sendfile implementation for FreeBSD, which supports AIO, and which is really exactly what we’d love to be able to use. Specifically, that new sys call adds 2 new flags and refines an existing flag (SF_NOCACHE, SF_READAHEAD, SF_NODISKAIO). Unfortunately, this won’t become available on Linux anytime soon.

[ ] If we are going to support FreeBSD we absolutely need to take advantage of it

We could consider Linux AIO (libaio: link with -laio, include libaio.h, use io_submit() etc.), but that’d require opening files with O_DIRECT, which comes with a whole lot of restrictions. Even then, we’d have to transfer from the file to user space memory and then use write() to stream to the socket, or build a fairly elaborate scheme with pipes and the various *splice() and tee() calls. I am not sure the complexity is going to be worth it, or that we’d necessarily get more performance out of it, given the need for more sys calls and the need to copy or shuffle around more data.

Another alternative is to mmap() the file and then use the *splice() calls to transfer the mmapped file data to the socket. Many of those sys calls accept flags, and SPLICE_F_MOVE|SPLICE_F_NONBLOCK may come in handy. We would still need to resort to pipe trickery, but again, this may be worth it.

We should also consider lighttpd’s ‘asynchronous’ sendfile hack. Effectively what they do is:

create a buffer in /dev/shm and mmap() it
initiate an asynchronous read from the source file to the mapped buffer
wait until the data is ready
use sendfile() to send the data from /dev/shm to the network socket.

Indeed, the data is never copied into a userspace buffer; it is only moved between the kernel and the shared /dev/shm mapping. It requires Linux AIO (or POSIX AIO, or some other userspace thread-based I/O handoff scheme). The implementation can be found here.

All told, there are other options to consider, especially if we are going to support other OSes and platforms. This all comes down to reducing or even eliminating the likelihood of blocking sendfile() operations, so that other consumers/producers won’t block waiting for them. It may not really be worth it for now, but we should come back to this if and when it is.

Appending bundles to segments

We are using writev() to append data to segment log files. Because it’s an append operation, this is generally fast (although there are edge cases where it may not work like that). It should almost never block, but it might.

We can, again, rely on AIO (specifically, Linux AIO) for this, in order to minimize or eliminate the likelihood of a blocking writev(). The problem again is that it requires opening files with O_DIRECT, and the underlying filesystem must properly support AIO semantics. XFS seems to be the only safe choice; in fact, only Linux kernel 3.16+ includes an XFS implementation that properly deals with appends.

We could take into account the OS/architecture and filesystem, to optionally use AIO to do this.

File handles

If we are going to support many thousands of partitions, we need to consider the requirements. Specifically, we currently need 2 FDs for each partition (for the current segment’s log and index), and 1 FD for each immutable segment. So for a partition with 5 immutable segments, we’d need 5 + 2 = 7 FDs. Furthermore, we need to mmap() all index files, although those are fairly small.

We could maintain a simple LRU cache (or maybe look into alternative replacement policies) of all FDs for opened segment files, and limit it based on e.g. getrlimit() for RLIMIT_NOFILE. Whenever we’d get EMFILE from accept4(), open(), socket() etc., we’d ask the cache to close FDs and then retry. If the cache is empty, it means that we have used all FDs for sockets, and we should perhaps try to use setrlimit() to adjust RLIMIT_NOFILE.
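A minimal sketch of such a cache, in Python for brevity rather than in TANK's own code. `FdCache`, its method names and the "half of the soft limit" default are all assumptions made for illustration; only `getrlimit()`, `RLIMIT_NOFILE` and the EMFILE retry come from the idea above.

```python
import errno
import os
import resource
from collections import OrderedDict


class FdCache:
    """LRU cache of read-only file descriptors keyed by path (hypothetical helper)."""

    def __init__(self, capacity=None):
        if capacity is None:
            soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
            if soft == resource.RLIM_INFINITY:
                soft = 2048  # assumption: arbitrary fallback when there is no limit
            # Assumption: keep half of the soft limit free for sockets and other FDs.
            capacity = max(1, soft // 2)
        self.capacity = capacity
        self._fds = OrderedDict()  # path -> fd, least recently used first

    def evict_one(self):
        """Close the least recently used FD. Returns False if the cache is empty."""
        if not self._fds:
            return False
        _path, fd = self._fds.popitem(last=False)
        os.close(fd)
        return True

    def get(self, path):
        """Return an open FD for path, opening it (and evicting others) if needed."""
        fd = self._fds.get(path)
        if fd is not None:
            self._fds.move_to_end(path)  # mark as most recently used
            return fd
        while len(self._fds) >= self.capacity:
            self.evict_one()
        while True:
            try:
                fd = os.open(path, os.O_RDONLY)
                break
            except OSError as e:
                # EMFILE: per-process FD limit reached. Free a cached FD and retry;
                # if nothing is left to evict, give up and let the caller decide.
                if e.errno != errno.EMFILE or not self.evict_one():
                    raise
        self._fds[path] = fd
        return fd

    def close_all(self):
        while self.evict_one():
            pass
```

The same `evict_one()` hook could be called when accept4() or socket() fails with EMFILE; when it returns False, all FDs are held by sockets and raising RLIMIT_NOFILE is the remaining option.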

We are not going to need to solve this problem yet, but we should consider this for both performance reasons and for efficient support of thousands or even millions of partitions.

Warming up disk pages

We can use mincore(2) to determine which segment log pages are not currently in memory (block/file caches) and then 'touch' them so that they are paged in prior to accessing them. We should also look into fcntl(fd, F_NOCACHE), posix_fadvise(), readahead(), fadvise(), posix_fallocate() and fallocate(), and use them when and where appropriate.
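As a rough illustration of the "ask the kernel to page it in ahead of time" idea, here is a sketch using posix_fadvise() with POSIX_FADV_WILLNEED. It is written in Python, whose `os` module wraps posix_fadvise() but, as far as I know, not mincore(2) or readahead(2), so this shows only one of the calls listed above. `warm_range` is a hypothetical helper name.

```python
import os


def warm_range(fd, offset, length):
    """Hint that [offset, offset + length) will be needed soon.

    The kernel may start reading those pages into the page cache in the
    background. This is only a hint: it does not guarantee the pages are
    resident afterwards. Returns True if the hint was issued.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # platform without posix_fadvise()
    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    return True
```

A caller would issue the hint as soon as it knows which segment range a consumer will need, and only later start the actual transfer.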

markpapadakis commented 8 years ago

Now using readahead() on Linux. I still need to measure and quantify the cost/impact and gains though.

markpapadakis commented 8 years ago

According to the Linux Kernel implementation, readahead() will simply walk all pages in the requested file range, look up each page in the mapping's radix tree, and if it is not already there (not cached), schedule a read-ahead for it.

This means that the cost is minimal, other than iterating the pages and looking each one up in the radix tree -- which seems like a good tradeoff.

markpapadakis commented 8 years ago

More measurements

writev()

It takes 0.081s to writev() 95MBs. This means we can do, in theory, about 1,172MBs / second (95 / 0.081) if we were to write continuously in a single thread. Obviously that's not going to happen and we can't hit that kind of rate, but it's nice to know the theoretical maximum, at least on origin.

readahead()

for 32MBs:

if all pages already available in cache, it takes 2us
if just freed pagecache, dentries and inodes via echo 3 > /proc/sys/vm/drop_caches, it takes 149us
if just freed pagecache via echo 1 > /proc/sys/vm/drop_caches, it takes 56us

sendfile()

if all pages already available in cache, for 2.8MBs, it takes 300us
if just freed page cache, for 15MBs, it takes 0.11s
if just freed page cache, and didn't use readahead() prior to sendfile(), for 23MBs it takes 0.173s
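To put the writev() and sendfile() numbers above on the same scale, the per-call figures can be converted into transfer rates. This only redoes the arithmetic on the measurements listed here; the labels are shorthand of mine.

```python
def mb_per_second(megabytes, seconds):
    """Average transfer rate for a single measured call."""
    return megabytes / seconds


measurements = {
    "writev, 95MBs in 0.081s": mb_per_second(95, 0.081),
    "sendfile, cached, 2.8MBs in 300us": mb_per_second(2.8, 300e-6),
    "sendfile, cold cache, 15MBs in 0.11s": mb_per_second(15, 0.11),
    "sendfile, cold cache, no readahead, 23MBs in 0.173s": mb_per_second(23, 0.173),
}

for label, rate in measurements.items():
    print(f"{label}: {rate:.0f} MBs/s")
```

These are single-run figures from one machine, so they are best read as orders of magnitude rather than as benchmarks.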

markpapadakis commented 7 years ago

XFS serializes all I/O when it sees a size-changing operation, like an append. This slows down ScyllaDB's writes, because files can only be written with a concurrency of 1. Enabling file_output_stream's write-behind mode causes xfs to block in the reactor thread, destroying performance.

This doesn't affect Tank because of the single-threaded design, but it's good to keep this in mind.

From code comments:

// The Linux XFS implementation is challenged wrt. append: a write that changes
// eof will be blocked by any other concurrent AIO operation to the same file, whether
// it changes file size or not. Furthermore, ftruncate() will also block and be blocked
// by AIO, so attempts to game the system and call ftruncate() have to be done very carefully.
//
// Other Linux filesystems may have different locking rules, so this may need to be
// adjusted for them.

markpapadakis commented 7 years ago

As of 0925e893958f4ec870e5803c8ddbd94e58b2a8ec, Tank will be more fair to clients by trying to reduce the time it blocks in sendfile(), which happens when a consumer requests a very large payload (e.g. dozens of MBs) and/or when reading from the underlying filesystem is very slow (low transfer rates).

Instead of executing a single sendfile() for, say, 32MBs, it will break this down into 512K requests, and once it has transferred 4MBs it will return control to the main I/O loop, thereby giving other connections/clients a chance, and later resume the transfer from where it left off. By breaking the single sendfile() call down into multiple calls, there is a better chance for the kernel to have paged in the contents in time for subsequent sendfile() calls: we readahead() when we receive a consume request, instructing the kernel to page in data in the background, before we commence streaming data.

This helps a lot when you have lots of connections/clients and you want to be fair to them, so that 1 or 2 clients asking for dozens of MBs won't block processing of incoming requests or transfer of outgoing responses.
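The chunking described above can be sketched as follows. This is an illustration in Python, not Tank's code: `send_slice` and its parameters are hypothetical names, and only the 512K chunk size and the 4MB per-turn budget are taken from the description.

```python
import os

CHUNK = 512 * 1024          # bytes per sendfile() call
BUDGET = 4 * 1024 * 1024    # bytes per turn before yielding to the I/O loop


def send_slice(sock_fd, file_fd, offset, end, chunk=CHUNK, budget=BUDGET):
    """Send part of the file range [offset, end) to a non-blocking socket.

    Issues sendfile() in `chunk`-sized pieces and stops once `budget` bytes
    have gone out, the socket buffer is full, or the range is exhausted.
    Returns the new offset, so the caller can resume on its next turn.
    """
    sent_this_turn = 0
    while offset < end and sent_this_turn < budget:
        want = min(chunk, end - offset, budget - sent_this_turn)
        try:
            n = os.sendfile(sock_fd, file_fd, offset, want)
        except BlockingIOError:
            break  # socket not writable right now; wait for the next event
        if n == 0:
            break  # unexpected end of file
        offset += n
        sent_this_turn += n
    return offset
```

The event loop would store the returned offset with the connection and call `send_slice` again when the socket becomes writable or when the connection's turn comes around.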

There is no perfect solution though. We could dedicate one thread per connection, so that when the kernel puts a connection/client thread to sleep because it blocks for I/O, another would be scheduled in its place, but that'd result in other problems. You may want to read Seastar's tutorial for why this is not a good idea.

We could have used AIO, and it could have worked, except that there are caveats. In practice, we'd be required to use O_DIRECT access and XFS (though other filesystems are catching up). That means we'd bypass the kernel cache and we'd need our own cache. Furthermore, the more data you need to read asynchronously, the longer it takes to set up the request -- that is, the time it takes to io_submit() is proportional to the number of bytes you request. So, while this may work great for, say, a database, it's not a good fit for a high-performance streaming engine.

We could have used a thread pool and managed multiple clients per thread, which could mitigate the effect, but at the cost of complexity and state serialization overhead.

Ultimately, short of the Linux Kernel introducing new APIs that in effect signal the userspace application before a thread is about to be blocked, thereby giving it a chance to e.g. yield to an application-managed fiber/green-thread, and conversely signal it again when the blocking operation has completed and the thread will be made runnable again, the best way overall would be to just port Netflix’s sendfile() improvements to Linux (see earlier comments).

markpapadakis commented 7 years ago

Now using another heuristic: if the total time spent in try_send() (specifically, we keep track of the total time spent in sendfile()) exceeds a fixed limit, we abort early. Also, we initially sendfile() 128k and then switch to 640k / iteration, and the transmit threshold (total amount of bytes that can be sent in the current try_send() call) has been raised to 24MBs.

It turns out that if the data to be accessed is missing from the kernel VM caches, e.g. after

free && sync && echo 3 > /proc/sys/vm/drop_caches && free

it takes 1000s of microseconds to sendfile() a few KBs worth of data, whereas if the data is in cache, it takes no more than 250-300 microseconds. So having a fixed upper limit (currently, 3k microseconds) helps identify situations where data is not in cache and sending however much was requested would require spending too long in try_send(), at the expense of all other active requests. Also, this further increases the likelihood of readahead() having paged in the required data by the time the next try_send() call attempts to sendfile() again.
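A sketch of that heuristic, again in Python and not the actual try_send(). `try_send_budgeted`, `send_chunk` and `clock` are hypothetical names; `send_chunk(offset, count)` stands in for a sendfile() call and returns the number of bytes sent. The 128k / 640k sizes, the 24MB threshold and the 3k microsecond limit are the figures quoted above.

```python
import time


def try_send_budgeted(send_chunk, offset, end, *,
                      first_chunk=128 * 1024,
                      chunk=640 * 1024,
                      max_bytes=24 * 1024 * 1024,
                      time_budget_us=3000,
                      clock=time.perf_counter):
    """Send from [offset, end) until a byte or time budget is exhausted.

    Only the time spent inside send_chunk() is counted against the budget.
    A slow first call (data not in cache) therefore ends the turn early and
    leaves the rest for a later call. Returns the new offset.
    """
    sent = 0
    spent_us = 0.0
    size = first_chunk  # start small, so a cache miss is detected cheaply
    while offset < end and sent < max_bytes and spent_us < time_budget_us:
        want = min(size, end - offset, max_bytes - sent)
        started = clock()
        n = send_chunk(offset, want)
        spent_us += (clock() - started) * 1e6
        if n <= 0:
            break
        offset += n
        sent += n
        size = chunk  # switch to the larger chunk size after the first call
    return offset
```

Passing the clock in as a parameter is only there to make the sketch easy to exercise with a fake timer; a real loop would read a monotonic clock directly.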

[ ] Consider whether it's a good idea to compute a time budget based on the connection that we need to first wake up (based on max wait) and the load of the system (number of connections, etc).

markpapadakis commented 6 years ago

Thread Pools in NGINX: NGINX is now using thread pools so that sendfile() won't block the I/O thread, because they too figured out that when it needs to block to page in blocks, it can kill performance. I should consider this, and expose it as a tank daemon option.

markpapadakis commented 6 years ago

Serving 100 Gbps from an Open Connect Appliance : A great write-up mostly specific to FreeBSD sendfile() impl. and kernel semantics, but likely relevant to what we do here as well.

markpapadakis commented 6 years ago

The new preadv2 and pwritev2 syscalls, which first appeared in Linux Kernel 4.6, are extremely powerful when the RWF_NOWAIT flag is used -- though that flag only made it into more recent Linux Kernel releases.

See more here: https://news.ycombinator.com/item?id=15412534 It effectively means we could use preadv2 to read as much as we can, instead of using sendfile(), and if there is no more data in the cache, we'd get EWOULDBLOCK and would just hand the read off to a background thread pool which would do it for us, thereby not blocking the main network I/O thread (we could use fibers/coros to simplify things as well).
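A minimal sketch of that "try the cache first, otherwise hand off" pattern, using Python's `os.preadv()` (which exposes preadv2's flags on Linux) and a thread pool. `read_nowait` and `read_range` are hypothetical helpers, and treating EOPNOTSUPP/ENOSYS/EINVAL as "fall back to the pool" is my assumption about kernels or filesystems without RWF_NOWAIT support.

```python
import errno
import os
from concurrent.futures import Future, ThreadPoolExecutor

# Absent on platforms or Python builds without preadv2-style flags.
_RWF_NOWAIT = getattr(os, "RWF_NOWAIT", None)


def read_nowait(fd, offset, length):
    """Return bytes served from the page cache, or None if the read would block.

    The result may be shorter than `length` if only part of the range is cached.
    """
    if _RWF_NOWAIT is None or not hasattr(os, "preadv"):
        return None
    buf = bytearray(length)
    try:
        n = os.preadv(fd, [buf], offset, _RWF_NOWAIT)
    except BlockingIOError:
        return None  # EAGAIN/EWOULDBLOCK: data is not in the cache
    except OSError as e:
        # Assumption: treat "flag not supported here" like a cache miss.
        if e.errno in (errno.EOPNOTSUPP, errno.ENOSYS, errno.EINVAL):
            return None
        raise
    return bytes(buf[:n])


def read_range(fd, offset, length, pool):
    """Return a Future with the data, completed immediately on a cache hit."""
    data = read_nowait(fd, offset, length)
    if data is not None:
        done = Future()
        done.set_result(data)
        return done
    # Cache miss: let a background thread do the (possibly blocking) read.
    return pool.submit(os.pread, fd, length, offset)
```

Note that this reads into userspace and therefore gives up sendfile()'s zero-copy path; the data would then be written to the socket with a regular write or send.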

Will probably support this soon.

markpapadakis commented 4 years ago

It was suggested that sendfile() would return EAGAIN/EWOULDBLOCK if the file was opened with O_NONBLOCK, and sendfile() would need to block. Alas, sendfile() still blocks. It would have been fantastic if this worked.

giampaolo commented 4 years ago

The silver bullet in terms of async file IO on Linux nowadays appears to be io_uring. Another interesting thing is KTLS, which basically allows you to use sendfile with SSL sockets and do the zero-copy + encryption in kernel space. It's unclear whether io_uring and KTLS can be mixed together.

markpapadakis commented 4 years ago

@giampaolo I am familiar with io_uring; in fact, I've been following its development since it was announced, I have been experimenting with it for some time, and I discussed ways to implement a zero-copy sendfile alternative via io_uring (for TANK, and for other projects). I suspect TANK will support io_uring for disk (and later, network) I/O soon.

I wasn't familiar with KTLS -- this looks great. Currently, TANK does not support TLS connections (because we have no need for that, and no one has requested it either), but it should be pretty trivial to support TLS connections if such a need arises.

giampaolo commented 4 years ago

I discussed ways to implement a zero-copy sendfile alternative via io_uring

Is that public? I'd be interested in taking a look at it as I want to try to integrate it in an async FTP server. Extra: am not sure if this is useful for TANK (note: I'm not a user of TANK, I just ended up here by accident), but FYI with splice() you can speed up file receiving (around 10-15%).

markpapadakis commented 4 years ago

@giampaolo no, the discussions are private. The ideas discussed didn’t pan out, but I will consider other alternatives. sys_sendfile (or a similar zero-copy opcode anyway) will eventually be supported in io_uring.

Considered splice() but in the end it wasn’t worth it. You may want to study haproxy’s implementation if you are interested in similar ideas.

BorisPis commented 4 years ago

The silver bullet in terms of async file IO on Linux nowadays appears to be io_uring. Another interesting thing is KTLS, which basically allows you to use sendfile with SSL sockets and do the zero-copy + encryption in kernel space. It's unclear whether io_uring and KTLS can be mixed together.

No, this is not possible as of today.

phaistos-networks / TANK