ned14 / llfio

P1031 low level file i/o and filesystem library for the C++ standard
https://ned14.github.io/llfio/

Writes, Alignment, and Block Size #116

Closed zpyatt closed 1 year ago

zpyatt commented 1 year ago

Hi,

I'm writing files that contain a header and footer. Since the header is exactly 512 bytes, it's no problem. The footer, on the other hand, is variable length. The ICD for the file format doesn't really allow for extra padding; however, if I don't pad, the write hangs indefinitely.

File opened with:

fh_ = llfio::file(
      {},
      file_,
      llfio::file_handle::mode::write,
      llfio::file_handle::creation::always_new,
      llfio::file_handle::caching::none
    ).value();

void write_footer() {
  alignas(4096) footer ftr;
  ftr.date = fmt::format("{0:%y%b%d}", meta_.timestamp);
  ftr.time = fmt::format("{0:%H:%M:%S}", meta_.timestamp);
  ftr.time_epoch = fmt::format("{0:%d%b%y_%H:%M:%S}", meta_.timestamp);
  log_->info("Writing Ftr: Addr: {:x}\tSize: {}", (uint64_t) std::addressof(ftr), sizeof(ftr));
  auto bw = write(offset_, reinterpret_cast<const llfio::byte*>(&ftr), sizeof(ftr));
  log_->info("Wrote Ftr: Addr: {:x}\tSize: {}\tBytes Written: {}", (uint64_t) std::addressof(ftr), sizeof(ftr), bw);
}

My write function:

std::size_t write(std::size_t offset, llfio::byte const* bytes, std::size_t num_bytes) {
  auto bytes_w = fh_.write( offset, {{bytes, num_bytes}}).value();
  if(bytes_w != num_bytes) {
      log_->info("llfio: bytes tx'd ({0}) != bytes written ({1})", bytes_w, num_bytes);
  }
  return bytes_w;
}

I think the problem is "llfio::file_handle::caching::none", but I'm writing packets from 10-Gig Ethernet (I do buffer them into 16 MB buffers, aligned on 512-byte boundaries), and no caching seems to be the fastest means of writing. Suggestions?
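For what it's worth, here is a minimal sketch of the padding approach, assuming a 512-byte device block size, a block-aligned file offset, and that readers of the format tolerate trailing zero bytes. write_padded is a hypothetical helper wrapping the write() function above, not anything from LLFIO:

// Pad the final write up to the device block size so an uncached write is an
// exact block multiple issued from a block-aligned buffer.
// Needs <cstdlib> and <cstring>.
std::size_t write_padded(std::size_t offset, llfio::byte const* bytes, std::size_t num_bytes) {
  constexpr std::size_t block_size = 512;  // assumed device block size
  const std::size_t padded = (num_bytes + block_size - 1) / block_size * block_size;
  void* raw = std::aligned_alloc(block_size, padded);   // block-aligned scratch buffer
  if (raw == nullptr) return 0;
  std::memset(raw, 0, padded);                           // zero the padding tail
  std::memcpy(raw, bytes, num_bytes);                    // copy the footer to the front
  auto written = write(offset, static_cast<llfio::byte const*>(raw), padded);  // write() helper above
  std::free(raw);
  return written;
}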

Thanks, /ZRP

ned14 commented 1 year ago

This isn't strictly speaking an issue with LLFIO, but I'll try to dump some of my experience.

Generally speaking, I've found that for getting socket data into a file at maximum efficiency, the best fully portable approach is to create a mapped file handle, truncate it to the correct size, and then perform socket reads directly into the memory-mapped region. This is an optimised code path on every major OS (they shuffle whole memory pages without copying them) except for Linux, but it doesn't suck on Linux either. If you're Linux-only, look into SOCK_ZEROCOPY or splice(), which are two different ways of doing zero-copy socket i/o, each with tradeoffs and weird performance quirks. I've personally found both rather disappointing in real-world use cases, but the NICs I was using were not high end, and that probably meant the Linux kernel couldn't avoid doing a copy.
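To illustrate the shape of that approach, here is a minimal POSIX sketch: recv() directly into an mmap()ed, pre-truncated file. LLFIO's mapped_file_handle is the portable equivalent; the path, socket and size are placeholders and error handling is omitted:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

// Receive straight into a memory-mapped, pre-sized file: the kernel fills the
// mapped pages, and writeback to storage happens asynchronously behind us.
void capture_to_mapped_file(int sock, const char* path, std::size_t max_bytes) {
  int fd = ::open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
  ::ftruncate(fd, max_bytes);              // pre-size the file (sparse on most filesystems)
  auto* base = static_cast<unsigned char*>(
      ::mmap(nullptr, max_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
  std::size_t offset = 0;
  for (;;) {
    ssize_t n = ::recv(sock, base + offset, max_bytes - offset, 0);
    if (n <= 0) break;                     // connection closed or error
    offset += static_cast<std::size_t>(n);
    if (offset == max_bytes) break;        // reserved space exhausted
  }
  ::munmap(base, max_bytes);
  ::ftruncate(fd, offset);                 // trim the file to what actually arrived
  ::close(fd);
}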

To my best knowledge, even on very high end hardware complete avoidance of copying to RAM between socket and file isn't currently feasible. In years to come, Windows might be able to do it if and only if your SSD supports it, and your NIC supports it, and each supports direct DMA to the other and your OS kernel has the right drivers loaded, and the wind is blowing right on the day. I think usable support on Linux is some years away currently. Right now, best available is kernel bypass for storage and network (e.g. SPDK, onload efvi etc) but your socket data still needs to go into RAM at least once. One day eBPF might be able to talk to storage, then if you have root you can load in BPF to tell your high end NIC to directly DMA to your high end SSD no need to involve the CPU nor RAM at all. That's several years out however, and that root-only requirement makes that use case niche.

In any case, I don't think uncached i/o gains you much here, as you cannot avoid a memory copy. You might consider an append-only file; then the filesystem knows you'll be appending and will optimise appropriately, and after each socket write you might try a non-blocking barrier to hint to the OS that it should flush the appended data sooner rather than later. Alternatively, you can combine append-only with reads-only caching; then the filing system knows that all writes must go to storage immediately and may optimise accordingly.
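A rough sketch of that combination, reusing the handle construction style from the original post. The enumerator spellings and the default behaviour of barrier() here are from memory and should be checked against your LLFIO version; "capture.dat", buffer and buffer_bytes are placeholders:

// Append-only handle with only reads cached: writes reach storage promptly,
// and the filesystem knows every write lands at the end of the file.
auto fh = llfio::file(
      {},
      "capture.dat",
      llfio::file_handle::mode::append,
      llfio::file_handle::creation::always_new,
      llfio::file_handle::caching::reads
    ).value();

// After each buffered socket write: with an append handle the supplied offset
// is ignored (O_APPEND semantics), so just hand over the next buffer...
fh.write(0, {{buffer, buffer_bytes}}).value();

// ...and hint the kernel to push the appended data towards storage soon.
fh.barrier().value();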

In short, I'd try permuting whether you have the NIC or the filesystem do the memory copy; some filesystems will do much better if they know it's append-only, others do much better if write caching is disabled, still others do better with non-blocking barriers. Also, things can vary depending on kernel version, combination of NIC and SSD, and indeed which NUMA node things are on. You can see performance 4x better with certain combos for no obvious reason. I've even seen servers where performance was 2x better some weeks but not others, depending on random chance.

I appreciate all that will be dispiriting, but in the end this is why we get paid: to solve these sorts of problems. All I can suggest is trial-and-error testing: avoid memory copies, do your i/o to buffers which come from mmap() (i.e. map_handle), prefault in memory and preset file lengths where possible, and don't forget to see if discarding memory page contents before an i/o isn't a large improvement (some kernels will just swap memory PTEs if the destination memory pages are not dirty). Also, sometimes a 16 MB i/o quantum is actually slower than smaller buffers, because life is never easy :)
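As a small illustration of the "buffers from mmap(), prefaulted" point, using raw POSIX calls (MAP_POPULATE is Linux-specific; LLFIO's map_handle would be the portable route; this is a sketch of the idea, not a recommendation of exact flags):

#include <cstddef>
#include <sys/mman.h>

// Take an i/o buffer straight from the kernel's page allocator and fault all
// of its pages in up front, so filling it later takes no page faults.
void* make_prefaulted_buffer(std::size_t bytes) {
  void* p = ::mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,  // MAP_POPULATE prefaults (Linux)
                   -1, 0);
  return (p == MAP_FAILED) ? nullptr : p;
}

// Release with ::munmap(p, bytes). Between i/os, madvise(p, bytes, MADV_DONTNEED)
// discards the page contents without giving the address range back, which is one
// way to keep destination pages non-dirty.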

zpyatt commented 1 year ago

Niall,

Thank you for the very detailed response, still digesting it. You should consider publishing a book on high performance I/O, I'd buy it.

If I'm understanding correctly the infinite hang is indeed caused by "llfio::file_handle::caching::none", but it's NOT LLFIO related, it's just a consequence of un-cached I/O. I was afraid of that, just thought there might be a way to force it to flush or something.

I left out a lot of detail on my actual use case.

16 MB was chosen empirically, not sure why it seems to be best.

I did consider memory mapped files, but I don't understand them very well (pages, super-pages, etc...), in particular sizing the memory mapped file. I have no idea how big the files will need to be. Basically the user hits the record button, and it writes till they hit stop, even in a circular buffer if necessary. Once the file is stopped, then I can write both the header and footer, as these both have fields I don't know when I start writing.

For some reason with cached I/O I get packet slips, and my object pool can't keep up. Only with "llfio::file_handle::caching::none" was I able to make this work. Perhaps I need to look at it again. After re-reading the ICD I think I'll be fine padding the footer, just more work.

If things don't work out my next steps are: try upgrading the kernel to one with io_uring, or kernel bypass if I have to (really don't want to do that).

I really wish kernels provided a means of using DMA to route data between NIC, GPGPU, and Disks. I can't even use the GPU for my signal processing because the memory transfers are too costly.

Thanks, /ZRP

ned14 commented 1 year ago

> Thank you for the very detailed response, still digesting it. You should consider publishing a book on high performance I/O, I'd buy it.

I have received many offers now from publishers, all with somewhat attractive compensation. The problem is lack of time. Also, anything I write would become stale within a few years I suspect.

> If I'm understanding correctly the infinite hang is indeed caused by "llfio::file_handle::caching::none", but it's NOT LLFIO related, it's just a consequence of un-cached I/O. I was afraid of that, just thought there might be a way to force it to flush or something.

LLFIO is a dumb syscall wrapper. It doesn't do anything bar the bare minimum. Certainly around caching, it just passes through the flags to the kernel, and does nothing else.

Forcing flushes isn't necessary with uncached i/o, as writes don't return until all data is on disk. barrier() is how you flush data written into cache, or encourage the kernel to hurry up doing so asynchronously.

> I did consider memory mapped files, but I don't understand them very well (pages, super-pages, etc...),

Pages are 4KB unless you have DAX-mounted storage, then you may get 2MB.
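For illustration, the base page size can also be queried at runtime rather than assumed; sysconf() is the POSIX route (and I believe LLFIO exposes the same information via llfio::utils::page_size(), though check your version):

#include <cstdio>
#include <unistd.h>

int main() {
  // Ask the OS for its base page size instead of hard-coding 4096.
  long page = ::sysconf(_SC_PAGESIZE);
  std::printf("page size: %ld bytes\n", page);
}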

> in particular sizing the memory mapped file. I have no idea how big the files will need to be. Basically the user hits the record button, and it writes till they hit stop, even in a circular buffer if necessary. Once the file is stopped, then I can write both the header and footer, as these both have fields I don't know when I start writing.

On Linux, almost every filing system implements sparse storage. So truncate to something like 2^36 (it won't allocate any space) and write to an incrementing offset.

Generally, if you want to write to a mapped file of unknown size, you can be lazy and just map the entire 2^36 into memory. Might be worth a try to see if your kernel can keep up. A less lazy solution is to keep a queue of maps of portions of the file: fill each in turn, unmap filled portions, and map in new ones asynchronously. You get a lot of TLB shootdowns doing this, but it can work well for some i/o use cases.
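A bare-bones sketch of that less lazy variant, simplified to a single sliding window rather than a queue of maps (POSIX calls for illustration; kWindow, map_window and advance_window are made-up names; the file is assumed to have already been sparsely truncated to its maximum size):

#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Keep only a fixed-size window of the sparse, pre-truncated file mapped, and
// remap forward each time a window fills up.
constexpr std::size_t kWindow = 64 * 1024 * 1024;   // 64 MB mapped at any one time

unsigned char* map_window(int fd, std::size_t file_offset) {
  // file_offset must be page-aligned; kWindow is a multiple of the page size.
  return static_cast<unsigned char*>(
      ::mmap(nullptr, kWindow, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
             static_cast<off_t>(file_offset)));
}

void advance_window(int fd, unsigned char*& window, std::size_t& file_offset) {
  ::munmap(window, kWindow);                         // unmap the filled portion...
  file_offset += kWindow;
  window = map_window(fd, file_offset);              // ...and map in the next one
}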

> For some reason with cached I/O I get packet slips, and my object pool can't keep up. Only with "llfio::file_handle::caching::none" was I able to make this work.

A 10 Gbit sustained write isn't especially demanding for storage, so something must be amiss here. At work we have a four-wide RAID0 NVMe which we saturate with cached i/o; it certainly pushes 4-6 GB/sec (GB, not Gbit).

What's your filing system? ext4's delayed allocation introduces annoying write stalls. We run xfs here, properly configured with striping pools across the parallel devices.

We also cheat a bit: the API which ingresses packets is extremely lightweight and does almost nothing but add the packet to queues. A separate kernel thread drains those queues, does the processing and reorganisation of them, and puts the block to write into another queue. A further kernel thread drains that queue by appending to the file as fast as the kernel will take more data.

To make that work well, the key is to minimise how often the kernel threads interact, as every time they synchronise you lose performance. We have big chunks of data, so the threads only interact every few milliseconds. For some of the bigger data feeds we can write more than 2^31 bytes per append (which, incidentally, is buggy on some Linux filesystems)!
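A bare-bones illustration of that pipeline shape: producers hand whole blocks to a queue and a single writer thread drains it. The real thing would use big pre-allocated blocks and batched hand-off rather than a lock per block; write_block stands in for whatever actually appends to the file:

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

// One pipeline stage: producers push filled blocks, one consumer drains them.
// Interaction cost is amortised by handing over whole blocks, not packets.
struct block_queue {
  std::mutex m;
  std::condition_variable cv;
  std::deque<std::vector<std::byte>> blocks;
  bool done = false;

  void push(std::vector<std::byte> b) {
    { std::lock_guard lk(m); blocks.push_back(std::move(b)); }
    cv.notify_one();
  }
  void close() {
    { std::lock_guard lk(m); done = true; }
    cv.notify_all();
  }
  bool pop(std::vector<std::byte>& out) {
    std::unique_lock lk(m);
    cv.wait(lk, [&] { return !blocks.empty() || done; });
    if (blocks.empty()) return false;       // closed and fully drained
    out = std::move(blocks.front());
    blocks.pop_front();
    return true;
  }
};

// Writer thread body: append each drained block to the file.
// e.g. std::thread writer(writer_loop, std::ref(q), write_block);
void writer_loop(block_queue& q, const std::function<void(const std::vector<std::byte>&)>& write_block) {
  std::vector<std::byte> b;
  while (q.pop(b)) write_block(b);
}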

> If things don't work out my next steps are: try upgrading the kernel to one with io_uring, or kernel bypass if I have to (really don't want to do that).

If your kernel doesn't support io_uring, then it won't have many of the filesystem layer performance quirk fixes. Nothing to do with io_uring, they just fixed many corner case problems around the same time.

The work codebase works much, much better on kernels after 5.15 or so. Unfortunately prod is on 3.10, which has many performance bugs :(

It may be worth just trying a newer kernel and voila your existing code suddenly starts working well.

> I really wish kernels provided a means of using DMA to route data between NIC, GPGPU, and Disks. I can't even use the GPU for my signal processing because the memory transfers are too costly.

PCIe has allowed it for donkey's years now. I think even PCI allowed it. The problem has been device controller support: it's expensive to implement, and vendors didn't want to without getting paid for it.

Incidentally, if they ever do expose storage to eBPF or the equivalent, I'm fairly sure for writes you won't see the block storage emulation layer which the drive exposes to NVMe, but rather the underlying implementation layer. Which I don't believe is capable of appends currently, so your eBPF code would have to reimplement the driver's block storage emulation layer by hand. Chances are you'll run into the maximum eBPF program size quite quickly.

(i.e. all this just working without lots of effort is at least a decade away)

zpyatt commented 1 year ago

I'm semi-stuck with CentOS 7.5, which has kernel version 3.10, I think. I compiled a version 5 kernel from scratch and things worked worse; however, I was recently reading about all the "security features" (Spectre and Meltdown mitigations, etc.) that were added to kernel 5 and which really hurt performance, so maybe things will improve if I turn some of those off.

Also, I am using an ext4 filesystem. I thought it would be easier for our customers, but I'll have to look into those stalls.

I'm more a jack-of-all-trades, master of none. I've done real-time bare metal, VxWorks, Linux, Windows, etc. You just have to be an expert in so many areas these days. Kinda longing for VxWorks, but that certainly had its own unique challenges, and few nice libraries like ASIO or LLFIO (at least without significant work).