quark-zju opened this issue 4 years ago
For what it's worth, I'm also interested in a `Bytes`-like interface to a memory-mapped region.
Hi @spl, FYI I ended up reinventing it. It's not a lot of code if `Arc<dyn Trait>` is used.
@quark-zju Nice! Your `BytesOwner` is a neat approach. I drafted something using the following:

```rust
enum Buf {
    Mmap(Mmap),
    Static(&'static [u8]),
    Vec(Vec<u8>),
}

pub struct SharedBuf {
    buf: Arc<Buf>,
    offset: usize,
    len: usize,
}
```
Of course, this wasn't intended for general-purpose use as a library.
Could `bytes` provide something like the following from @quark-zju's library?

```rust
impl Bytes {
    pub fn from_owner(value: impl BytesOwner) -> Self { ... }
}

pub trait BytesOwner: AsRef<[u8]> + Send + Sync + 'static {}
```
I personally think there are very limited types that meaningfully fit the vtable interface (I can hardly think of another aside from the mmap one).
Here's some food for thought. Moonfire NVR currently uses a `bytes::Buf` implementation that's a wrapper around `reffers::ARefss<'static, [u8]>`. It puts these into a `hyper::body::HttpBody` implementation and sends them to `hyper`. It does two things with this: passing `mmap`-backed chunks, and more complex shared ownership to minimize allocations.

On the mmap part: it does this today, but I'm not really sure it's a good idea because of the page fault problem. As lucio-rs suggested here, it might be better to simply copy stuff from the mmaped region to a "normal" bytes rather than having a mmap-backed bytes. My current approach of not dealing with this at all means that the tokio threads may stall on page faults, which is not so great. It's really not good enough to have a bytes object for a mmap()ed region which is only special for loading and unloading, not also when reading from it.
tl;dr version: there's one large object `moonfire_nvr::mp4::File` which lives for the entire response, is reference-counted all together via `struct File(Arc<FileInner>)`, and has a small number of `Vec`s which turn into a larger number of HTTP response chunks. I'd need more allocations (`bytes::bytes::Shared` ones) to duplicate what it does with `bytes::Bytes` as it exists today.
Some detail, in case you're curious: mp4.rs constructs a `.mp4` file on-the-fly, including:

- `buf: Vec<u8>`, which has a bunch of dynamic file metadata (including stuff like lengths of structures, timestamps, etc.) that gets stuffed between other parts of the file. It's nice to have that backed by the overall `mp4::File` rather than have a separate `bytes::bytes::Shared` allocation.
- index data (`stts` = sample time table, `stsz` = sample size table, `stss` = sync sample table). It's most efficient to generate all the indexes for a segment at once (but only if the HTTP request actually asks for a byte range that overlaps with at least one of them) and stuff them into one `Vec`. It's nice to not need another per-segment allocation for a `bytes::bytes::Shared`.

Here's the `Debug` output on a typical `mp4::File`. This structure gets generated on every HTTP request, then chunks may or may not actually get generated and served depending on the byte range the browser asked for.
> My current approach of not dealing with this at all means that the tokio threads may stall on page faults, which is not so great.

This effect may honestly not be horrible; on modern hardware it should be somewhat fast. Really prolonged blocking is the issue.
By modern hardware I assume you mean SSD? Maybe so in that case. I have terabytes of files on cheap hardware, so I use spinning disk instead. I typically assume a seek takes about 10 ms but I just benchmarked it on this system at closer to 20 ms. wikipedia says a seek took 20 ms in the 1980s so basically on this metric I might as well be using 40-year-old hardware. And even the fastest modern hard drives can't do an order of magnitude better. The physics are prohibitive.
> wikipedia says a seek took 20 ms in the 1980s so basically on this metric I might as well be using 40-year-old hardware.

Haha, taking commodity hardware to a new level! You are right, you can't assume what those times might be.
cc @seanmonstar on this issue since he wrote most of the vtable stuff.
@scottlamb I have some questions about the `mp4::File` use-case.

When constructing the `Bytes` backed by `mp4::File`, would it read the entire content of the file as an ordinary `Bytes` backed by `Vec<u8>` immediately or lazily?

If immediately, what is the benefit of not closing the `File` after reading? If the `File` is closed, it seems the ordinary `Bytes` fits the use-case.

If lazily, how would it actually implement the `vtable` interface? The current `vtable` struct needs a plain `ptr: *const u8` and `len: usize`. What would their values be?
> When constructing the `Bytes` backed by `mp4::File`, would it read the entire content of the file as an ordinary `Bytes` backed by `Vec` immediately or lazily?
Neither. There's not a single chunk (~ `Bytes` struct) for the whole file. A `mp4::File` produces a stream of many chunks. In the debug output I linked above, that file had 95 slices. When an HTTP request comes in, my code examines the requested byte range; there's a chunk for the portion of each slice that overlaps with that request, so up to 95 chunks in the stream. Chunks are created on demand via a bounded channel as hyper consumes previous chunks. Each is immediately usable once created.
"Immediately usable"...except for the problem of file-backed chunks causing major page faults that wait for a disk seek. Unless folks have a better idea, I plan to either abandon mmap entirely or copy from a mmaped region to a heap-backed `Bytes` instead to avoid this. I was once thinking of having a `Bytes` to represent a mlocked chunk of a mmaped file, but having a bunch of `mlock`/`munlock` calls doesn't seem like a good idea at all post-Meltdown.
The best thing long-term IMHO would be passing around something to cause an `IORING_OP_SPLICE` to happen from the file to the HTTP stream (with kTLS for `https`), with some fallback for older Linux kernels and non-Linux OSs. But I assume that's out of scope for the bytes crate, and the ecosystem as a whole needs a lot of work before it's possible...
Thanks for the explanation. I think the `Bytes::as_ref(&self) -> &[u8]` API makes it unsuitable for non-contiguous buffers, and it is hard to change that API now. For a complex backend backed by many chunks, perhaps other abstractions like `io::Seek + io::Read`, or their async versions, fit better.
`Buf` is a trait in this crate that has fine support for non-contiguous buffers. The `Bytes` type is meant to be an API to allow easy sharing and slicing of a single contiguous buffer.
I don't think `Bytes` being contiguous is a problem for me. I built Entity::get_range on top of streams of chunks. While I'm using `Buf` today, it's a simple implementation with contiguous buffers like `Bytes` has, as you can see here:

hyper often will batch up my chunks into a single syscall, as you can see from this strace output:
```
8638 writev(42, [{iov_base="HTTP/1.1 206 Partial Content\r\nac"..., iov_len=449}, {iov_base="\0\0\0 ftypisom\0\0\2\0isomiso2avc1mp41", iov_len=32}, {iov_base="\0\24\r,moov\0\0\0xmvhd\1\0\0\0\0\0\0\0\333\23\270\225\0\0\0\0"..., iov_len=48}, {iov_base="\0\1\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=76}, {iov_base="\0\0\0\3\0\24\5[trak\0\0\0\\tkhd\0\0\0\7\333\23\270\225\333\23\270\225"..., iov_len=44}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=52}, {iov_base="\7\200\0\0\48\0\0\0\24\4\367mdia\0\0\0,mdhd\1\0\0\0\0\0\0\0"..., iov_len=60}, {iov_base="\0\0\0!hdlr\0\0\0\0\0\0\0\0vide\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=33}, {iov_base="\0\24\4\242", iov_len=4}, {iov_base="minf\0\0\0\24vmhd\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0$dinf"..., iov_len=60}, {iov_base="\0\24\4bstbl\0\0\0\226stsd\0\0\0\0\0\0\0\1", iov_len=24}, {iov_base="\0\0\0\206avc1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=134}, {iov_base="\0\r/\20stts\0\0\0\0\0\1\245\340", iov_len=16}, {iov_base="\0\0\0\1\0\0\7\10\0\0\0\1\0\0\16N\0\0\0\1\0\0\16\n\0\0\0\1\0\0\7,"..., iov_len=14400}, {iov_base="\0\0\0\1\0\0\7\24\0\0\0\1\0\0\16x\0\0\0\1\0\0\r\240\0\0\0\1\0\0\16\n"..., iov_len=14400}, {iov_base="\0\0\0\1\0\0\10r\0\0\0\1\0\0\r\34\0\0\0\1\0\0\r\220\0\0\0\1\0\0\6\362"..., iov_len=14400}, {iov_base="\0\0\0\1\0\0\6\370\0\0\0\1\0\0\168\0\0\0\1\0\0\r\374\0\0\0\1\0\0\6\374"..., iov_len=14400}], 17) = 58632
```
`Buf`'s additional flexibility wouldn't help me in any significant way that I can see. I can't represent the whole response with a single `Buf` because there's the possibility of error while reading from a file. [edit: mostly in opening the file. Old files get cleaned up to make space for new ones, and it's possible a backing file is deleted by the time the client reaches that portion of the `mp4::File`. Similarly, generating part of the index can fail if that database row has been deleted.] `Buf::bytes_vectored` doesn't return a `Result`, so it can't be lazy when the reading is fallible.
> [Moonfire NVR passes `mmap`-backed chunks to `hyper`] today, but I'm not really sure it's a good idea because of the page fault problem.
FYI, I stopped for this reason (see scottlamb/moonfire-nvr#88). Moonfire still uses `mmap`, but now only within a dedicated per-disk thread. That thread does a `memcpy` into chunks backed by the global allocator. I did not find using all of (`Bytes`, `mmap`, spinning disks, `tokio`) in one place to be a winning combo.

I still do the "more complex shared ownership / minimizing allocations" thing a little, but wouldn't have bothered if I had started with this `mmap` design from the start. I don't have numbers on how much it's saving me, but probably not that much, especially given that the portions I read from HDD have per-read chunk allocations now anyway.
> That thread does a memcpy into chunks backed by the global allocator.

I think you could ask the kernel to load everything at once, to avoid page faults? Copying mmap into memory sounds worse than just reading.
> I think you could ask the kernel to load everything at once, to avoid page fault?

`MAP_POPULATE`? Yes, but there's no guarantee it won't page it back out again unless you also `mlock`.
> Copying mmap into memory sounds worse than just reading

It was faster for me. YMMV. There are several factors. One somewhat surprising one was discussed in this article a while ago: userspace `memcpy` can be faster than the in-kernel `memcpy` because it uses SIMD registers.
> MAP_POPULATE? Yes, but there's no guarantee it won't page it back out again unless you also mlock.

There's also no guarantee that your heap-allocated memory will stay uncompressed in main memory, if the user has enabled swapping or configured something like zswap. I think `madvise` would help prevent the page-out.
> One somewhat surprising one was discussed in this article a while ago: userspace memcpy can be faster than the in-kernel memcpy because it uses SIMD registers.

Thanks, that makes sense.
I have been looking for a zero-copy way of sharing mmap content via `Bytes`. I noticed the vtable feature added by #294 and didn't find follow-ups.

I wonder if `bytes` is interested in getting an `Arc<memmap::Mmap>` version implemented (probably gated by a `mmap` feature that is default off). Or if other approaches are preferred (ex. expose the vtable interface and implement the mmap support in other crates).

I personally think there are very limited types that meaningfully fit the vtable interface (I can hardly think of another aside from the mmap one). So it seems easier for end-users if `bytes` just implements them all (use features to keep the default deps slim). But I can see concerns about `memmap` being unmaintained.

If the next steps are clear, I can help implementing them.