quark-zju opened this issue 4 years ago
For what it's worth, I'm also interested in a `Bytes`-like interface to a memory-mapped region.
Hi @spl, FYI I ended up reinventing it. It's not a lot of code if `Arc<dyn Trait>` is used.
@quark-zju Nice! Your `BytesOwner` is a neat approach. I drafted something using the following:

```rust
enum Buf {
    Mmap(Mmap),
    Static(&'static [u8]),
    Vec(Vec<u8>),
}

pub struct SharedBuf {
    buf: Arc<Buf>,
    offset: usize,
    len: usize,
}
```
Of course, this wasn't intended for general-purpose use as a library.
Could `bytes` provide something like the following from @quark-zju's library?

```rust
impl Bytes {
    pub fn from_owner(value: impl BytesOwner) -> Self { ... }
}

pub trait BytesOwner: AsRef<[u8]> + Send + Sync + 'static {}
```
I personally think there are very limited types that meaningfully fit the vtable interface (I can hardly think of another aside from the mmap one).
Here's some food for thought. Moonfire NVR currently uses a `bytes::Buf` implementation that's a wrapper around `reffers::ARefss<'static, [u8]>`. It puts these into a `hyper::body::HttpBody` implementation and sends them to `hyper`. It does two things with this: passing `mmap`-backed chunks, and more complex shared ownership to minimize allocations.

On the mmap part: it does this today, but I'm not really sure it's a good idea because of the page fault problem. As lucio-rs suggested here, it might be better to simply copy stuff from the mmaped region to a "normal" bytes rather than having a mmap-backed bytes. My current approach of not dealing with this at all means that the tokio threads may stall on page faults, which is not so great. It's really not good enough to have a bytes object for a mmap()ed region which is only special for loading and unloading, not also when reading from it.
tl;dr version: there's one large object `moonfire_nvr::mp4::File` which lives for the entire response, is reference-counted all together via `struct File(Arc<FileInner>)`, and has a small number of `Vec`s which turn into a larger number of HTTP response chunks. I'd need more allocations (`bytes::bytes::Shared` ones) to duplicate what it does with `bytes::Bytes` as it exists today.
Some detail, in case you're curious: mp4.rs constructs a `.mp4` file on-the-fly, including:

- `buf: Vec<u8>`, which has a bunch of dynamic file metadata (including stuff like lengths of structures, timestamps, etc.) that gets stuffed between other parts of the file. It's nice to have that backed by the overall `mp4::File` rather than have a separate `bytes::bytes::Shared` allocation.
- index data (`stts` = sample time table, `stsz` = sample size table, `stss` = sync sample table). It's most efficient to generate all the indexes for a segment at once (but only if the HTTP request actually asks for a byte range that overlaps with at least one of them) and stuff them into one `Vec`. It's nice to not need another per-segment allocation for a `bytes::bytes::Shared`.

Here's the `Debug` output on a typical `mp4::File`. This structure gets generated on every HTTP request, then chunks may or may not actually get generated and served depending on the byte range the browser asked for.
> My current approach of not dealing with this at all means that the tokio threads may stall on page faults, which is not so great.

This effect may honestly not be horrible; on modern hardware it should be somewhat fast. Really prolonged blocking is the issue.
By modern hardware I assume you mean SSD? Maybe so in that case. I have terabytes of files on cheap hardware, so I use spinning disk instead. I typically assume a seek takes about 10 ms but I just benchmarked it on this system at closer to 20 ms. wikipedia says a seek took 20 ms in the 1980s so basically on this metric I might as well be using 40-year-old hardware. And even the fastest modern hard drives can't do an order of magnitude better. The physics are prohibitive.
> wikipedia says a seek took 20 ms in the 1980s so basically on this metric I might as well be using 40-year-old hardware.

Haha, taking commodity hardware to a new level! You are right, you can't assume what those times might be.
cc @seanmonstar on this issue since he wrote most of the vtable stuff.
@scottlamb I have some questions about the `mp4::File` use-case.

When constructing the `Bytes` backed by `mp4::File`, would it read the entire content of the file as an ordinary `Bytes` backed by `Vec<u8>` immediately or lazily?

If immediately, what is the benefit of not closing the `File` after reading? If the `File` is closed, it seems the ordinary `Bytes` fits the use-case.

If lazily, how would it actually implement the `vtable` interface? The current `vtable` struct needs a plain `ptr: *const u8` and `len: usize`. What would their values be?
> When constructing the `Bytes` backed by `mp4::File`, would it read the entire content of the file as an ordinary `Bytes` backed by `Vec` immediately or lazily?
Neither. There's not a single chunk (~ `Bytes` struct) for the whole file. A `mp4::File` produces a stream of many chunks. In the debug output I linked above, that file had 95 slices. When an HTTP request comes in, my code examines the requested byte range; there's a chunk for the portion of each slice that overlaps with that request, so up to 95 chunks in the stream. Chunks are created on demand via a bounded channel as hyper consumes previous chunks. Each is immediately usable once created.
"Immediately usable"...except for the problem of file-backed chunks causing major page faults that wait for a disk seek. Unless folks have a better idea, I plan to either abandon mmap entirely or copy from a mmaped region to a heap-backed `Bytes` instead to avoid this. I was once thinking of having a `Bytes` to represent a mlocked chunk of a mmaped file, but having a bunch of `mlock`/`munlock` calls doesn't seem like a good idea at all post-Meltdown.
The best thing long-term IMHO would be passing around something to cause an `IORING_OP_SPLICE` to happen from the file to the HTTP stream (with kTLS for `https`), with some fallback for older Linux kernels and non-Linux OSs. But I assume that's out of scope for the bytes crate, and the ecosystem as a whole needs a lot of work before it's possible...
Thanks for the explanation. I think the `Bytes::as_ref(&self) -> &[u8]` API makes it unsuitable for non-contiguous buffers, and it is hard to change that API now. For a complex backend backed by many chunks, perhaps other abstractions like `io::Seek + io::Read`, or their async versions, fit better.
`Buf` is a trait in this crate that has fine support for non-contiguous buffers. The `Bytes` type is meant to be an API to allow easy sharing and slicing of a single contiguous buffer.
I don't think `Bytes` being contiguous is a problem for me. I built Entity::get_range on top of streams of chunks. While I'm using `Buf` today, it's a simple implementation with contiguous buffers like `Bytes` has, as you can see here:

hyper often will batch up my chunks into a single syscall, as you can see from this strace output:
```
8638 writev(42, [{iov_base="HTTP/1.1 206 Partial Content\r\nac"..., iov_len=449}, {iov_base="\0\0\0 ftypisom\0\0\2\0isomiso2avc1mp41", iov_len=32}, {iov_base="\0\24\r,moov\0\0\0xmvhd\1\0\0\0\0\0\0\0\333\23\270\225\0\0\0\0"..., iov_len=48}, {iov_base="\0\1\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=76}, {iov_base="\0\0\0\3\0\24\5[trak\0\0\0\\tkhd\0\0\0\7\333\23\270\225\333\23\270\225"..., iov_len=44}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=52}, {iov_base="\7\200\0\0\48\0\0\0\24\4\367mdia\0\0\0,mdhd\1\0\0\0\0\0\0\0"..., iov_len=60}, {iov_base="\0\0\0!hdlr\0\0\0\0\0\0\0\0vide\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=33}, {iov_base="\0\24\4\242", iov_len=4}, {iov_base="minf\0\0\0\24vmhd\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0$dinf"..., iov_len=60}, {iov_base="\0\24\4bstbl\0\0\0\226stsd\0\0\0\0\0\0\0\1", iov_len=24}, {iov_base="\0\0\0\206avc1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=134}, {iov_base="\0\r/\20stts\0\0\0\0\0\1\245\340", iov_len=16}, {iov_base="\0\0\0\1\0\0\7\10\0\0\0\1\0\0\16N\0\0\0\1\0\0\16\n\0\0\0\1\0\0\7,"..., iov_len=14400}, {iov_base="\0\0\0\1\0\0\7\24\0\0\0\1\0\0\16x\0\0\0\1\0\0\r\240\0\0\0\1\0\0\16\n"..., iov_len=14400}, {iov_base="\0\0\0\1\0\0\10r\0\0\0\1\0\0\r\34\0\0\0\1\0\0\r\220\0\0\0\1\0\0\6\362"..., iov_len=14400}, {iov_base="\0\0\0\1\0\0\6\370\0\0\0\1\0\0\168\0\0\0\1\0\0\r\374\0\0\0\1\0\0\6\374"..., iov_len=14400}], 17) = 58632
```
`Buf`'s additional flexibility wouldn't help me in any significant way that I can see. I can't represent the whole response with a single `Buf` because there's the possibility of error while reading from a file. [edit: mostly in opening the file. Old files get cleaned up to make space for new ones, and it's possible a backing file is deleted by the time the client reaches that portion of the `mp4::File`. Similarly, generating part of the index can fail if that database row has been deleted.] `Buf::bytes_vectored` doesn't return a `Result`, so it can't be lazy when the reading is fallible.
> [Moonfire NVR passes `mmap`-backed chunks to `hyper`] today, but I'm not really sure it's a good idea because of the page fault problem.
FYI, I stopped for this reason (see scottlamb/moonfire-nvr#88). Moonfire still uses `mmap`, but now only within a dedicated per-disk thread. That thread does a `memcpy` into chunks backed by the global allocator. I did not find using all of (`Bytes`, `mmap`, spinning disks, `tokio`) in one place to be a winning combo.

I still do the "more complex shared ownership / minimizing allocations" thing a little, but wouldn't have bothered if I had started with this `mmap` design from the start. I don't have numbers on how much it's saving me, but probably not that much, especially given that the portions I read from HDD have per-read chunk allocations now anyway.
> That thread does a memcpy into chunks backed by the global allocator.

I think you could ask the kernel to load everything at once, to avoid page faults? Copying mmap into memory sounds worse than just reading.
> I think you could ask the kernel to load everything at once, to avoid page fault?

`MAP_POPULATE`? Yes, but there's no guarantee it won't page it back out again unless you also `mlock`.
> Copying mmap into memory sounds worse than just reading

It was faster for me. YMMV. There are several factors. One somewhat surprising one was discussed in this article a while ago: userspace `memcpy` can be faster than the in-kernel `memcpy` because it uses SIMD registers.
> MAP_POPULATE? Yes, but there's no guarantee it won't page it back out again unless you also mlock.

There's also no guarantee that your heap-allocated memory will stay uncompressed in main memory, if the user has enabled swapping or configured something like zswap. I think `madvise` would help prevent the page-out.
> One somewhat surprising one was discussed in this article a while ago: userspace memcpy can be faster than the in-kernel memcpy because it uses SIMD registers.

Thanks, that makes sense.
I have been looking for a zero-copy way of sharing mmap content via `Bytes`. I noticed the vtable feature added by #294 and didn't find follow-ups.

I wonder if `bytes` is interested in getting an `Arc<memmap::Mmap>` version implemented (probably gated by a `mmap` feature that is default off). Or if other approaches are preferred (ex. expose the vtable interface and implement the mmap support in other crates).

I personally think there are very limited types that meaningfully fit the vtable interface (I can hardly think of another aside from the mmap one). So it seems easier for end-users if `bytes` just implements them all (use features to keep the default deps slim). But I can see concerns about `memmap` being unmaintained.

If the next steps are clear, I can help implementing them.