mirage / ocaml-cohttp

An OCaml library for HTTP clients and servers using Lwt or Async

Huge GC pressure when serving files with `Cohttp_lwt_unix.Server.respond_file` #207

Open mfp opened 9 years ago

mfp commented 9 years ago

respond_file reads the file in 16 KB chunks using Lwt_io.read. This allocates lots of 16 KB strings that go directly to the major heap and become garbage right away, and is bound to trigger major collection fests. See https://github.com/ocsigen/ocsigenserver/issues/49 for an example of what can happen in a similar case...
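For illustration, a minimal sketch of the kind of loop being described (not the actual `respond_file` source), assuming only Lwt and Lwt_io:

```ocaml
open Lwt.Infix

(* Illustrative sketch of the pattern described above, not respond_file's
   actual code: every Lwt_io.read returns a fresh ~16 KB string. Strings
   this large are allocated directly on the major heap and become garbage
   as soon as the chunk has been written to the socket. *)
let stream_of_channel ic : string Lwt_stream.t =
  Lwt_stream.from (fun () ->
    Lwt_io.read ~count:16384 ic >>= fun chunk ->
    if chunk = "" then Lwt.return_none      (* EOF: end the stream *)
    else Lwt.return (Some chunk))           (* fresh 16 KB string per chunk *)
```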

rgrinberg commented 9 years ago

We should probably just be using sendfile in this case, at least where it's available. Since you are knowledgeable on this topic, what would you recommend?

mfp commented 9 years ago

I'm not especially knowledgeable on this, but for sure serving files is what sendfile(2) was made for. It would only help for plain (non-SSL) connections, which is an important limitation.

There are a few complications though:

  1. portability: both sendfile(2) and the accompanying TCP_CORK option are Linux-only
  2. it doesn't seem to fit Cohttp's current (stream-based) model

Regarding (2), after a quick look (this is the first time I've looked at cohttp's code, so take it with a grain of salt) I believe it might be possible to change both `Cohttp_lwt_body.t` (so that it can be a reference to a file reifiable as a regular stream) and the output channels used elsewhere in order to perform a sort of stream fusion and replace the read-file-into-stream & write-stream-to-socket operations with a single sendfile loop; see the sketch below.
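As a rough illustration of the body change being suggested (the constructor names below are hypothetical, not cohttp's actual types): a body could carry either a stream or a reference to a file, letting a non-SSL backend splice the file with sendfile(2) while other backends reify it as a stream.

```ocaml
(* Hypothetical sketch only; Cohttp_lwt_body.t does not look like this today. *)
type body =
  | Empty
  | Stream of string Lwt_stream.t
  (* A reference to an on-disk file: a sendfile-capable backend can splice it
     straight to the socket, others open it and turn it into a Stream. *)
  | File of { path : string; offset : int64; length : int64 option }
```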

This sounds quite involved, so for the time being simply allocating a string buffer and reusing it (with Lwt_io.read_into) would help a lot. I assume that nobody expects the strings returned by the stream to be immutable (this could be documented in the corresponding interface).
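A hedged sketch of that interim fix, assuming consumers are told the chunks are mutable and only valid until the next pull (the function name and the (buffer, length) stream type are illustrative, not cohttp's API):

```ocaml
open Lwt.Infix

(* One buffer is allocated up front and refilled in place on every pull of
   the stream, so consumers must copy or fully consume each chunk before
   asking for the next one. *)
let reused_chunks ?(chunk_size = 16384) ic : (Bytes.t * int) Lwt_stream.t =
  let buf = Bytes.create chunk_size in
  Lwt_stream.from (fun () ->
    Lwt_io.read_into ic buf 0 chunk_size >>= fun n ->
    if n = 0 then Lwt.return_none        (* EOF *)
    else Lwt.return (Some (buf, n)))     (* same buffer every time, n valid bytes *)
```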

pqwy commented 9 years ago

Another alternative would be to use Cstruct.t. sendfile(2) avoids copying the file into OCaml memory altogether, sure, but just using cstructs mostly bypasses the GC. They are backed by a foreign-allocated Bigarray and are represented on the heap as a 3-slot record, so they are unlikely to get undue promotion out of the young generation.
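For reference, a `Cstruct.t` is essentially this shape (an approximate sketch, not the library's exact source):

```ocaml
(* The payload lives in a char Bigarray allocated outside the OCaml heap,
   and only this small three-field record is visible to the GC. *)
type buffer =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

type cstruct = {
  buffer : buffer;  (* foreign-allocated storage *)
  off    : int;     (* start of this view into the buffer *)
  len    : int;     (* length of this view *)
}
```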

rgrinberg commented 9 years ago

We don't even need cstruct if we adopt core_kernel for every backend since Bigsubstring is analogous AFAIK.

pqwy commented 9 years ago

Yeah. Anything that wraps a Bigarray with an offset/length, so the bulk of the data is allocated off the GC heap and slicing it is cheap.

We use Cstruct.t in TLS pervasively for these reasons.

dbuenzli commented 9 years ago

I'd really prefer cstruct over a pervasive core_kernel dependency; we want portability and fast build times on constrained platforms.

hcarty commented 9 years ago

I second the preference for cstruct over core_kernel.

pqwy commented 9 years ago

@dbuenzli, @hcarty: That's also true. I can't see myself switching TLS and all the related libs to core just for the buffer type anytime soon.

The ecosystem could really use a common bigarray wrapper with zero-copy slicing and a few handy fixed-size extractor funs, that everybody agrees on. Like cstruct, sans syntax extensions, but one that everyone is using...

avsm commented 9 years ago

Cstruct now only optionally depends on camlp4 since 1.5.0, and has an adapter to Async.

mfp commented 9 years ago

On Wed, Nov 26, 2014 at 09:54:52AM -0800, David Kaloper wrote:

Another alternative would be to use Cstruct.t. sendfile(2) avoids copying the file into OCaml memory altogether, sure, but just using cstructs mostly bypasses the GC. They are backed by a foreign-allocated Bigarray and are represented on the heap as a 3-slot record, so they are unlikely to get undue promotion out of the young generation.

I agree that cstructs (or even plain old 1-dimensional bigarrays, maybe with minimal wrapping for offset/length) should be more prevalent in this kind of thing (IO buffers and such). The main advantage is that they're not going to be moved around by the GC, so no copying to a temp buffer is needed when performing IO after releasing the runtime lock.

They are not a panacea however. For instance, non-mmapped bigarrays simply use malloc to reserve the memory and free() it when the bigarray is GCed, which cannot be controlled precisely at a reasonable cost -- you have to do Gc.full_major (). This does not generate extra GC load, but it can easily cause memory fragmentation (and high RSS peaks), since, unlike OCaml's major heap, there is no compaction. So you're going to need some care (and maybe a buffer pool) and have to avoid allocating too much anyway -- not unlike with plain old strings, although for different reasons.

Going back to the present issue, if `Cohttp_lwt_body.t` is changed from `Stream of string Lwt_stream.t` to `Stream of cstruct_like Lwt_stream.t`, you definitely want to reuse the cstruct_like buffer. Also, for compatibility with Lwt_io, which cohttp_lwt* builds upon, I think simple 1-dimensional bigarrays (= `Lwt_bytes.t`) would be best, as in time Lwt_io channels could become able to dump them directly to the underlying fd (if applicable). However, neither `Lwt_io.read_into_bytes` nor `Lwt_io.write_from_bytes` exists at the moment, so... in fact a single plain old string (+ length) reused throughout the life of the stream is going to be close to optimal in `Lwt_io`'s current state.

Mauricio Fernández

pqwy commented 9 years ago

Cstructs are a minimal wrapping of bigarrays with offset/length!

As for Lwt_io.read_into_bytes and its counterpart, they are called Lwt_bytes.read / write, and are what Lwt_cstruct.read / write build upon. It should be relatively straightforward to switch Cohttp_lwt_body.t and friends to cstruct internally, with the damage being api breakage or duplication.

Re: imprecise deallocation and fragmentation.. I have no idea how it would play out. I think those questions are best answered by measurements.

mfp commented 9 years ago

On Wed, Nov 26, 2014 at 11:55:11AM -0800, David Kaloper wrote:

Cstructs are a minimal wrapping of bigarrays with offset/length!

I'm not very lucid today; what I meant was really a non-opaque wrapping, for maximal compatibility/interoperability.

As for Lwt_io.read_into_bytes and its counterpart, they are called Lwt_bytes.read / write, and are what Lwt_cstruct.read / write build upon. It should be relatively straightforward to switch Cohttp_lwt_body.t and friends to cstruct internally, with the damage being api breakage or duplication.

Lwt_bytes.read/write operate on Lwt_unix.file_descr, not Lwt_io channels (as needed to read from/write to TLS channels via Lwt_ssl).

We want to avoid allocation in respond_file's stream, but at some point you're going to want to write it to a Lwt_io channel if you're living in the Lwt ecosystem.

This is why, when serving a file with respond_file, you're going to run into this issue as well: https://github.com/mirage/ocaml-conduit/issues/27

Refer also to https://github.com/ocsigen/lwt/issues/98 (as you can see I went on a little bug reporting spree).

Re: imprecise deallocation and fragmentation.. I have no idea how it would play out. I think those questions are best answered by measurements.

Allocating lots of bigarrays and hoping the GC will release them is just asking for trouble. There's a constant in the bigarray lib used to compute the major slice size which will cause up to 1 GB worth of bigarray data to be allocated before a full major GC cycle is completed. The heap is going to become gruyère cheese if you don't try to reuse the bigarrays (and triggering GC runs manually would generate as much work as plain old strings being allocated in the major heap, but that at least adjusts the GC speed automatically and prevents fragmentation...).

Mauricio Fernández

Chris00 commented 9 years ago

Lwt_bytes.read/write operate on Lwt_unix.file_descr, not Lwt_io channels (as needed to read from/write to TLS channels via Lwt_ssl).

It looks like a first move would be to request Lwt to have “allocation free” Lwt_io.read and Lwt_io.write (working on channels) — it would be useful for other purposes too. Async already has that. Then one can ask Ssl to directly support Bytes.t.
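Something like the following hypothetical signatures would capture that request; the names are made up for illustration and nothing with these names exists in Lwt_io as of this discussion (`Lwt_io.read_into` / `write_from` are the nearest existing operations, and they take `Bytes.t` rather than bigarrays):

```ocaml
(* Hypothetical interface for the "allocation free" channel operations being
   requested above; a sketch, not an existing Lwt API. *)
module type Zero_copy_channel_io = sig
  (* Fill buf[off .. off+len-1] from the channel, returning the number of
     bytes read (0 at EOF), without allocating a fresh string per call. *)
  val read_into_bigarray :
    Lwt_io.input_channel -> Lwt_bytes.t -> int -> int -> int Lwt.t

  (* Write len bytes of buf starting at off, returning how many bytes the
     channel accepted. *)
  val write_from_bigarray :
    Lwt_io.output_channel -> Lwt_bytes.t -> int -> int -> int Lwt.t
end
```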

Chris00 commented 9 years ago

BTW, while we are at it, IO.read should report EOF and an error condition differently — IMHO it is very important to distinguish the two, since one may have to rely on EOF to make sure the complete input was read...

avsm commented 9 years ago

I've opened up #207 to track the IO.read comment -- it's helpful to keep these bug reports separate where possible.

mfp commented 9 years ago

I'm thinking of implementing a trivial buffer pool to prevent allocation of large buffers (both Bytes.t and Lwt_bytes.t/Cstruct.t), starting from conduit and upwards to cohttp and ocsigenserver. I did it in ocsigenserver to great effect (up to ~30% speed up).

This is the API I have in mind (as you can see in the .ml, the implementation is trivial). Reusing buffers entails the risk of accidentally using one after it's been explicitly released, but the places where they are used are clear enough, and the speed gains are great enough to make it worthwhile IMO (for instance, the buffer passed to Lwt_io.make can be released in the ~close function).
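For illustration, a trivial pool along those lines might look roughly like this (the names and signature below are only a sketch, not the actual API being proposed):

```ocaml
(* Rough sketch of a trivial buffer pool in the spirit described above. *)
module Buffer_pool : sig
  type 'a t
  val create : size:int -> make:(int -> 'a) -> 'a t
  val take : 'a t -> 'a              (* reuse a free buffer or make a fresh one *)
  val release : 'a t -> 'a -> unit   (* hand a buffer back; never use it afterwards *)
end = struct
  type 'a t = { size : int; make : int -> 'a; mutable free : 'a list }
  let create ~size ~make = { size; make; free = [] }
  let take t =
    match t.free with
    | b :: rest -> t.free <- rest; b
    | [] -> t.make t.size
  let release t b = t.free <- b :: t.free
end

(* e.g. a pool of 16 KB Lwt_bytes.t IO buffers shared by conduit/cohttp: *)
let io_buffers = Buffer_pool.create ~size:16384 ~make:Lwt_bytes.create
```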

Initially, I was going to just add the buffer pool to conduit, and then reuse it in cohttp and ocsigenserver (at some point, the API could be enriched and generalized, and split off to a separate package). One doubt I have is if it'd be acceptable to expose conduit's buffer pool (I mean the pool "value" itself, not the implementation) so that cohttp and ocsigenserver can reuse it, or if it'd be deemed better to have each one use its own internally (and without exposing it), while reusing the implementation, of course.

In preliminary benchmarks of the ocsigenserver cohttp(_rebased) branch, I've seen it run anywhere from ~50% as fast (for very small responses) to about as fast (the slowdown shrinking as response size grows, down to roughly 5% for 16 KB files) as stock ocsigenserver 2.6 (without any of the allocation prevention work mentioned earlier). I think the higher per-connection overhead can easily be explained by cohttp using 16 KB buffers (vs. 4 KB in stock ocsigenserver).

Also, the precise problem that motivated this issue is similar to the corresponding one in ocsigenserver, so I'd expect about the same speed increase (>2x faster serving of large files) once it's addressed. This would build on the buffer pool work for extra effectiveness.