j0sh opened this issue 5 years ago
For large read-only files, why not mmap() them rather than allocate(), read(), process(), free(), if conversion to a Reader is impractical?
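For illustration, a minimal sketch of what that could look like, using `golang.org/x/exp/mmap` (the package choice, file name, and processing step here are just placeholders, not anything in the client today):

```go
// Sketch: map a read-only file and stream it through a hash via io.ReaderAt,
// instead of reading the whole thing into a freshly allocated byte slice.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"

	"golang.org/x/exp/mmap"
)

func main() {
	r, err := mmap.Open("segment.ts") // hypothetical read-only input
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close() // unmaps; nothing may read from the mapping after this

	// Process the mapping without copying the whole file onto the heap first.
	h := sha256.New()
	if _, err := io.Copy(h, io.NewSectionReader(r, 0, int64(r.Len()))); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("size=%d sha256=%x\n", r.Len(), h.Sum(nil))
}
```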
The blobs we deal with usually aren't that large, just a few MB per segment. Using `mmap` or any other manual memory management strategy, such as an arena allocator, has the same caveats as `sync.Pool` as described above: we currently don't have full control over the lifetime of allocated objects, so it would take some groundwork before we could safely use such strategies while minimizing overall memory utilization.
TBH, passing around a `Reader` throughout the processing pipeline might be more error-prone than it's worth, since we'd have to be really careful with any intermediate accesses or errors. Preferably we wouldn't be pushing blobs through the system so much in the first place, but a `Reader` is probably not the right primitive to substitute in.
**Is your feature request related to a problem? Please describe.**
The Go client allocates excessively, and it often takes a long time for GC to run and RAM usage (RSS) to even out. RSS often settles at somewhat concerning levels (e.g., 75% of total system memory), even if immediate memory needs are nowhere near that high.
This flame graph from a stress test exhibits the issue:
`bytes.makeSlice` cumulatively accounts for almost 79% of all allocations, split across two different call stacks. Total allocations for `bytes.makeSlice` amount to over 332GB across a period of roughly 12 hours with 16 concurrent streams on a single OT node with 8 attached GPUs. This eats up over half the working RAM on a 14GB server.

These allocations occur because we slurp video data into memory as soon as it becomes available. The two call stacks where `bytes.makeSlice` is dominant correspond to the points where we receive source inputs (`ServeSegment`) and produce transcoded outputs (`resToTranscodeData`):

**Describe the solution you'd like**
Resolving the issue for each call stack will likely have to be done separately, although there are some commonalities.
`ServeSegment`

`ServeSegment` is the portion of the call stack that receives and processes segmented inputs from HTTP. As soon as a request is received, the entire body is read into a byte slice. This branch accounts for 55.85% of total allocations. To resolve allocations around `ServeSegment`, there are at least two possible approaches:

1. Use `sync.Pool`, but with caveats. Within `ServeSegment`, byte slices can be `Get`'d and the `Put` deferred until the function exits. Since `ServeSegment` is synchronous and only exits when transcoding is complete, this works nicely and would neatly release the allocation back to the pool, even after errors. However, a problem arises with the in-memory object store, which also takes a reference to the same data and holds it indefinitely outside the `ServeSegment` call stack. To mitigate this, the local OS (object store) would need to be changed to make its own copy of the data, preferably from another `sync.Pool`. There are also reasonable arguments that we shouldn't be explicitly maintaining an in-memory local OS at all, and should defer to a traditional filesystem for serving segments (which could in turn be backed by a tmpfs). A rough sketch of this approach follows the list.

2. Pass around the HTTP body's `io.Reader` rather than reading the data into memory immediately. This has the advantage that we don't read the data into memory until it's needed at the transcoder, but it comes with its own set of tradeoffs. Firstly, the local OS would still need some sort of reference to the reader contents, especially for remote transcoders. Secondly, on-chain sig checking is currently done prior to transcoding. We could rearrange the sig check: [tee](https://golang.org/pkg/io/#TeeReader) the reader so its contents also feed a hash, and compute the sig check as the read completes. However, this doesn't offer a mechanism to short-circuit transcoding if the hash doesn't match up. A rough sketch of this approach also follows the list.
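A rough sketch of approach 1, with placeholder names (`segmentPool`, `memOS`, `serveSegment`) standing in for the real handler and in-memory object store; the key points are the deferred `Put` and the store making its own copy:

```go
// Sketch of approach 1: read the request body into a pooled buffer, hand a
// copy to the in-memory store, and release the buffer on every exit path.
// All names here are illustrative, not the client's real types.
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"sync"
)

// segmentPool recycles scratch buffers for request bodies (typically a few MB).
var segmentPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// memOS stands in for the in-memory object store; it copies the data so the
// caller can safely recycle its pooled buffer.
type memOS struct {
	mu   sync.Mutex
	data map[string][]byte
}

func (s *memOS) SaveData(name string, b []byte) {
	cp := append([]byte(nil), b...) // the store's own copy, ideally from its own pool
	s.mu.Lock()
	s.data[name] = cp
	s.mu.Unlock()
}

func serveSegment(store *memOS, w http.ResponseWriter, r *http.Request) {
	buf := segmentPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer segmentPool.Put(buf) // runs even after errors, releasing the allocation

	if _, err := io.Copy(buf, r.Body); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	store.SaveData(r.URL.Path, buf.Bytes()) // safe only because SaveData copies
	// ... transcoding would happen here; serveSegment blocks until it completes ...
	w.WriteHeader(http.StatusOK)
}

func main() {
	store := &memOS{data: map[string][]byte{}}
	http.HandleFunc("/segment/", func(w http.ResponseWriter, r *http.Request) {
		serveSegment(store, w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil)) // port is arbitrary for the sketch
}
```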
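And a rough sketch of approach 2, again with placeholder names (`transcodeFrom`, `verifySig`); note that the hash, and therefore the sig check, only completes after the body has been fully consumed, which is exactly the short-circuit limitation described above:

```go
// Sketch of approach 2: pass the body's io.Reader downstream and tee it into
// a hash, verifying the signature once the read completes.
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
	"io"
	"strings"
)

// transcodeFrom stands in for the transcoder consuming the segment as a stream.
func transcodeFrom(r io.Reader) error {
	_, err := io.Copy(io.Discard, r) // pretend to transcode by draining the reader
	return err
}

// verifySig stands in for the on-chain signature check against the payload hash.
func verifySig(hash []byte, sig string) bool {
	return sig == fmt.Sprintf("%x", hash) // toy check, illustration only
}

func processSegment(body io.Reader, sig string) error {
	h := sha256.New()
	tee := io.TeeReader(body, h) // everything the transcoder reads also feeds the hash

	if err := transcodeFrom(tee); err != nil {
		return err
	}

	// Tradeoff from the issue text: the hash is only complete after the whole
	// body has been consumed, so a bad signature cannot abort the transcode.
	if !verifySig(h.Sum(nil), sig) {
		return errors.New("signature mismatch")
	}
	return nil
}

func main() {
	payload := "fake segment bytes"
	sum := sha256.Sum256([]byte(payload))
	err := processSegment(strings.NewReader(payload), fmt.Sprintf("%x", sum[:]))
	fmt.Println("processed, err =", err)
}
```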
`resToTranscodeData`

`resToTranscodeData` is the portion of the call stack that receives and processes transcoded outputs. This branch accounts for 22.59% of total allocations.

As part of its output, the transcoder returns the filename of the on-disk transcoded data. The `resToTranscodeData` function reads this file into a byte slice immediately, then deletes the file. Other than propagation, the transcoded contents are used twice: once for storing in object storage, and once for computing the hash of the data.

This minimal level of handling may make it OK to simply pass around the file name and use `io.Reader`-style streaming operations directly from the filesystem, without having to pull all the data into memory (see the sketch at the end of this section).

Another option is to use a `sync.Pool`, but this means holding everything in memory for a period of time.

However, both approaches present issues with error handling and object lifetimes: since object ownership becomes unclear, who is responsible for the final clean-up (e.g., file deletion or returning the bytes to the pool)? Technically our data ends up at local storage, but this isn't intuitive; the data may still be dropped mid-path by errors, and later code changes could easily break this assumption. We may have to use finalizers for this, but that is unlikely to improve the predictability of RAM utilization (and may make it worse). This can probably be tuned somewhat with Go's GC configuration, but that introduces an undesirable operational burden on users, and possibly some performance impact.
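For reference, a rough sketch of the streaming option for `resToTranscodeData`, with placeholder names (`saveToOS`, `storeTranscodedFile`) rather than the real APIs; the file is streamed from disk through a hash and into the object store, and the clean-up stays with the one function that owns the file name:

```go
// Sketch: keep only the transcoded file's name, stream it from disk into
// object storage while hashing it, then remove the file.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"
)

// saveToOS stands in for writing the segment into object storage.
func saveToOS(name string, r io.Reader) error {
	_, err := io.Copy(io.Discard, r) // pretend upload by draining the reader
	return err
}

// storeTranscodedFile streams the on-disk result through a hash and into the
// object store, then removes the file; ownership of the clean-up stays here,
// which is the lifetime question the issue raises.
func storeTranscodedFile(fname, objName string) ([]byte, error) {
	f, err := os.Open(fname)
	if err != nil {
		return nil, err
	}
	// LIFO defers: the file is closed first, then removed, even on error paths.
	defer os.Remove(fname)
	defer f.Close()

	h := sha256.New()
	if err := saveToOS(objName, io.TeeReader(f, h)); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

func main() {
	// Create a stand-in "transcoded output" file just for the sketch.
	tmp, err := os.CreateTemp("", "transcoded-*.ts")
	if err != nil {
		log.Fatal(err)
	}
	tmp.WriteString("fake transcoded bytes")
	tmp.Close()

	hash, err := storeTranscodedFile(tmp.Name(), "rendition-0.ts")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("stored, sha256=%x\n", hash)
}
```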