livepeer / lpms

Livepeer media server

Transcoder Design #51

Open j0sh opened 6 years ago

j0sh commented 6 years ago

There are two primary options when it comes to transcoder design.

Single-threaded, straight through: each profile is encoded serially. Closely matches the current behavior of the transcoder. Note that FFmpeg currently doesn't go much over 100% CPU when transcoding; in many cases this actually leads to the transcode job running slower than real-time. Broadcast latency suffers as a result.

This is the quickest to implement, but offers no potential for parallelization. It should also be no worse than what we have now.

Split the processing into stages.

  1. Demuxing
  2. Decoding
  3. Rescaling (video) or Resampling (audio)
  4. Encoding
  5. Muxing

The benefits of splitting are twofold.

  1. We can calculate the optimal code-path that a given profile should take (a sketch of this selection logic follows the list). For example, if the input and output have the same codecs and resolution, we can simply transmux. If only the codec differs, we can skip rescaling. Different output formats that share the same encoding profile (eg, mpegts and mp4) can re-use the same encoded packets [1].

  2. Allows for increased concurrency: each component can run independently, including having a thread for each rescaler, encoder and/or muxer.
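
For benefit 1, the selection could look something like this minimal Go sketch (the `Step` and `StreamParams` types are made up for illustration; the real decision would work off the demuxed stream parameters):

```go
package transcode

// Step names a pipeline stage a profile may need; these are hypothetical
// types for illustrating the code-path calculation only.
type Step int

const (
	StepRescale Step = iota
	StepEncode
	StepMux
)

// StreamParams is a minimal stand-in for a stream's demuxed properties.
type StreamParams struct {
	Codec  string
	Width  int
	Height int
}

// planVideo returns only the stages a given output profile actually needs.
func planVideo(in, out StreamParams) []Step {
	if in.Codec == out.Codec && in.Width == out.Width && in.Height == out.Height {
		// Same codec and resolution: transmux only, re-using input packets.
		return []Step{StepMux}
	}
	if in.Width == out.Width && in.Height == out.Height {
		// Only the codec differs: skip rescaling.
		return []Step{StepEncode, StepMux}
	}
	return []Step{StepRescale, StepEncode, StepMux}
}
```

Outputs that share an encoding profile (the mpegts/mp4 case) would additionally share the encoder's packets and differ only in the mux step.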

There are three ways we can implement this splitting:

  1. Entirely in C. Simplest approach architecturally, although it would likely require additional scaffolding to handle the bookkeeping and interaction between each component (eg, a thread-safe queue).

  2. Bind each stage individually to Cgo, with the bookkeeping done in go-land; we achieve concurrency via goroutines (see the pipeline sketch after this list). This approach seems the neatest in principle, but there is concern about hitting GOMAXPROCS and creating contention with the scheduler. In general, 'long running' Cgo routines are supposed to be detached to avoid counting against GOMAXPROCS [2][3], but the granularity and frequency of our Cgo calls might work against us [4]. Note that each transcoding profile could have up to 5 Cgo entry points (rescaler, resampler, video encoder, audio encoder, muxer), so our current 3-profile output could have 17 Cgo entry points (demuxer, decoder, 3x5 per-profile). While we should still achieve some semblance of work interleaving, this approach might actually make things worse if it disrupts the scheduler too much.

  3. Rust to manage the bookkeeping and concurrency. Since Rust can expose a C-compatible FFI, this can be bound with Cgo as well, through a single entry point. The drawback is that this adds a rather large component to the build system that might not be justified.
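
To make option 2 concrete, a minimal Go sketch of the shape it could take; each stage body is where an individual Cgo call would go, and the channels stand in for the thread-safe queue the all-C variant would otherwise need:

```go
package transcode

// Frame stands in for a decoded frame handle; each stage body below is
// where an individual Cgo call (rescaler, encoder, muxer, ...) would live.
type Frame struct{}

// stage runs one pipeline component in its own goroutine, turning a channel
// of inputs into a channel of outputs.
func stage(in <-chan Frame, fn func(Frame) Frame) <-chan Frame {
	out := make(chan Frame)
	go func() {
		defer close(out)
		for f := range in {
			out <- fn(f)
		}
	}()
	return out
}

// A single profile's chain; fanning out to several profiles would mean
// duplicating decoded frames onto one such chain per profile.
func runProfile(decoded <-chan Frame, rescale, encode, mux func(Frame) Frame) {
	for range stage(stage(stage(decoded, rescale), encode), mux) {
		// muxer output drained; a real muxer would write to the container
	}
}
```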

[1] Technically we'd have to run a bitstream filter (eg, FFmpeg's h264_mp4toannexb) to convert between MP4's length-prefixed AVCC format and the Annex B format used by transport streams, but that's a much lighter operation than a re-encode.

[2] Working off the assumption that we shouldn't be overriding the user's GOMAXPROCS for them, eg by wrapping the livepeer go-binary in a shell script.

[3] https://github.com/golang/go/issues/8636#issuecomment-66098275

[4] Not to mention that crossing the Go-Cgo boundary has its own overhead, although I'm not sure how much that would affect us in practice.

ericxtang commented 6 years ago

A few questions:

My personal preference here is to take the most direct path for now (the single-threaded approach). I believe GPU-based transcoding will be the next big performance booster (especially in the mining context). This will require some more work around verification methods (quantitative comparisons like perceptual hashing, VMAF), but I know it's possible, and it'll quickly overshadow CPU-based transcoding. Are some of these optimization techniques still applicable in a GPU-based workflow?

j0sh commented 6 years ago

> I understand the current implementation is single-threaded; however, it's single-threaded on a segment level. This means we can still use multiple cores when we are transcoding multiple segments at the same time, right?

Correct, we can encode multiple segments in parallel, while keeping the encode single-threaded for each segment. That's not how it works right now with the CLI, however; outputs are encoded sequentially.
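
Something like this Go sketch is the idea; `transcodeSegment` is hypothetical, standing in for a single-threaded encode (eg, one Cgo call with threads pinned to 1):

```go
package transcode

import "sync"

// transcodeSegment is a hypothetical single-threaded transcode of one
// segment into one profile.
func transcodeSegment(segment []byte, profile string) error {
	// ... single-threaded encode ...
	return nil
}

// transcodeAll fans one segment out to every profile concurrently; the same
// pattern applies across segments arriving from different streams.
func transcodeAll(segment []byte, profiles []string) {
	var wg sync.WaitGroup
	for _, p := range profiles {
		wg.Add(1)
		go func(profile string) {
			defer wg.Done()
			_ = transcodeSegment(segment, profile)
		}(p)
	}
	wg.Wait()
}
```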

> Do we understand what causes the processing time to be longer than real-time? I'm guessing it's CPU-bound?

Yes, CPU bound.

On an (un)related note, this is a potential protocol-level vulnerability that might need to be addressed: transcoders deliberately oversubscribing their capacity and not getting new segments out in a timely manner. More often this would be the result of inadequate hardware, not malice. In the case of the latter, the 256-block claim limit helps mitigate the risk somewhat, and transcoders don't want to be making too many claims. Haven't checked this part of the protocol myself, but could transcoders get away with trickling out segments and leaving viewers hanging for minutes at a time?

> What's the benefit of using Rust here? Does it make bookkeeping and concurrency much easier?

A major design goal of Rust is to ensure compile-time thread safety. We would have a single transcode entry point in Cgo, and manage as many threads within that as needed, outside the Go runtime. Additionally, Rust's expressive types offer strong compile-time guarantees, so we might as well leverage that for the 'bookkeeping' such as calculating and building the flow graph. Again, this massively increases the surface area of the build system, which is probably not a tradeoff we want to make at the moment.
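
For illustration, the Go side would then reduce to something like the sketch below. The `lpms_transcode` symbol is hypothetical and stubbed out so the sketch compiles; the point is that we'd cross the Cgo boundary once per job, with the native core (Rust or C) owning all the threads:

```go
package transcode

/*
#include <stdlib.h>

// Stand-in for the native core's single C-ABI entry point; a Rust cdylib
// (or the C implementation) would export the real symbol. Stubbed here so
// the sketch compiles; the name lpms_transcode is hypothetical.
static int lpms_transcode(const char *in, const char *out_profiles) {
	return 0; // demux/decode/rescale/encode/mux threads all live behind this
}
*/
import "C"

import (
	"fmt"
	"unsafe"
)

// Transcode crosses the Cgo boundary exactly once per job; the native side
// manages as many threads as it needs, outside the Go runtime.
func Transcode(in, outProfiles string) error {
	cin, cout := C.CString(in), C.CString(outProfiles)
	defer C.free(unsafe.Pointer(cin))
	defer C.free(unsafe.Pointer(cout))
	if rc := C.lpms_transcode(cin, cout); rc != 0 {
		return fmt.Errorf("transcode failed: %d", rc)
	}
	return nil
}
```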

> Are some of these optimization techniques still applicable in a GPU-based workflow?

There are some workloads in the transcoding pipeline that might benefit from GPU (such as colorspace conversion), but encoding generally benefits more from SIMD (AVX) or fixed-function hardware (QuickSync). That being said, FFmpeg already supports the Intel Media SDK, which I believe is able to run certain operations natively on the (Intel?) GPU. I'm hoping that enabling Media SDK support is as simple as installing the library and setting the ffmpeg configure flag. We'd likely need run-time hardware detection as well.

GPUs might help more with verification, but it'd depend on the method we choose.

> My personal preference here is to take the most direct path for now (the single-threaded approach).

That's the most expedient approach for now. I'll write it to minimize the refactoring required to transition to a multithreaded approach, which should yield a better-structured codebase anyway.

ericxtang commented 6 years ago

@j0sh made a wiki page for the transcoder design - https://github.com/livepeer/lpms/wiki/Transcoder-Design

ericxtang commented 6 years ago

Questions:

j0sh commented 6 years ago

> Can we detect when multiple streams are in the container?

Yes, the trouble would be in selecting the appropriate stream. There is an API (av_find_best_stream) to "find the best stream" for a given media type, but there's no guarantee it will pick the correct one. https://gist.github.com/j0sh/39d7edadb526dece0df1679dd68588c1#file-transcoder-c-L214
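
Detecting the condition itself is straightforward; for example, a sketch that counts streams per media type over an already-opened context (assumes FFmpeg >= 3.1 for codecpar, with headers available via pkg-config):

```go
package transcode

/*
#cgo pkg-config: libavformat
#include <libavformat/avformat.h>
*/
import "C"

import "unsafe"

// countStreams tallies the streams of one media type (eg, pass
// int32(C.AVMEDIA_TYPE_VIDEO)) in an already-opened demuxer context.
func countStreams(ic *C.AVFormatContext, mediaType int32) int {
	n := int(ic.nb_streams)
	streams := (*[1 << 20]*C.AVStream)(unsafe.Pointer(ic.streams))[:n:n]
	count := 0
	for _, st := range streams {
		if int32(st.codecpar.codec_type) == mediaType {
			count++
		}
	}
	return count
}
```

If this returns more than 1 for a media type, we'd have grounds to reject the input.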

> About GOP - if we want to enforce a segment length (let's say 2s chunks), can we allow both 1s GOP and 2s GOP? (Or do we only allow 2s GOP?)

We can have a 1s GOP with a 2s segment; it's when GOP length > segment length that it becomes problematic. There might not even be a closed GOP at all, eg for encoders set to use periodic intra refresh / gradual decoder refresh.

> Can we detect GOP length that we don't support?

We could (if the muxer doesn't warn already), but I'm not sure what our options would be after that point.
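
For illustration, one rough detection approach is to scan keyframe spacing across a segment's packets; the `Packet` type here is a stand-in for what av_read_frame would hand back through Cgo:

```go
package transcode

// Packet is a stand-in for a demuxed video packet; PTS is in seconds for
// simplicity.
type Packet struct {
	PTS      float64
	Keyframe bool
}

// gopTooLong reports whether any keyframe interval in the segment exceeds
// the target segment length, or whether no keyframe appears at all (as with
// periodic-intra-refresh encodes, which have no closed GOP to find).
func gopTooLong(pkts []Packet, segLen float64) bool {
	lastKey, seenKey := 0.0, false
	for _, p := range pkts {
		if !p.Keyframe {
			continue
		}
		if seenKey && p.PTS-lastKey > segLen {
			return true
		}
		lastKey, seenKey = p.PTS, true
	}
	return !seenKey // no keyframe at all: nothing we can segment on
}
```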

ericxtang commented 6 years ago

I've gotten the advice to "be more restrictive on the video input side". Intuitively it makes sense: the more flexible we want to be, the more complex our system becomes. I'm OK with dropping connections unless they are in the specific formats we support.

j0sh commented 6 years ago

Added a note to the wiki saying this:

> ⚠️ We should stop the job if there is more than one stream per media type in an input.

Presumably we will need to validate that the broadcaster is indeed sending the transcoder invalid input, so the transcoder doesn't get slashed for not doing the work.

Will experiment a bit with the GOP to see what our options are there.

ericxtang commented 6 years ago

Currently the transcoder doesn't get slashed for not doing the work. They are economically incentivized to do the work so they can make fees, and their stats will be published so they have "social pressure" to do work. Of course there is a potential problem with a malicious broadcaster creating many jobs that contain bad video, but there is a cost to that.

dob commented 6 years ago

Just wanted to chime in here to remind everyone to keep deterministic transcoding in mind for the time being. I don't think the multithreaded approach as described would prevent that, since it's one thread per phase rather than multiple encode/decode threads working simultaneously on a single stream, but I wanted to point out that our current verification method depends on the transcode result being predictable bit-for-bit.