Minimum Viable Thumbnails

livepeer / go-livepeer

Official Go implementation of the Livepeer protocol

http://livepeer.org

MIT License


Minimum Viable Thumbnails #2384

Open iameli opened 2 years ago

iameli commented 2 years ago

Getting JPEG thumbnails out of Livepeer streams is a common request. While thumbnailing can be (and has been) built on a variety of different parts of the pipeline, it's a natural fit for Livepeer Network transcoding; when orchestrators are already processing an entire video stream, the marginal cost of encoding a single-frame JPEG image is negligible.

The following is a proposal for a good-enough system for producing JPEGs that can be shipped relatively quickly.

Core Transcoding Workflow

I propose that we implement this as a new MJPEG video codec alongside our existing support for H264, HEVC, VP8, and VP9. MJPEG as a format is nothing more than concatenated JPEG files, so the common case of one-thumbnail-per-segment can be represented as "a single frame of MJPEG." Shipping support for MJPEG instead of JPEG allows us to reuse all of our configuration parameters for video streams instead of implementing new semantics for representing still output. It also opens the door for utilizing actual multi-frame MJPEG in the future (@Thulinma has some cool ideas for this.)

The first version of JPEG output will use a single CPU JPEG encoder. This is to minimize complexity and maximize development speed for the first iteration of thumbnailing. While this will require a copy from the GPU to the CPU in the case of hardware-accelerated transcoding, and we usually try and avoid those to avoid saturating 1x PCIe links, it's expected that copying a single keyframe per segment won't max out the PCIe bus. A future iteration of this feature could implement nvJPEG for Nvidia cards and similar measures for other hardware.

I have no opinion about the best choice of encoders within ffmpeg for this, but it appears to be supported.
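To make the CPU-encode step concrete, here is a minimal sketch that compresses one decoded frame to JPEG using Go's standard `image/jpeg` package. This is only an illustration: the proposal above would use an encoder within ffmpeg, and the names `encodeThumbnail` and `syntheticFrame` are hypothetical, not part of go-livepeer.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/jpeg"
)

// encodeThumbnail is a hypothetical helper: it compresses one decoded frame
// to JPEG on the CPU. quality follows image/jpeg's 1-100 scale.
func encodeThumbnail(frame image.Image, quality int) ([]byte, error) {
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, frame, &jpeg.Options{Quality: quality}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// syntheticFrame stands in for a frame copied back from the decoder.
func syntheticFrame(w, h int) image.Image {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			img.Set(x, y, color.RGBA{R: uint8(x), G: uint8(y), B: 128, A: 255})
		}
	}
	return img
}

func main() {
	data, err := encodeThumbnail(syntheticFrame(1280, 720), 75)
	if err != nil {
		panic(err)
	}
	fmt.Printf("encoded %d bytes, starts with SOI marker: %v\n",
		len(data), len(data) > 1 && data[0] == 0xFF && data[1] == 0xD8)
}
```

The only data crossing from the decoder to this step is a single raw frame per thumbnail, which is the copy the paragraph above expects to be cheap.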

Wire Protocol

We will need to represent this as a new capability on the wire protocol, Capability_MJPEG_Encode. I hope most of the rest of the semantics will be the same?

JSON Configuration

For -authWebhookUrl responses/Livepeer-Transcode-Configuration structs, we will need to add an "encoder": "MJPEG" to the profiles struct. Single-frame-per-segment semantics can be represented as an MJPEG encoding with an arbitrarily low framerate:

{
  "name": "thumbnail",
  "encoder": "MJPEG",
  "bitrate": 44000,
  "fps": 1,
  "fpsDen": 10000000,
  "height": 720,
  "width": 1280
}

The above would generate a single 1280x720 video frame at approximately 44kb, which is the size of a random YouTube thumbnail I just downloaded.
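As a rough sketch of how a consumer might read such a profile, the Go snippet below parses the example and computes the time between output frames as fpsDen/fps seconds. The struct `thumbnailProfile` and the helpers `parseProfile` and `frameInterval` are hypothetical and mirror only the fields shown above; they are not go-livepeer's actual types. Treating a missing `fpsDen` as 1 is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// thumbnailProfile mirrors only the fields shown in the example above.
// It is a hypothetical struct, not go-livepeer's actual profile type.
type thumbnailProfile struct {
	Name    string `json:"name"`
	Encoder string `json:"encoder"`
	Bitrate int    `json:"bitrate"`
	FPS     uint   `json:"fps"`
	FPSDen  uint   `json:"fpsDen"`
	Height  int    `json:"height"`
	Width   int    `json:"width"`
}

// parseProfile decodes a single profile object from JSON.
func parseProfile(raw string) (thumbnailProfile, error) {
	var p thumbnailProfile
	err := json.Unmarshal([]byte(raw), &p)
	return p, err
}

// frameInterval returns the time between output frames, fpsDen/fps seconds.
// A missing or zero fpsDen is treated as 1 (an assumption of this sketch).
func frameInterval(p thumbnailProfile) (time.Duration, error) {
	if p.FPS == 0 {
		return 0, fmt.Errorf("profile %q: fps must be non-zero", p.Name)
	}
	den := p.FPSDen
	if den == 0 {
		den = 1
	}
	return time.Duration(float64(den) / float64(p.FPS) * float64(time.Second)), nil
}

func main() {
	p, err := parseProfile(`{
	  "name": "thumbnail", "encoder": "MJPEG", "bitrate": 44000,
	  "fps": 1, "fpsDen": 10000000, "height": 720, "width": 1280
	}`)
	if err != nil {
		panic(err)
	}
	interval, err := frameInterval(p)
	if err != nil {
		panic(err)
	}
	// Any realistic segment is far shorter than this interval, so each
	// segment yields a single frame.
	fmt.Printf("%s: %dx%d, one frame every %v\n", p.Encoder, p.Width, p.Height, interval)
}
```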

Open Questions

I proposed MJPEG instead of JPEG to try and make things simpler. Is it actually doing that, or should we take a different route?
While we're in the neighborhood, do we want to support any other video codecs? WebP, perhaps?

AlexKordic commented 2 years ago

MJPEG is a good idea.

Besides "encoder": "MJPEG", we could add a "quality": 0.35 parameter with a nice default value to produce 1280x720@44kb output.

JPEG is supported everywhere, in contrast to WebP or HEVC in HEIF.

iameli commented 2 years ago

Just remembered that our fps field is represented as a fraction; updated my example to make use of the fpsDen field to more properly represent a tiny number.

Thulinma commented 2 years ago

I support @AlexKordic 's suggestion to support a quality parameter, and would even go so far as to say the bit rate should be left out entirely here. (Or at least make it clear which overrides which.)

Also, which format will this be returned in? MJPEG has multiple "standards", including a few common implementations, but ideally we wrap it into something well-understood like MKV. When returning only a single frame, plain JPG without anything around it makes sense, too.

MikeIndiaAlpha commented 2 years ago

I second @Thulinma's caveat that there isn't really an MJPEG "standard"; it is more like "actual graphic data is encoded as raw JPEG data (which is NOT what people usually call JPEG, BTW, because they really mean the JFIF file format)", and then "every man for himself".

What do people mean by thumbnails, @iameli? Perhaps just one thumbnail per segment, or maybe just a few files? An alternative solution may be encoding an .h264 stream with just IDR frames. The functionality provided would be more or less the same, allowing for arbitrary seek, and both the stream and container formats are then well defined.

Finally: I agree that a SW MJPEG encoder may be a good approach for trying it out, but most HW encoder platforms support at least JPEG encode, and given JPEG encode one can create one of the MJPEG flavours cheaply. Also, the bandwidth cost would be smaller because we'd only be sending the compressed bitstream across the bus.

AlexKordic commented 2 years ago

A good point in favor of MKV output is that we keep the timestamp of the original frame. Player logic implementing thumbnail browsing can read each thumbnail and place it properly without needing to know where segments start or end, and we avoid hardcoding whether the Livepeer transcoder extracts the thumbnail from the first or last frame of a segment.

iameli commented 2 years ago

Also, which format will this be returned in? MJPEG has multiple "standards", including a few common implementations, but ideally we wrap it into something well-understood like MKV.

@Thulinma I see — so the standard is a bit more complex than I realized; I was thinking it was just literally concatenated JFIF files. That makes things a bit more complex.

When returning only a single frame, plain JPG without anything around it makes sense, too.

Yeahhhh this is going to be a common use case of the broadcaster — a single accompanying JPEG with each segment. Kinda nice to be able to get at it without transmuxing an MKV. But I wouldn't want it to magically change formats based on whether there's more than one frame present in the output; it'd be weird if you had like fps: 1 set and got different output formats based on whether you sent in a one-second segment or a three-second segment.

If MKV is expedient I'm fine with just using it every time, including for the single-frame case.

I support @AlexKordic 's suggestion to support a quality parameter, and would even go so far as to say the bit rate should be left out entirely here. (Or at least make it clear which overrides which.)

I don't have strong opinions about this except to say I'm not sure how we'd enforce a quality parameter in a decentralized setting. Bitrate is at least objective.

What do people mean by thumbnails, @iameli? Perhaps just one thumbnail per segment, or maybe just a few files? An alternative solution may be encoding an .h264 stream with just IDR frames. The functionality provided would be more or less the same, allowing for arbitrary seek, and both the stream and container formats are then well defined.

@MikeIndiaAlpha The most immediate use case that we have is populating thumbnails for NFT minting. Notice how all of these cool video NFTs just have the Livepeer logo as their image? That's a huge shame, and so we have a use case for actual JPEGs even if other schemes could provide thumbnail-ish functionality. Nice to be able to embed with an <img> tag rather than a <video> too.

A good point in favor of MKV output is that we keep the timestamp of the original frame. Player logic implementing thumbnail browsing can read each thumbnail and place it properly without needing to know where segments start or end, and we avoid hardcoding whether the Livepeer transcoder extracts the thumbnail from the first or last frame of a segment.

@AlexKordic Cool, I'm aligned on MKV if that makes this easier.

Thulinma commented 2 years ago

We'll have timestamps regardless of container - Mist will keep the JPGs as a synchronized extra track 💪 We'd like plain JPG format - since, as @MikeIndiaAlpha mentioned, there are multiple standards, and there is one that can be trivially converted between MJPEG and "normal" JPG files. This standard is the only one supported in modern browsers (AFAIK) and is also used in the MKV container format, among others. There is no MJPEG standard for TS containers AFAIK. 🤔

I'm fine with "magically" changing format depending on frame count! Mist will support all combinations in all cases anyway, so makes no difference there in terms of complexity - and we actually gain a bit of efficiency in the 1-frame-only case all around. I'd say let's do it. (Though the overhead of putting MKV around it is not massive or anything... still, any non-zero overhead is overhead we didn't really need.)
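To illustrate the "trivially converted" flavour mentioned above, here is a hedged Go sketch that splits a stream of complete JPEG files written back to back into individual images. It assumes that flavour really is plain concatenation with nothing between frames; other MJPEG variants would need different handling. The function names are hypothetical. The splitter walks marker segments by their length fields instead of searching for the bytes FF D9, since that byte pair can also occur inside a segment payload.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"image"
	"image/jpeg"
)

// jpegEnd returns the offset just past the EOI marker of the JPEG that
// starts at b[start].
func jpegEnd(b []byte, start int) (int, error) {
	if start+2 > len(b) || b[start] != 0xFF || b[start+1] != 0xD8 {
		return 0, errors.New("missing SOI marker")
	}
	i := start + 2
	for i+1 < len(b) {
		if b[i] != 0xFF {
			i++ // entropy-coded scan data
			continue
		}
		switch m := b[i+1]; {
		case m == 0xD9: // EOI: end of this image
			return i + 2, nil
		case m == 0xFF: // fill byte before a marker
			i++
		case m == 0x00 || m == 0x01 || (m >= 0xD0 && m <= 0xD7):
			i += 2 // stuffed zero, TEM, or restart marker: no length field
		default:
			if i+4 > len(b) {
				return 0, errors.New("truncated segment")
			}
			n := int(b[i+2])<<8 | int(b[i+3]) // length includes its own 2 bytes
			if n < 2 {
				return 0, errors.New("invalid segment length")
			}
			i += 2 + n
		}
	}
	return 0, errors.New("missing EOI marker")
}

// splitJPEGs cuts a concatenation of JPEG files into its individual images.
func splitJPEGs(stream []byte) ([][]byte, error) {
	var frames [][]byte
	for pos := 0; pos < len(stream); {
		end, err := jpegEnd(stream, pos)
		if err != nil {
			return nil, err
		}
		frames = append(frames, stream[pos:end])
		pos = end
	}
	return frames, nil
}

// encodeGray produces a small solid-gray JPEG to use as test input.
func encodeGray(level uint8) []byte {
	img := image.NewGray(image.Rect(0, 0, 16, 16))
	for i := range img.Pix {
		img.Pix[i] = level
	}
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, nil); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	stream := append(encodeGray(0), encodeGray(200)...)
	frames, err := splitJPEGs(stream)
	if err != nil {
		panic(err)
	}
	fmt.Printf("split %d bytes into %d JPEG files\n", len(stream), len(frames))
}
```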

thomshutt commented 2 years ago

Kinda nice to be able to get at it without transmuxing an MKV

This was my initial thought too, what if we drop the fps param for the initial version - keep it as simple as possible and do one frame per segment (whatever the segment size is)?

victorges commented 2 years ago

Kind of a newcomer to everything video, so sorry if I get anything completely wrong here.

While this will require a copy from the GPU to the CPU in the case of hardware-accelerated transcoding, and we usually try and avoid those to avoid saturating 1x PCIe links, it's expected that copying a single keyframe per segment won't max out the PCIe bus.

Does this also mean that we will only support thumbnails from keyframes/beginning of the segment? If not, how would one specify where to grab the thumbnail from? Maybe re-use some of the clipping parameters that we have? On the B->O wire at least, not sure about the auth webhook response. Also not sure how one would specify relative positioning like "the middle frame of the segment" tho.

If we do intend to support keyframe-only thumbnails, could that be a problem? I wondered about short-video NFTs that have a single ~10s segment for example (if that is a thing), which then could have bad keyframes to be used as a thumbnail. For multi-segment ones we could at least try to pick one from the middle of the video in the processing of the file instead so it doesn't seem to be so much of a problem. Not sure if it's an unrealistic problem not worth trying to solve tho.

I'm fine with "magically" changing format depending on frame count! Mist will support all combinations in all cases anyway, so makes no difference there in terms of complexity

Out of context here but I'd prefer the single format output just to have a stabler interface, simpler to document etc. We might have other clients to the broadcaster/Livepeer network as well so we don't want it to be extra complex to implement one. Two such clients right now are the task-runner itself, which processes VOD assets (that create those NFTs), and the transcode-cli, which also pushes segments directly to the broadcaster (and is also related to "VOD" in the general sense).

It might be the case that we want mist to be the only entrypoint to the network tho, supporting all those VOD file-processing use-cases as well. In that case we would have more freedom on the protocol here.

I propose that we implement this as a new MJPEG video codec alongside our existing support for H264, HEVC, VP8, and VP9.

Off-topic: wait, do we already support those additional codecs in the lp network? What is missing for us to add official support in livepeer.com for streaming with those codecs? https://livepeer.com/docs/guides/start-live-streaming/support-matrix If we do make that change, the experience of using the webrtmp-sdk can get drastically better! As in working in almost any browser not only Chrome on Desktop 😃

AlexKordic commented 2 years ago

do we already support those additional codecs

This depends on hardware (Nvidia). H265/HEVC, VP8, and VP9 are newly supported for ingest; H264 and H265 are supported as transcoded output.

how would one specify where to grab the thumbnail from?

Example for one thumbnail every 15 seconds:

{
  "name": "thumbnail",
  "encoder": "MJPEG",
  "fps": 1,
  "fpsDen": 15
}

Sub-second intervals are also possible; the result is clamped to the nearest frame.
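One possible reading of the nearest-frame behaviour, sketched in Go: thumbnail k is wanted at k*fpsDen/fps seconds, and the source frame closest to that time is chosen. The function `thumbnailFrames`, its integer source frame rate, and the choice to drop positions past the end of the segment are all assumptions of this sketch, not a description of what the transcoder does.

```go
package main

import "fmt"

// thumbnailFrames returns the source-frame indexes nearest to each thumbnail
// time k*fpsDen/fps seconds (k = 0, 1, ...) within a segment of totalFrames
// frames at srcFPS frames per second. Repeated indexes are emitted once.
func thumbnailFrames(totalFrames, srcFPS, fps, fpsDen int) []int {
	if totalFrames <= 0 || srcFPS <= 0 || fps <= 0 || fpsDen <= 0 {
		return nil
	}
	var out []int
	for k := 0; ; k++ {
		// Ideal position in source frames is k*fpsDen*srcFPS/fps;
		// adding fps before dividing by 2*fps rounds to the nearest frame.
		num := k * fpsDen * srcFPS
		idx := (2*num + fps) / (2 * fps)
		if idx >= totalFrames {
			break
		}
		if len(out) > 0 && out[len(out)-1] == idx {
			continue // output rate exceeds source rate: skip duplicates
		}
		out = append(out, idx)
	}
	return out
}

func main() {
	// A 30-second segment at 30 fps with one thumbnail every 15 seconds.
	fmt.Println(thumbnailFrames(900, 30, 1, 15))
	// A 1-second segment at 25 fps with three thumbnails per second.
	fmt.Println(thumbnailFrames(25, 25, 3, 1))
}
```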

I wondered about short-video NFTs

We always decode the input video, so we are not limited to keyframes.

Out of context here but I'd prefer the single format output just to have a stabler interface

We do plan to expose a simple interface from Mist. We are discussing here how to preserve all crucial info between Mist and the transcoder.

iameli commented 2 years ago

Thanks for all the input, y'all.

Format

@Thulinma:

I'm fine with "magically" changing format depending on frame count! Mist will support all combinations in all cases anyway, so makes no difference there in terms of complexity

@AlexKordic

We do plan to expose a simple interface from Mist. We are discussing here how to preserve all crucial info between Mist and the transcoder.

There may (and should!) come a day where we're exclusively using Mist as the input here, but until that happens we'll also need to support input from cli-transcoder and stuff — there's not a great way to process a VoD suitable for our needs through Mist at the moment.

@thomshutt

This was my initial thought too, what if we drop the fps param for the initial version - keep it as simple as possible and do one frame per segment (whatever the segment size is)?

If we go that way, I'd want to make sure we also ship clipping (https://github.com/livepeer/go-livepeer/pull/2280) so that we're not limited to picking the first frame of a segment, per @victorges' concern.

Alternative to MKV: what if we just return multiple JPEG files? This is not so hard to do given the multipart return structure that we already have, though I'm less familiar with the websocket improvements that @AlexKordic is working on.
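As a rough illustration of the multiple-JPEG alternative, the Go sketch below packs several JPEG payloads into one multipart body and reads them back using the standard `mime/multipart` package. It does not reproduce go-livepeer's actual response format: the function names, the `thumb_N.jpg` file naming, and the headers used are all placeholders.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/textproto"
)

// writeThumbnails packs each JPEG into its own part of a multipart body and
// returns the boundary the reader will need.
func writeThumbnails(w io.Writer, jpegs [][]byte) (string, error) {
	mw := multipart.NewWriter(w)
	for i, data := range jpegs {
		hdr := textproto.MIMEHeader{}
		hdr.Set("Content-Type", "image/jpeg")
		hdr.Set("Content-Disposition", fmt.Sprintf(`attachment; filename="thumb_%d.jpg"`, i))
		part, err := mw.CreatePart(hdr)
		if err != nil {
			return "", err
		}
		if _, err := part.Write(data); err != nil {
			return "", err
		}
	}
	return mw.Boundary(), mw.Close()
}

// readThumbnails is the receiving side: it returns the payload of every part
// whose Content-Type is image/jpeg and ignores all other parts.
func readThumbnails(r io.Reader, boundary string) ([][]byte, error) {
	mr := multipart.NewReader(r, boundary)
	var out [][]byte
	for {
		part, err := mr.NextPart()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		if part.Header.Get("Content-Type") != "image/jpeg" {
			continue
		}
		data, err := io.ReadAll(part)
		if err != nil {
			return nil, err
		}
		out = append(out, data)
	}
}

// roundTrip writes the thumbnails to an in-memory body and reads them back.
func roundTrip(jpegs [][]byte) ([][]byte, error) {
	var body bytes.Buffer
	boundary, err := writeThumbnails(&body, jpegs)
	if err != nil {
		return nil, err
	}
	return readThumbnails(&body, boundary)
}

func main() {
	// Stand-in payloads; these are not decodable images.
	frames := [][]byte{{0xFF, 0xD8, 0x01, 0xFF, 0xD9}, {0xFF, 0xD8, 0x02, 0xFF, 0xD9}}
	got, err := roundTrip(frames)
	if err != nil {
		panic(err)
	}
	fmt.Printf("recovered %d thumbnails\n", len(got))
}
```

With this shape, the single-frame and multi-frame cases use the same format, and each thumbnail is a plain JPEG that needs no demuxing.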

Keyframe Segments

@victorges

Does this also mean that we will only support thumbnails from keyframes/beginning of the segment?

No, I think I kind of misspoke in saying a single "keyframe" would need to make the copy; I mean to say it's just a single decoded frame that will need to do so. Because the decoding happens on the card, we should be able to pick wherever we want.

If not, how would one specify where to grab the thumbnail from? [...] Also not sure how one would specify relative positioning like "the middle frame of the segment" tho.

Clipping parameters would work for this for sure! Provided you're aware of the duration of the video (which I presume you are, because Content-Duration) you could then just specify a timestamp in the middle of the segment. The other thing you could do is generate one thumbnail per second ({"fps": 1}) and show all of them to the user, allowing them to pick what actually gets minted.

thomshutt commented 2 years ago

though I'm less familiar with the websocket improvements that @AlexKordic is working on.

I think that it's going to just be a control message / binary payload format, so should be able to do anything that the multipart can