livepeer / go-livepeer

Official Go implementation of the Livepeer protocol

proposal: stream recording #1565

Closed iameli closed 3 years ago

iameli commented 4 years ago

Abstract

Users want to record streams. Our existing object stores get us most of the way there; let's make them a bit more generic and then implement the specific features needed for this case.

Prior Art

This spec is an attempt to unify a few different OS-related issues:

Motivation

There are a few different situations where we'd want to interact with object stores.

  1. The currently-implemented workflow whereby the OS becomes the primary source of data
  2. For debugging purposes we may want to upload segments that fail to transcode (https://github.com/livepeer/go-livepeer/pull/1398)
  3. Users want to record their streams. While this can be partially accomplished by the current workflow, it is undesirable to add an object store in the middle of a functional transcode workflow and add latency. Rather, we would prefer to have a broadcaster concurrently push the video to the OS and the orchestrator.

We currently have 1 and part of 2.

This is a three-part proposal. Part 1 proposes updating the syntax for OSes so they can be referred to in a more generic way. Part 2 discusses the corresponding webhook syntax that allows customization of OSes on a per-stream basis. Part 3 discusses a new type of OS, the -recordObjectStore, that behaves in a manner suitable for recording livestreams.

Proposed Solution

TLDR: Change OS to have the form s3://ACCESS_KEY_ID:SECRET_ACCESS_KEY@eu-central-1/testbucket. These URLs are passed via the -objectStore, -failObjectStore, and -recordObjectStore flags. Implement per-stream webhook OS configuration.
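For illustration, here is roughly what parsing that URL form could look like using only Go's standard net/url package. The osConfig struct and parseOSURL helper are hypothetical sketches, not the actual go-livepeer code (the real syntax discussion moved to #1572):

package main

import (
    "fmt"
    "net/url"
    "strings"
)

type osConfig struct {
    driver    string // "s3", "file", ...
    accessKey string
    secretKey string
    region    string
    bucket    string
}

func parseOSURL(raw string) (*osConfig, error) {
    u, err := url.Parse(raw)
    if err != nil {
        return nil, err
    }
    secret, _ := u.User.Password() // nil-safe; empty for file:// URLs
    return &osConfig{
        driver:    u.Scheme,
        accessKey: u.User.Username(),
        secretKey: secret,
        region:    u.Host,                          // "eu-central-1"
        bucket:    strings.TrimPrefix(u.Path, "/"), // "testbucket"
    }, nil
}

func main() {
    cfg, err := parseOSURL("s3://ACCESS_KEY_ID:SECRET_ACCESS_KEY@eu-central-1/testbucket")
    if err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", cfg)
}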

Part 1: Refactor OS syntax into one unified URL-based option.

EDIT: Refactored out to https://github.com/livepeer/go-livepeer/issues/1572, please take discussion there.

Part 2: Allow for webhooks to customize OS on a per-stream basis

Utilizing the above, the syntax for this becomes very easy:

{
  "objectStore": "s3://ACCESS_KEY_ID:SECRET_ACCESS_KEY@eu-central-1/testbucket",
  "failObjectStore": "file:///home/iameli/failed-segments"
}

That's not to say the implementation will be easy, but I think the spec is pretty clear.

(Currently -failObjectStore is only implemented for OTs, but it would be a good feature to have on Bs as well.)
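To give a sense of the per-stream plumbing, here is a minimal sketch of decoding the webhook response shown above. Only the JSON field names come from this spec; the webhookResponse struct and the fallback behavior are assumptions for illustration:

package main

import (
    "encoding/json"
    "fmt"
)

// webhookResponse mirrors the JSON fields above; a field left empty would
// fall back to the corresponding node-level CLI flag.
type webhookResponse struct {
    ObjectStore       string `json:"objectStore,omitempty"`
    FailObjectStore   string `json:"failObjectStore,omitempty"`
    RecordObjectStore string `json:"recordObjectStore,omitempty"`
}

func main() {
    body := []byte(`{
        "objectStore": "s3://ACCESS_KEY_ID:SECRET_ACCESS_KEY@eu-central-1/testbucket",
        "failObjectStore": "file:///home/iameli/failed-segments"
    }`)
    var resp webhookResponse
    if err := json.Unmarshal(body, &resp); err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", resp) // RecordObjectStore empty: no recording for this stream
}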

Part 3: Record Mode

The above will allow for customizability of OS but it doesn't address the "recording" use case, as discussed above. So, another CLI parameter and webhook field:

livepeer -broadcaster -recordObjectStore s3://ACCESS_KEY_ID:SECRET_ACCESS_KEY@eu-central-1/testbucket
{
  "recordObjectStore": "s3://ACCESS_KEY_ID:SECRET_ACCESS_KEY@eu-central-1/testbucket"
}

The behavior of streams with -recordObjectStore enabled is as follows:

  1. After receiving an incoming segment from a client, upload the segment to the OS in parallel with the upload to your chosen O (a rough sketch follows this list).
  2. After getting data back from the O, return that data to the client (in the multipart case). In parallel, upload the transcoded segments to the OS.
  3. After all uploads for a segment complete, push new m3u8 manifests for each rendition + a primary m3u8 manifest that references all the renditions. These manifests should contain all the segments transcoded in the stream, not just the most recent ones.
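Here is a rough sketch of step 1, assuming hypothetical ObjectStore and Orchestrator interfaces (none of these names come from the actual codebase): the source segment goes to the record OS and to the O concurrently, so recording adds no latency to the transcode path.

package main

import (
    "context"
    "fmt"

    "golang.org/x/sync/errgroup"
)

// ObjectStore and Orchestrator are stand-ins for the broadcaster's
// internals, just enough to show the fan-out.
type ObjectStore interface {
    SaveData(ctx context.Context, name string, data []byte) error
}

type Orchestrator interface {
    SubmitSegment(ctx context.Context, name string, data []byte) ([][]byte, error)
}

// handleSegment pushes the source segment to the record OS and submits it
// to the orchestrator concurrently, then waits on both.
func handleSegment(ctx context.Context, recordOS ObjectStore, orch Orchestrator, name string, seg []byte) ([][]byte, error) {
    g, ctx := errgroup.WithContext(ctx)
    var transcoded [][]byte

    g.Go(func() error { // upload source segment to the record OS
        return recordOS.SaveData(ctx, "source/"+name, seg)
    })
    g.Go(func() error { // concurrently submit to the chosen O
        var err error
        transcoded, err = orch.SubmitSegment(ctx, name, seg)
        return err
    })

    if err := g.Wait(); err != nil {
        return nil, err
    }
    return transcoded, nil
}

type memOS struct{}

func (memOS) SaveData(ctx context.Context, name string, data []byte) error {
    fmt.Println("recorded", name, len(data), "bytes")
    return nil
}

type fakeOrch struct{}

func (fakeOrch) SubmitSegment(ctx context.Context, name string, data []byte) ([][]byte, error) {
    return [][]byte{data}, nil // pretend one rendition comes back
}

func main() {
    out, err := handleSegment(context.Background(), memOS{}, fakeOrch{}, "165.ts", []byte("tsdata"))
    if err != nil {
        panic(err)
    }
    fmt.Println("renditions:", len(out))
}

In practice we would probably want a record-OS failure to be logged rather than abort the segment, so a flaky recording bucket can't take down an otherwise healthy transcode; errgroup's fail-together semantics are just for brevity here.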

Implementation Tasks and Considerations

These will become their own tickets and would probably be carried out by different people.

  1. Refactor CLI to use the new syntax (maintaining backwards compatibility)
  2. Refactor existing S3-compatible OS work to the new syntax
  3. Refactor existing failed segment work to the new syntax
  4. Add support for customizing OSes on a per-stream basis via webhook requests
  5. Add recordObjectStore
  6. Add support for all of the above to the REST API

Testing Tasks and Considerations

Certainly there should be tests of per-stream OS and whatnot but I don't think there are any unique testing challenges here.

Known Unknowns

How do we handle manifest recording when streams switch broadcasters mid-broadcast? Is there a node-first answer to this question, or must it be handled on the infrastructural side? Should we "load" existing segments from a manifest if we find it in the OS when we start handling a new stream? Wouldn't that break everything if we stream to the same manifestID twice in a row? What about gaps in sequence number and whatnot?

The above means I kinda want the ability to send media and manifests to different endpoints? The media gets stored; the manifests get parsed and combined as appropriate? I dunno.

EDIT: How should we handle non-MPEG-TS output? Uploading a bunch of MP4 files somewhere sounds fine, but what about manifests?

Alternatives

EDIT: Moved to https://github.com/livepeer/go-livepeer/issues/1572

f1l1b0x commented 4 years ago

Having the O upload the files might introduce a security/trust issue in the public network, since the B will want to validate before something airs or gets stored.

wohlner commented 4 years ago

Related to https://github.com/livepeer/go-livepeer/commit/0b455a18053200c2f6b7e1f5139af5ba4d02328d

wohlner commented 4 years ago

To @f1l1b0x's point, does the B validate all segments on the public network before live streaming playback? I don't want recording to inadvertently mess with the verification workflow. If we are bypassing verification, let's be conscious and upfront about that decision.

f1l1b0x commented 4 years ago

Yes, but only on the public network: the B will do a validation step for all segments before they air, to check that the transcoding job has been done and the content has not been tampered with. We are in the final steps of finalizing that no-reference validation.

iameli commented 4 years ago

Yeah — I mentioned the existing behavior of -failObjectStore because that's how it works right now but I think pretty much all these features are designed for the B. Having first-class support for recording on Os would be kind of sketchy.

iameli commented 4 years ago

Responding to my own question regarding how we handle recording across broadcaster switches...

Is there a node-first answer to this question, or must it be handled on the infrastructural side?

We could append a timestamp to the path where we send our streams, e.g. we push to /{manifestId}-{unix timestamp}.m3u8 and /{manifestId}-{unix timestamp}/720p/165.ts. Then even if there are broadcaster shifts, it's a simple matter for someone to parse all the m3u8 files for a given manifest ID and come out with a contiguous stream of segments.

iameli commented 4 years ago

We could even have a livepeer -osPlayback {osUrl} mode to enumerate and recombine the manifests, serving out a consistent HLS stream on the 8935 HTTP server. Boom, node-first VoD.

j0sh commented 4 years ago

We should also support providing the key as an inline JSON blob

My guess is there might be URL-library issues if plonking the blob directly within the password field. Will probably need to be escaped at the very least; hopefully we won't also encounter length issues.

the O uploading the files might introduce a security/trust issue in the public network since the B will want to validate before something airs

does the B validate all segments on the public network before live streaming playback? I don't want recording to inadvertently mess with the verification workflow.

@wohlner @f1l1b0x Uploading results to OS doesn't necessarily mean it'll get inserted into a playlist right away. If verification is enabled and fails for a given segment, then the segment gets retried until the policy retry cap is hit and/or verification passes. In fact, having segments that failed verification be in persistent storage may be useful for later analysis.

Recording a segment will only occur if transcoding succeeds - which includes the verification step, if enabled - so there isn't a conflict there.

The behavior of streams with -recordObjectStore enabled is as follows:

I would also add the following condition:

If recording is enabled, but no recording OS specified, then use the default OS. Maybe under a default prefix, eg "/recordings/".

This avoids additional processing in the default case; the -recordObjectStore would only be needed if the recording OS (or path) differs from the default. This really just highlights that recording and OS are decoupled features; recording needs a spec of its own.

My guess is that we'll need to have a couple other flags around recording anyway (eg, to specify the recording format and / or output file name) and using those will enable recording.

Probably good to have this type of scheme for the fail upload path as well. My guess is the fail upload will mostly be useful if the primary store is non-persistent (eg, filesystem that clears segments after each job).

After all uploads for a segment complete, push new m3u8 manifests for each rendition + a primary m3u8 manifest that references all the renditions. These manifests should contain all the segments transcoded in the stream, not just the most recent ones.

Again, this probably really belongs in a separate spec for the recording feature, rather than OS, but in short:

This is problematic with storage systems that exhibit eventual consistency (eg, S3), where updates may still return stale reads for some time. So we can't really expect the full manifest to be immediately available after each segment. (In the current iteration of the filesystem storage driver, we avoid writing manifests to external OS for this reason.) Not to mention that appending to a non-windowed live HLS playlist still means rewriting the entire playlist from top to bottom, which is asymptotically horrible without serious tweaking.

The easy way around this would be to just write a manifest once after the job terminates, as the recording is being finalized.

Should we "load" existing segments from a manifest if we find it in the OS when we start handling a new stream?

Do you mean, should we append to existing playlists?

That is a separate product question, but a lot of these issues have been raised earlier in issues such as https://github.com/livepeer/go-livepeer/issues/869 - might be good to narrow down the discussion there.

How should we handle non-MPEG-TS output? Uploading a bunch of MP4 files somewhere sounds fine, but what about manifests?

This is another recording-specific issue, but MP4s and non-segmented formats pose some specific challenges (and we have users that are already using MP4; the existing recorder outputs MP4).

Users expect MP4s to be concatenated and we can't efficiently do that coming from segmented sources. Even with fmp4 and something like the AWS multipart upload API (does an equivalent exist for other cloud OS services?), there is still a "completion" step where the upload is finalized.

There's a few ways this finalization issue is going to present itself:

One way around this is to have a "fixup" step at startup. For MP4s in pull mode, we could inspect the (most likely, filesystem) OS at node startup and attempt to regenerate a recording from leftover segments that weren't correctly cleaned up.

For broadcasters with non-persistent volumes or kube-style workloads that bounce around, an external OS might be necessary. But there are a lot of other gotchas here with respect to reconstructing that state - don't have any firm suggestions right now.

iameli commented 4 years ago

Reading through @j0sh's comments, I should clarify that for all of these use cases I was imagining the use of an external OS.

Again, this probably really belongs in a separate spec for the recording feature, rather than OS

True. I wanted to have a single spec as a starting point to serve as the path forward for all the tickets and PRs I listed, but once we get to a vague consensus as to what's a good idea I can break this down into a variety of smaller specs and tickets.

This is problematic with storage systems that exhibit eventual consistency (eg, S3), where updates may still return stale reads for some time. So we can't really expect the full manifest to be immediately available after each segment. (In the current iteration of the filesystem storage driver, we avoid writing manifests to external OS for this reason.)

That's okay — nobody's expecting to be able to play back from this OS instantly and get good results. (In the workflow I'm imagining, anyway.) VOD playback can come in later after the consistency has had time to settle. For the recording feature, we just need an archive.

Not to mention that appending to a non-windowed live HLS playlist still means rewriting the entire playlist from top to bottom, which is asymptotically horrible without serious tweaking.

Horrible why? So the manifests are this over and over...

#EXTINF:2.000,
/stream/cf082c5f-2c25-4be5-9c33-6719601b567b/480p/3081.ts

That's 72 bytes × 1800 segments per hour × 5 renditions × (let's say) a 48-hour stream ≈ 31 megabytes of manifest, re-pushed with every 2-second segment, i.e. roughly 16 megabytes per second of manifest pushing to keep everything updated at that point. Okay, yeah. That's horrible. Let's figure something else out.

The easy way around this would be to just write a manifest once after the job terminates, as the recording is being finalized.

We just lost the VODs on every server where a network connection drops or the kernel panics. 😞

How about we upload a series of timestamped manifests, resetting periodically? Manifests could get pushed to {os}/record/{rendition}/{manifest id}-{timestamp}.m3u8. If I run with -recordManifestSegments 900, we roll over to a fresh manifest every 30 minutes. Using the same assumptions as above, that'd be ~64k per manifest, which is not too bad to push every segment. Then the "finalization" step consists of enumerating and combining all of the timestamped manifests, simple enough.
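A tiny sketch of how that rollover key could be computed; flag semantics aside, everything here (function and parameter names, the s3://testbucket base) is illustrative:

package main

import "fmt"

// manifestKey returns the OS path of the rolling manifest that segment
// seqNo falls into. With 900 two-second segments per manifest, windows
// start at 0s, 1800s, 3600s, ... matching the scheme described above.
func manifestKey(osBase, rendition, manifestID string, seqNo, segsPerManifest, segDurSecs int) string {
    windowStartSecs := (seqNo / segsPerManifest) * segsPerManifest * segDurSecs
    return fmt.Sprintf("%s/record/%s/%s-%d.m3u8", osBase, rendition, manifestID, windowStartSecs)
}

func main() {
    // Segment 950 of a 2s-segment stream lands in the second window.
    fmt.Println(manifestKey("s3://testbucket", "720p", "foo", 950, 900, 2))
    // Prints: s3://testbucket/record/720p/foo-1800.m3u8
}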

If recording is enabled, but no recording OS specified, then use the default OS. Maybe under a default prefix, eg "/recordings/".

Sounds great. Maybe /vods.

Do you mean, should we append to existing playlists? That is a separate product question, but a lot of these issues have been raised earlier in issues such as #869

I hadn't seen https://github.com/livepeer/go-livepeer/issues/869, awesome! I'll read through that discussion and factor out full-playlist related issues over there.

j0sh commented 4 years ago

nobody's expecting to be able to play back from this OS instantly and get good results.

This might still be an issue if we need to reconstruct a final playlist during the finalization step, especially because the settlement period is rather indefinite.

We just lost the VODs on every server where a network connection drops or the kernel panics.

Not necessarily - see the later notes about having a "fixup" step. We'll have the segments somewhere, and can reconstruct some semblance of sequentiality from there.

But when multiple broadcasters are involved, I'm also concerned about ordering inconsistencies - timestamps aren't always reliable (whether wall clock timestamps, or taken from within the stream). Sequence numbers might work as a preliminary heuristic, but I don't know if these are always reliable coming from Mist (and was kind of hoping to get rid of client-supplied sequence numbers at some point).

Note that this becomes a lot easier if Livepeer RTMP ingest or node-first is used, because the stream is a lot less likely to be bouncing around nodes. Otherwise it's going to take a while to fully work out the kinks here.

How about we upload a series of timestamped manifests, resetting periodically ... we roll over to a fresh manifest every 30 minutes

Hmm, if we're overwriting an existing manifest within that 30-minute span, then what does rolling over buy us?

One concern I have here (and in general, with non-segmented formats such as MP4) is that this will leave a lot of intermediate file clutter. That should be cleaned up. For external OS, that also means incurring a bunch of priced requests, egress, delete permissions, etc. All that will need to be carefully managed.

iameli commented 4 years ago

Sequence numbers might work as a preliminary heuristic, but I don't know if these are always reliable coming from Mist (and was kind of hoping to get rid of client-supplied sequence numbers at some point).

They are over the scope of a single retryable RTMP stream. Good enough to get started.

How about we upload a series of timestamped manifests, resetting periodically ... we roll over to a fresh manifest every 30 minutes

Hmm, if we're overwriting an existing manifest within that 30-minute span, then what does rolling over buy us?

Well, these are full #869 manifests, and we refresh them with every segment, so that means after 48 hours we'd be pushing ~32 megs of manifest every two seconds. So it buys us a lot of bandwidth.

The workflow works like this:

Each broadcaster uploads to /{manifest id}-{timestamp}-{nonce per broadcaster}.m3u8. The timestamp resets every half an hour on every broadcaster. Let's say for whatever reason there's a lot of retrying and manifestId foo is bouncing around between three broadcasters or something. After 90 minutes, we might have files that look like this:

foo-0-123.m3u8 # broadcaster A
foo-0-456.m3u8 # broadcaster B
foo-0-789.m3u8 # broadcaster C
foo-1800-123.m3u8 # broadcaster A
foo-1800-456.m3u8 # broadcaster B
foo-1800-789.m3u8 # broadcaster C
foo-3600-123.m3u8 # broadcaster A
foo-3600-456.m3u8 # broadcaster B
foo-3600-789.m3u8 # broadcaster C

Maybe there should be leading zeroes in there for lexical ordering, but that's the idea. Then anyone looking at the OS that wants to play back foo can enumerate these and combine 'em.
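To make the enumerate-and-combine step concrete, here's a small sketch that orders discovered manifest names for playback; the actual manifest parsing and merging is elided, and all names are illustrative:

package main

import (
    "fmt"
    "sort"
    "strconv"
    "strings"
)

type manifestRef struct {
    name      string
    timestamp int // window start, seconds
    nonce     string
}

// orderManifests sorts the discovered manifests for one manifest ID into
// playback order: by window timestamp, with the per-broadcaster nonce as a
// tiebreaker so the ordering is at least deterministic. Splitting from the
// right means a manifest ID containing dashes still parses.
func orderManifests(names []string) []manifestRef {
    refs := make([]manifestRef, 0, len(names))
    for _, n := range names {
        parts := strings.Split(strings.TrimSuffix(n, ".m3u8"), "-")
        if len(parts) < 3 {
            continue // not one of ours
        }
        ts, err := strconv.Atoi(parts[len(parts)-2])
        if err != nil {
            continue
        }
        refs = append(refs, manifestRef{n, ts, parts[len(parts)-1]})
    }
    sort.Slice(refs, func(i, j int) bool {
        if refs[i].timestamp != refs[j].timestamp {
            return refs[i].timestamp < refs[j].timestamp
        }
        return refs[i].nonce < refs[j].nonce
    })
    return refs
}

func main() {
    for _, r := range orderManifests([]string{
        "foo-1800-456.m3u8", "foo-0-123.m3u8", "foo-3600-789.m3u8",
    }) {
        fmt.Println(r.name)
    }
}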

One concern I have here (and in general, with non-segmented formats such as MP4) is that this will leave a lot of intermediate file clutter. That should be cleaned up. For external OS, that also means incurring a bunch of priced requests, egress, delete permissions, etc. All that will need to be carefully managed.

Not a lot of clutter. In the worst-case example I just gave, that's nine manifest files that correspond to 2700 segment files! Not to mention they're tiny compared to the video — this is an acceptable amount of metadata to facilitate what we're discussing here.

iameli commented 4 years ago

Refactored out the OS syntax proposal to https://github.com/livepeer/go-livepeer/issues/1572, please take discussion for that issue over there ✌️

j0sh commented 4 years ago

They are over the scope of a single retryable RTMP stream.

Couple cases I can think of where segment numbers aren't exactly reliable:

For direct RTMP ingest:

For HTTP push or node-first ingest:

Note that for orchestrators using broadcaster-supplied object storage, we have them write into a random prefix in order to prevent front-running or overwriting other transcoders. The scheme is something like this:

/manifestID/nonce/rendition

We might want multiple broadcasters to do something similar, especially with cloud storage. This allows us to distinguish segments on the basis of upload time or whatever, and use that as another heuristic in addition to the sequence number.

Not a lot of clutter.

Anytime the user wants a concatenated file - eg, a proper MP4, which people are already using - or something byte-range addressable (because single blobs are a lot easier to manage), then we'll have a lot of intermediate files to clean up after. It's unavoidable to have intermediate files somewhere, but it feels a bit weird to be using cloud storage as that intermediate layer. But maybe unavoidable for now.

Then anyone looking at the OS that wants to play back foo can enumerate these and combine 'em.

With eventual consistency the settlement period is indefinite. And because it's indefinite, we can't really be sure when the final version of the manifest(s) would be ready. And we still need a finalization step anyway to combine everything.

All this is leading me to think that the "fixup" approach is the way to go here. The first version of an object will be available immediately after writing completes for the first time, but we don't know when the final version of an object settles.

Rather than try to gather manifests and take a guess at when they'll be finalized, just gather a list of the segments that are available, and build a playlist from that. Segments only need to be written once. Again, there are API and egress costs with cloud storage which makes the whole thing a bit non-ideal, but at least it'll work reliably without fundamentally unresolvable edge cases.
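Here's a sketch of that segment-first fixup, assuming we've already listed the surviving segments and can recover a sequence number and duration for each (duration handling comes up later in the thread); struct and function names are illustrative:

package main

import (
    "fmt"
    "math"
    "sort"
    "strings"
)

type segmentInfo struct {
    URI      string
    SeqNo    int
    Duration float64 // seconds, recovered from OS metadata or probing
}

// buildPlaylist rebuilds a VOD-style HLS playlist from whatever segments
// survived in the OS, ordered by sequence number.
func buildPlaylist(segs []segmentInfo) string {
    sort.Slice(segs, func(i, j int) bool { return segs[i].SeqNo < segs[j].SeqNo })
    maxDur := 0.0
    for _, s := range segs {
        if s.Duration > maxDur {
            maxDur = s.Duration
        }
    }
    var b strings.Builder
    b.WriteString("#EXTM3U\n#EXT-X-VERSION:3\n#EXT-X-PLAYLIST-TYPE:VOD\n")
    fmt.Fprintf(&b, "#EXT-X-TARGETDURATION:%d\n", int(math.Ceil(maxDur)))
    for _, s := range segs {
        fmt.Fprintf(&b, "#EXTINF:%.3f,\n%s\n", s.Duration, s.URI)
    }
    b.WriteString("#EXT-X-ENDLIST\n")
    return b.String()
}

func main() {
    fmt.Print(buildPlaylist([]segmentInfo{
        {URI: "720p/1.ts", SeqNo: 1, Duration: 2.0},
        {URI: "720p/0.ts", SeqNo: 0, Duration: 2.0},
    }))
}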

Maybe /vods

Slight preference for /recordings still, since in my head a VOD is the thing that is being transcoded, while a recording is the output of a VOD or a live transcode. (But I get how we could be producing a VOD of a live transcode.)

darkdarkdragon commented 4 years ago

RTMP stream disconnects, and reconnects to Node B.

It will be a new 'recording session'.

Node A receives a segment from the source. Takes a long time to complete. Source decides to retry with node B. Node A eventually completes.

Node A shouldn't write anything to the OS in that case.

darkdarkdragon commented 4 years ago

Also, for HTTP push it is not clear when to do the finalisation step.

iameli commented 4 years ago

They are over the scope of a single retryable RTMP stream.

Couple cases I can think of where segment numbers aren't exactly reliable:

I was referring to Mist's behavior from a single server, but point taken.

  • The same segment from both node A and node B are written. Which one to choose?

In that case, both overwriting and not overwriting are completely acceptable; it oughta be the same content.

I certainly get that there are a lot of cases where sequence number can't be relied on. We've got folks using Livepeer for VOD transcoding where they only ever do one file per stream and whatnot. But we do have a case, with MistProcLivepeer, where we generally can rely on sequence numbers, and I'd like recording to work in that instance.

Note that for orchestrators using broadcaster-supplied object storage, we have them write into a random prefix [...] We might want multiple broadcasters to do something similar, especially with cloud storage. This allows us to distinguish segments on the basis of upload time or whatever, and use that as another heuristic in addition to the sequence number.

I'd be cool with that — could be very similar to the naming scheme I proposed for manifests, something like /{manifest id}-{sequence number}-{timestamp}-{nonce per broadcaster}.m3u8. Or a timestamp with sufficient resolution that collisions are unlikely.

All this is leading me to think that the "fixup" approach is the way to go here. The first version of an object will be available immediately after writing completes for the first time, but we don't know when the final version of an object settles.

Rather than try to gather manifests and take a guess at when they'll be finalized, just gather a list of the segments that are available, and build a playlist from that.

Yeah, this feels like the most robust solution. Actually, if segments themselves contain timestamps, then "fixup" could be reading through the existing segments and building a manifest for "now". If there are more segments later, could use the previous manifest as a cache and add the new segments.

How do we handle segment metadata? Specifically, duration? That's really the only piece of information that I wanted from a manifest.

Slight preference for /recordings still, since in my head a VOD is the thing that is being transcoded, while a recording is the output of a VOD or a live transcode. (But I get how we could be producing a VOD of a live transcode.)

I'm cool with /recordings

iameli commented 4 years ago

So... the manifest/metadata/finalization conversation is ongoing, but we've got a couple different workable ideas. As a starting point, everyone cool with doing the following?

  • No change to OS syntax yet, but we do add a -record=true parameter.
  • With -record=true, segments get uploaded to the record OS and saved in a subdirectory, like /{manifest id}/recordings/{rendition name}/.

That's it. Once that exists and works we can revisit how and when to convert that big mess of segments into something more coherent. @j0sh @darkdarkdragon good starting point?

j0sh commented 4 years ago

RTMP stream disconnects, and reconnects to Node B. It will be new 'recording session'

@darkdarkdragon And therein lies the problem. It's one recording session, but two nodes think they are responsible for recording it.

Also, for HTTP push it is not clear when to do finalisation step.

Naively, finalize when the session expires. But with both RTMP and HTTP push, there's still an issue with finalization if multiple nodes end up handling the stream at some point. Do we take the hit of potentially "finalizing" multiple times?

It's better to have a single source of ground truth with a separate recording service.

Node A shouldn't write anything to the OS in that case.

We can't say that for certain. Point is, these things will happen with distributed systems; we can't expect zero overlap in sequence numbering.

Actually, if segments themselves contain timestamps

@iameli Probably not necessary because the OS metadata itself will have a last modified date or some equivalent information.

How do we handle segment metadata? Specifically, duration?

OS metadata. For filesystem storage, that's probing. For S3, it's the x-amz-meta header (HEAD request). This is part of why fixups are non-ideal with cloud storage.
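To make the x-amz-meta idea concrete, here's a minimal sketch using the AWS SDK for Go v2; the bucket, key, and "duration" metadata key are placeholders rather than an existing go-livepeer convention. Note that it's one HEAD request per segment, which is exactly the per-request cloud cost being described:

package main

import (
    "bytes"
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)
    segment := []byte{ /* ... MPEG-TS bytes ... */ }

    // Upload a segment, stashing its duration as x-amz-meta-duration.
    _, err = client.PutObject(ctx, &s3.PutObjectInput{
        Bucket:   aws.String("testbucket"),
        Key:      aws.String("foo/720p/165.ts"),
        Body:     bytes.NewReader(segment),
        Metadata: map[string]string{"duration": "2.000"},
    })
    if err != nil {
        log.Fatal(err)
    }

    // Later, during fixup: a HEAD request recovers the duration without
    // downloading the segment body.
    head, err := client.HeadObject(ctx, &s3.HeadObjectInput{
        Bucket: aws.String("testbucket"),
        Key:    aws.String("foo/720p/165.ts"),
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("duration:", head.Metadata["duration"])
}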

All that just points towards fixups being a last resort, and I think we should work towards having a single source of ground truth for recording, eg a separate service.

No change to OS syntax yet, but we do add a -record=true parameter

Record... what? MP4? HLS? Different users want different things. MP4 already works and is in use. At the very least, we'll need to indicate the desired format. I've also had requests to exclude certain renditions. All this needs to be spec'd out.

and saved in a subdirectory, like /{manifest id}/recordings/{rendition name}/

This needs to be thought through some more and spec'd out further, because if we're already using persistent object storage, then the /recordings/ directory is probably only needed for the finalized artifacts. (And I was thinking it'd be a top-level directory, rather than a subdir under /{manifest id}, which gets wiped out after each session in the current filesystem OS implementation to remove intermediate files. Otherwise we have to special-case a bunch of things in a way that doesn't really compose well.)

Otherwise we're incurring a double store to basically the same location as compared to the existing OS behavior. Eg, I have an S3 OS for the primary store. Do I need to re-upload each segment to a /recording subdir [1]? Or if I'm recording, could the recording endpoint also be used as the primary OS for the transcoding flow? (Basically, the only time a separate recording OS makes sense is if the destination differs from the one used for the transcoding flow. Even then, there are concerns around intermediate file clutter I think we can avoid.)

[1] TBH, doing some sort of "move" operation is probably still necessary, especially since orchestrators won't always upload to the most straightforward paths, but is it essential to do that in the middle of the transcoding flow? Mid-flow adjustments might make sense for a recording service, but less so for a broadcast node trying to maximize transcoding throughput. Or is that something that could wait until the finalization phase, especially for concatenated formats?

darkdarkdragon commented 4 years ago

@j0sh

RTMP stream disconnects, and reconnects to Node B. It will be new 'recording session'

@darkdarkdragon And therein lies problem. It's a recording session but two nodes think they are responsible for recording.

What problem? The moment RTMP disconnects is the 'finalisation' moment; no more recording for Node A. Node B starts from scratch, and what Node B records will be a separate piece of video (a VOD .m3u8 playlist, at the moment).

Otherwise we're incurring a double store to basically the same location as compared to the existing OS behaviour.

Yes. The whole 'record' mode was brought up precisely to avoid dealing with the full flow of using an external OS in the transcoding process.

Record... what? MP4? HLS?

Just record everything passing through the node, without any transformations. Putting transformations/finalisations into the broadcaster node will hurt performance (if, say, the node does 'cleanup' on startup, that means that after a crash the node will not be able to restart quickly).

j0sh commented 4 years ago

The moment RTMP disconnects - it is 'finalisation' moment, no more recording for Node A. Node B starts from scratch

What if node A doesn't cleanly terminate? What if there's a transient break in the connection, and the client reconnects, and the load balancer redirects the stream to node B? The user is just expecting one stream from us. Same issue if HTTP push were to use multiple broadcasters.

Yes. The whole 'record' mode was brought up precisely to avoid dealing with the full flow of using an external OS ... Just record everything passing through the node, without any transformations

This is where the misunderstanding is, I think because this spec is really incomplete.

Right now, there is a recording mode that will produce MP4 output as part of the pull branch. This works well, and node-first users are using it, including someone in production. (No, I didn't recommend that, but they were in a rush. And yes, it is something that has been specifically requested by several users.)

Here, there's no mention of the actual recording mode(s), for example. Only a subtext that it's just HLS. But that's clearly insufficient.

If the goal was just to "pass everything through the node and into OS" then we already have plain object storage support. But that's clearly not the only goal. Most of the issues here revolve around finalizing that result - into either a playlist or a concatenated file.

Putting transformations/finalisations into broadcaster node will hurt performance

I agree, which is why for the hosted service, we should run the node as a non-transcoding "recording service". Described a number of approaches here https://github.com/livepeer/go-livepeer/issues/869#issuecomment-657797072

wohlner commented 4 years ago

UGC launch requirements

There has been a little bit of confusion and evolution about the requirements for the recording feature for the UGC launch. I know some of this is repetition from Discord and video chats, but to be super explicit, these are the requirements:

Post launch, not now

Post launch, we will need to continue iterating on the recording feature. Some ideas we will want to prioritize after launch are: