Consider refining "3.5.8 Coded Frame Processing" Step 1.14 part 2 for multiple frame per MediaSample #269

w3c / media-source

Media Source Extensions

https://w3c.github.io/media-source/

Closed aToshioOgasawara closed 3 years ago

aToshioOgasawara commented 3 years ago

I'm encountering a MediaSample leak issue on WebKit, which was filed as a WebKit Bug ticket Bz222448.

When overlapping MediaSamples each contain multiple frames in DTS order and either Condition A) or Condition B) below holds, the overlapped MediaSamples are not deleted; they are leaked.

Condition A): A MediaSample's PTS + its duration exceeds the next MediaSample's PTS
Condition B): The I-Frame start position of a newly appended MediaSample doesn't match the I-Frame start position of a MediaSample which is already buffered.

I think the issue comes from a mismatch: a MediaSample is assumed to be able to hold multiple frames, but the eviction process seemingly expects each MediaSample to correspond to one frame. To make the SourceBuffer eviction algorithm more general, the eviction process should check the end of the MediaSample instead of its presentation timestamp (PTS).

The patch attached to Bz222448 changes the description of "3.5.8 Coded Frame Processing" Step 1.14 part 2.

Current: If highest presentation timestamp for track buffer is set and less than or equal to presentation timestamp
Updated: If highest presentation timestamp for track buffer is set and less than frame end timestamp

The change is that "or equal to presentation" is replaced with "frame end".

I received the following advice in Bz222448 from an Apple engineer:

> It looks like we do correctly set "highestPresentationTimestamp" to the "frame end timestamp" later in step 1.19, but I do wonder if there were other changes to the specification around "frame end timestamp" that may have been missed. So your proposed change would be a willful departure from the text of the specification. Should this be brought up to the MSE spec authors first? Or is there another way to solve the issue you're attempting to fix?

I think the description change is necessary to generalize "3.5.8 Coded Frame Processing", though other parts of the spec might be affected by it. I'd like to discuss whether this change is possible while keeping compatibility with the relevant MSE specifications.

Reference:

Bz222448 [MSE] Overlapping MediaSamples are not deleted
"3.5.8 Coded Frame Processing" Step 1.14 part2

aToshioOgasawara commented 3 years ago

Sample URL: https://tama.tok.access-company.com/public/WebKit/sample00

In this sample, 0s-6s video's MediaSamples overlap.

wolenetz commented 3 years ago

From MSE spec point of view, can you describe a "MediaSample"? Is it a coded frame? Multiple coded frames? An encoded GOP? I just want to make certain of terminology being used so I can fully understand the question.

aToshioOgasawara commented 3 years ago

> From MSE spec point of view, can you describe a "MediaSample"? Is it a coded frame? Multiple coded frames? An encoded GOP? I just want to make certain of terminology being used so I can fully understand the question.

"MediaSample" is a coded frame. My understanding is that a coded frame can have multiple samples.

wolenetz commented 3 years ago

This is all rather complicated, so let's begin by simplifying terminology scope, hopefully: "MediaSample" is not an MSE term. Especially in the context of the coded frame processing algorithm, MSE uses "coded frame" which has a defined PTS, DTS, duration.

Current spec text for 3.5.8 step 14 differs from both, above:

> If highest end timestamp for track buffer is not set:...
> If highest end timestamp for track buffer is set and less than or equal to presentation timestamp:...

Does this actual current spec text help resolve the problem for you?

aToshioOgasawara commented 3 years ago

Thank you for your advice. From now on, when I ask MSE spec questions, I will use "coded frame" instead of "MediaSample".

The point of this issue is the "Coded Frame Duration" of the video.

> Coded Frame Duration: The duration of a coded frame. For video and text, the duration indicates how long the video frame or text should be displayed. For audio, the duration represents the sum of all the samples contained within the coded frame. For example, if an audio frame contained 441 samples @44100Hz the frame duration would be 10 milliseconds.

For example, suppose a "Coded Frame" is composed of frames whose PTS values are not in ascending order, as shown below.

Coded Frame A):
frame00 PTS:0 DTS:0 Duration:1
frame01 PTS:3 DTS:1 Duration:1
frame02 PTS:1 DTS:2 Duration:1
frame03 PTS:2 DTS:3 Duration:1
frame04 PTS:6 DTS:4 Duration:1

Coded Frame B):
frame05 PTS:4 DTS:5 Duration:1
frame06 PTS:5 DTS:6 Duration:1
frame07 PTS:9 DTS:7 Duration:1
frame08 PTS:7 DTS:8 Duration:1
frame09 PTS:8 DTS:9 Duration:1

Since the "Coded Frame Duration" of video is "how long the video frame should be displayed", I calculated it as follows.

Coded Frame Duration = Largest PTS - Earliest PTS + Largest PTS Duration.

Calculate the "Coded Frame Duration" of Coded Frame A).

Largest PTS = 6
Earliest PTS = 0
Largest PTS Duration = 1
Coded Frame Duration = 6 - 0 + 1 = 7

Is there an issue with this "Coded Frame Duration" calculation?
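The calculation above can be sketched in a few lines of Python. This is only an illustration of the formula as written in this comment, which treats each listed set of frames as one unit; the `Frame` class and `group_span` function are hypothetical names, not anything from the MSE spec or WebKit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    name: str
    pts: int
    dts: int
    duration: int


# Frames exactly as listed in this comment, in DTS order.
GROUP_A = [
    Frame("frame00", 0, 0, 1),
    Frame("frame01", 3, 1, 1),
    Frame("frame02", 1, 2, 1),
    Frame("frame03", 2, 3, 1),
    Frame("frame04", 6, 4, 1),
]
GROUP_B = [
    Frame("frame05", 4, 5, 1),
    Frame("frame06", 5, 6, 1),
    Frame("frame07", 9, 7, 1),
    Frame("frame08", 7, 8, 1),
    Frame("frame09", 8, 9, 1),
]


def group_span(frames):
    """Largest PTS - earliest PTS + duration of the frame with the largest PTS."""
    earliest = min(frames, key=lambda f: f.pts)
    latest = max(frames, key=lambda f: f.pts)
    return latest.pts - earliest.pts + latest.duration


print(group_span(GROUP_A))  # 7
print(group_span(GROUP_B))  # 6
```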

aToshioOgasawara commented 3 years ago

> If highest end timestamp for track buffer is not set:...
> If highest end timestamp for track buffer is set and less than or equal to presentation timestamp:...

The PTS, DTS, and Duration of "Coded Frame A)" and "Coded Frame B)" are calculated as follows.
Coded Frame A): PTS:0 DTS:0 Duration:7
Coded Frame B): PTS:4 DTS:5 Duration:6

The value of "highest end timestamp for track buffer" when processing Coded Frame B) is the "PTS + Duration = 7" of "Coded Frame A)". In this case, the "highest end timestamp for track buffer" is greater than the PTS of "Coded Frame B)", so the condition is not met and the overlapped coded frames will not be removed.

I think the following description change is necessary to generalize "3.5.8 Coded Frame Processing".

Current: If highest end timestamp for track buffer is set and less than or equal to presentation timestamp
Updated: If highest end timestamp for track buffer is set and less than frame end timestamp
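As a rough sketch of the difference between the two wordings, the snippet below evaluates both conditions with the aggregate values from this comment (highest end timestamp 7 after "Coded Frame A)", and PTS 4, Duration 6 for "Coded Frame B)"). The function names are hypothetical and only the condition is modeled, not the removal that follows it.

```python
def current_condition(highest_end, pts, duration):
    """Current text: highest end timestamp is set and <= presentation timestamp."""
    return highest_end is not None and highest_end <= pts


def proposed_condition(highest_end, pts, duration):
    """Proposed text: highest end timestamp is set and < frame end timestamp."""
    frame_end = pts + duration
    return highest_end is not None and highest_end < frame_end


# Values taken from the comment above.
highest_end_after_a = 7
b_pts, b_duration = 4, 6

print(current_condition(highest_end_after_a, b_pts, b_duration))   # False
print(proposed_condition(highest_end_after_a, b_pts, b_duration))  # True
```

With these inputs the current wording does not trigger the removal branch, while the proposed wording does (7 < 10).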

wolenetz commented 3 years ago

@https://github.com/w3c/media-source/issues/269#issuecomment-806469652 I'm a bit confused. A coded frame has a single PTS. It sounds to me like the described coded frames "A" and "B" are actually groups of coded frames, "group A" and "group B". In this regard, the individual coded frames within each of groups A and B appear interleaved in PTS in some cases in the example. To help me understand the change being requested:

1) Does the following describe the sequences of coded frames in each of the two groups? All of Group A can be decoded coherently in order by DTS, if buffered in isolation of everything else (e.g., appendBuffer(init segment + bytestream segment containing just Group A) followed by seeking to the buffered range start of the result and playing from there). Likewise for Group B. None of A requires any of B to be able to be decoded and played; and the same for B.
2) Which coded frames within the example groups are random-access-points?
3) If describing an ISOBMFF bytestream containing these coded frames, which kind of SAP Type are being used? Note that MSE only supports SAP Types 1 and 2.
4) Are all of the frames that are random-access-points (question 2, above) independently and fully decodable without any other information than what might be in the initialization segment previously appended? (e.g., are they keyframes)?

I know I am being a bit pedantic here, but given the potential for confusion in terminology and how the algorithm applies, I would like more detail, and I appreciate your patience :) (and my apologies for the delay in this response).

aToshioOgasawara commented 3 years ago

Thank you for your comment.

> 1. Does the following describe the sequences of coded frames in each of the two groups? All of Group A can be decoded coherently in order by DTS, if buffered in isolation of everything else (e.g., appendBuffer(init segment + bytestream segment containing just Group A) followed by seeking to the buffered range start of the result and playing from there). Likewise for Group B. None of A requires any of B to be able to be decoded and played; and the same for B.

frame00 is the I-frame; it is followed by P-frames and B-frames (details under Question 2). The P-frames and B-frames require other frames in order to be decoded.

> 2. Which coded frames within the example groups are random-access-points?

Only frame00 in Coded Frame A) (PTS:0 DTS:0 Duration:1) is a random access point. None of the other frames is a random access point.

Coded Frame A):
frame00 PTS:0 DTS:0 Duration:1 I-frame
frame01 PTS:3 DTS:1 Duration:1 P-frame
frame02 PTS:1 DTS:2 Duration:1 B-frame
frame03 PTS:2 DTS:3 Duration:1 B-frame
frame04 PTS:6 DTS:4 Duration:1 P-frame

Coded Frame B):
frame05 PTS:4 DTS:5 Duration:1 B-frame
frame06 PTS:5 DTS:6 Duration:1 B-frame
frame07 PTS:9 DTS:7 Duration:1 P-frame
frame08 PTS:7 DTS:8 Duration:1 B-frame
frame09 PTS:8 DTS:9 Duration:1 B-frame

> 3. If describing an ISOBMFF bytestream containing these coded frames, which kind of SAP Type are being used? Note that MSE only supports SAP Types 1 and 2.

I am using SAP type 1.

> 4. Are all of the frames that are random-access-points (question 2, above) independently and fully decodable without any other information than what might be in the initialization segment previously appended? (e.g., are they keyframes)?

Yes. frame00 is independently and fully decodable; it is a keyframe.
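For comparison, here is a hedged trace of what happens if each of the ten frames listed above is processed as its own coded frame, in decode order. It assumes the removal ranges as I read them in the current spec text: when highest end timestamp is unset, the window is [presentation timestamp, frame end timestamp); when it is set and less than or equal to the presentation timestamp, the window is [highest end timestamp, frame end timestamp); and highest end timestamp is raised to the frame end timestamp afterwards when that is larger. The function name and tuple layout are hypothetical, and dependency removal is not modeled.

```python
# (name, pts, dts, duration) in decode order, copied from the lists above.
FRAMES = [
    ("frame00", 0, 0, 1),
    ("frame01", 3, 1, 1),
    ("frame02", 1, 2, 1),
    ("frame03", 2, 3, 1),
    ("frame04", 6, 4, 1),
    ("frame05", 4, 5, 1),
    ("frame06", 5, 6, 1),
    ("frame07", 9, 7, 1),
    ("frame08", 7, 8, 1),
    ("frame09", 8, 9, 1),
]


def removal_windows(frames):
    """Return (frame name, window start, window end) for each frame that
    triggers an overlap-removal window under the assumptions stated above."""
    highest_end = None
    windows = []
    for name, pts, _dts, duration in frames:
        frame_end = pts + duration
        if highest_end is None:
            windows.append((name, pts, frame_end))
        elif highest_end <= pts:
            windows.append((name, highest_end, frame_end))
        # Track the highest frame end timestamp seen so far.
        if highest_end is None or frame_end > highest_end:
            highest_end = frame_end
    return windows


print(removal_windows(FRAMES))
# [('frame00', 0, 1), ('frame01', 1, 4), ('frame04', 4, 7), ('frame07', 7, 10)]
```

Under these assumptions the windows are contiguous and together cover 0 to 10, so no gap is left when every frame is handled individually.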

aToshioOgasawara commented 3 years ago

I had misunderstood the coded frame spec.

https://bugs.webkit.org/show_bug.cgi?id=226481#c9

From the comments at the link above, I now understand that for video, multiple frames are not packaged in the same coded frame.

We made a mistake in our design for coded frames.

Thank you for your cooperation. I am closing this ticket.

wolenetz commented 3 years ago

@ https://github.com/w3c/media-source/issues/269#issuecomment-862134811 - yes, a coded frame == 1 frame, not a group of more than 1 frame. This is where terminology confusion happened, I think. If it helps, in Chrome MSE buffering, we keep a notion associated with each parsed and buffered coded frame of whether or not it is a keyframe, and each buffered keyframe has an optional sequence (in decode order as parsed from the appended bytestream and processed by the coded frame processing algorithm) of nonkeyframes. The per-track collection of these keyframes in presentation timestamp order (and any associated nonkeyframe sequences for each keyframe and their aggregate presentation interval) is the internal buffering abstraction. Some complications like handling SAP-type-2 groups of frames (where a nonkeyframe presentation time is prior to its keyframe's, and possibly other nonkeyframes prior to it in decode order) and overlapping append-removals make this non-trivial.
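A very loose sketch of the kind of abstraction described above: each keyframe owns a decode-ordered list of dependent nonkeyframes, and the per-track collection is kept in keyframe presentation order. This is not Chrome's code; the class and function names are hypothetical, and SAP-type-2 handling and overlapping append-removals are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class CodedFrame:
    pts: int
    dts: int
    duration: int
    is_keyframe: bool


@dataclass
class KeyframeGroup:
    keyframe: CodedFrame
    nonkeyframes: list = field(default_factory=list)  # kept in decode order

    def presentation_interval(self):
        """Aggregate [start, end) covered by the keyframe and its dependents."""
        frames = [self.keyframe, *self.nonkeyframes]
        start = min(f.pts for f in frames)
        end = max(f.pts + f.duration for f in frames)
        return start, end


def build_groups(frames_in_decode_order):
    groups = []
    for frame in frames_in_decode_order:
        if frame.is_keyframe:
            groups.append(KeyframeGroup(frame))
        elif groups:
            groups[-1].nonkeyframes.append(frame)
        # A nonkeyframe seen before any keyframe is dropped in this sketch.
    groups.sort(key=lambda g: g.keyframe.pts)
    return groups


# The ten frames from this thread; only frame00 is a keyframe.
pts_in_decode_order = [0, 3, 1, 2, 6, 4, 5, 9, 7, 8]
frames = [
    CodedFrame(pts=p, dts=d, duration=1, is_keyframe=(d == 0))
    for d, p in enumerate(pts_in_decode_order)
]
groups = build_groups(frames)
print(len(groups), groups[0].presentation_interval())  # 1 (0, 10)
```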