raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

Avoiding MMAL excessive buffering #496

Open ghost opened 9 years ago

ghost commented 9 years ago

It looks like you can just keep feeding input buffers to MMAL, and it will immediately return most of the input buffers, all while not outputting any frames. I managed to send 127 buffers to the h264 decoder, even though the input and output pools both have only 20 buffers, and I didn't feed it new output buffers (i.e. the decoder only got 20 output buffers in total). It must be buffering a lot of data internally.

I'd like to avoid feeding it that much, and would rather block a while until I get a new frame.

Is there a way to determine whether MMAL will output a new frame with the input buffers sent so far? MMAL would have to answer with either "yes" or "potentially needs more input data".

popcornmix commented 9 years ago

Ping @deborah-c as I've discussed this requirement before.

I think the answer is currently no, and it's not simple to achieve. The real issue is that codein (part of the video decode hardware) has a largish buffer (1-2MB), which for low-bitrate parts of a stream may contain many frames of video. We don't know for sure how many pictures are present until this buffer has been processed. We also don't know how much data is required to output a picture. With hierarchical B frames you may need to submit many frames before getting the first frame out.

I have a similar issue with kodi's use of mmal. I handle it by looking at the timestamps being submitted to mmal and the timestamps of pictures returned. If this difference gets large (e.g. 1 second) I stop submitting new frames until it has reduced.

See: https://github.com/popcornmix/xbmc/blob/newclock5/xbmc/cores/VideoPlayer/DVDCodecs/Video/MMALCodec.cpp#L897
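
A rough sketch of that timestamp-gap throttle on the ARM side (illustrative names only, not the actual kodi code linked above):

    #include <stdint.h>
    #include "interface/mmal/mmal.h"

    /* Sketch: stop feeding the decoder when the newest submitted PTS gets too
       far ahead of the newest returned PTS. MMAL pts values are microseconds. */
    #define MAX_PTS_GAP_US (1 * 1000 * 1000)   /* ~1 second, as suggested above */

    static int64_t latest_submitted_pts = MMAL_TIME_UNKNOWN;
    static int64_t latest_returned_pts  = MMAL_TIME_UNKNOWN;

    /* Call before sending another packet to the decoder's input port. */
    static int ok_to_submit_more_input(void)
    {
       if (latest_submitted_pts == MMAL_TIME_UNKNOWN ||
           latest_returned_pts  == MMAL_TIME_UNKNOWN)
          return 1;   /* nothing decoded yet: keep feeding */
       return (latest_submitted_pts - latest_returned_pts) < MAX_PTS_GAP_US;
    }

    /* Input path: record buffer->pts just before mmal_port_send_buffer().
       Output callback: update latest_returned_pts from the returned picture's pts. */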

ghost commented 9 years ago

Yes, I think the point is that the MMAL API user can't really make a good guess when it should stop or start sending input. My thinking is that this should be much easier for the decoder itself, as it knows whether it's actually processing something, or just waiting for new data. I'm hoping it's possible.

This would probably require new API. Alternatively, the input buffer sizes could reflect the size of the internal decoder's buffer (which would allow the user to estimate the buffering), but I bet this would be a complicated solution.

popcornmix commented 9 years ago

We'll see what @deborah-c says. We did discuss a new api to limit the number of "userdatas" (metadata attached to the encoded video packet) and so block when too many packets were inside the codec, but there are complications. The main one is deadlock if you request fewer userdatas than are required to output a picture (which with hierarchical B frames can be quite a large number).

popcornmix commented 8 years ago

I've implemented something that could help. video_decode has a number of callback structures (96) that get attached to packets submitted to the HW codec and get returned when a frame is decoded. Restricting the number of callback structures it can use has the effect of limiting the number of frames that can exist in the hardware codec.

The new parameter MMAL_PARAMETER_VIDEO_MAX_NUM_CALLBACKS can limit the number. Note: if set to too low a number (e.g. less than the number of pictures in the DPB) video_decode will stall.

Because you may not know the size of the DPB, you can set it to -1, which uses the DPB size; -2 will use one more than the DPB size. In testing a number of files, -3 seemed to be required. You may want to go higher for performance reasons.

See: https://github.com/popcornmix/xbmc/commit/9e8ab7da6474992462d716d0c84b8ab9bcb1cc48
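
Setting the parameter might look roughly like this (a sketch; applying it to the decoder's input port is an assumption here, and -3 simply follows the testing note above):

    #include "interface/mmal/mmal.h"
    #include "interface/mmal/util/mmal_util_params.h"

    /* Sketch: limit the number of userdata callback structures video_decode may
       use. Negative values are relative to the DPB size as described above
       (-1 = DPB size, -2 = DPB size + 1, ...). */
    static MMAL_STATUS_T limit_decoder_callbacks(MMAL_COMPONENT_T *decoder)
    {
       return mmal_port_parameter_set_int32(decoder->input[0],
                                            MMAL_PARAMETER_VIDEO_MAX_NUM_CALLBACKS,
                                            -3);
    }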

ghost commented 8 years ago

Sounds very interesting!

Why did you change the buffer parameters away from "recommended"?

popcornmix commented 8 years ago

There is an issue in kodi for live streams (e.g. PVR) where it attempts to estimate clock skew between back end and front end by monitoring its internal queue to the hw decoder. The ability for mmal to swallow a large number of packets caused it problems (often the queue would empty completely).

The recommended values are still recommended for most use cases. You want to keep the hw codec's buffers as full as possible to avoid ever underrunning. But having 20 x 80K buffers does mean that during a low-bitrate part of the stream you may quickly swallow 20 frames in the queue. Fewer, larger buffers don't allow that to happen (but may not utilise memory as efficiently).

In reality none of this should be required, and I'm sure could be fixed elsewhere in kodi, but sometimes just changing your behaviour to be more like everyone else's is the simpler option.

Still not sure if this will fix the issue (debugging remote user's PVR issues is not easy), or if the performance impact will be problematic (possibly I'll have to make this change live stream only in the future), but it's another option to play with if you don't like the existing behaviour.
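
For reference, deviating from the recommended values just means setting the port's own buffer fields before creating the pool; a minimal sketch, with purely illustrative numbers:

    #include "interface/mmal/mmal.h"

    /* Sketch: fewer, larger input buffers instead of the recommended values.
       Call after mmal_port_format_commit() and before creating the pool and
       enabling the port. The numbers are illustrative, not a recommendation. */
    static void configure_input_buffers(MMAL_PORT_T *input)
    {
       input->buffer_num  = 2;            /* instead of input->buffer_num_recommended  */
       input->buffer_size = 1024 * 1024;  /* instead of input->buffer_size_recommended */
    }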

ghost commented 8 years ago

Hm, then why not the buffer_size_min/buffer_num_min? Or would that be too extreme on the other end, and not buffer enough data?

To be honest, I think a media player which requires buffering of input data should actually buffer it somewhere else, and not in the decoder itself. The decoder should only buffer as much as is required to keep its internal pipeline busy. But maybe others would disagree.

popcornmix commented 8 years ago

I think buffer_size_min=2048 and buffer_num_min=1, so for large video frames there would be many round trips to the GPU, which would hurt performance.

How much is needed to keep its internal pipeline busy though? On a Pi1 when ARM CPU usage is high there could be hundreds of milliseconds before new data appears.

All of this can be done from the arm side. It makes little difference if mmal_queue_timedwait blocks because the codec is refusing to consume data, or if you sleep explicitly when the timestamps of submitted frames exceed the timestamps of returned frames by some margin. Ideally you may need to know the number of pictures in the DPB and the framerate of the video to ensure the codec has enough data, but it should be possible.

ghost commented 8 years ago

I wanted to use this to achieve fixed buffering in the ffmpeg MMAL decoder. But it doesn't work: I can't find a way to distinguish a discarded frame from a frame that takes unexpectedly long to decode. I can't just use a large timeout either, because 1) it can stall decoding (without getting new input), 2) even if it gets input it leads to more buffering (even if the buffering is strictly speaking outside of the GPU), and 3) it is not truly reliable.

I looked at the kodi code, and even that doesn't seem to get away with timeouts and heuristics that double-guess what's going on inside of the GPU.

I suppose I could somehow try decoding a few frames more than needed to achieve the fixed delay, and then when it "underruns" consider it a dropped frame. But even then it's hard to tell exactly when it "underruns". It could be that the input is still a bit "in flight". The decoder could still be working, but there's no way to tell. Would it be enough to check the input queue buffer to know whether the decoder is busy? This all sounds fragile and racy. Timeouts will just interrupt playback if they block, or expire and cause glitches with high-bitrate video. To make it even worse, discarded frames are not all that uncommon. (Although the decoder seems to repeat frames for some packets that contain no frames.)

I could modify the libavcodec API to output 0 to N frames per decode call (instead of the forced "lock step" packet->frame data flow). Then I could just copy all the kodi hacks. But even then I'd have to live with that timeout, and variable buffering. (While I don't mind the variable buffering, it prevents me from determining AVI timestamps - which either requires a fixed delay so that a fixed size PTS buffer can be used, or exact information about which frames were dropped.)

Maybe I'm just missing something? Any tips?

popcornmix commented 8 years ago

About discarded frames - what do you set MMAL_PARAMETER_VIDEO_DECODE_ERROR_CONCEALMENT to? I set it to false, and discard them on arm side when (buffer->flags & MMAL_BUFFER_HEADER_FLAG_CORRUPTED)

ghost commented 8 years ago

I don't set it at all. It's possibly what I'm looking for - what exactly does it do?

popcornmix commented 8 years ago

When error concealment is enabled, corrupted pictures are not returned. At the start of the stream the state is considered corrupt until an IDR or recovery point is seen. If a frame has (some category of) errors the state is considered corrupt again (until the next IDR).

Disabling error concealment means you see every picture and can choose whether to display them or not (using the MMAL_BUFFER_HEADER_FLAG_CORRUPTED flag).
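
A sketch of that arrangement (the component pointer and choice of port are assumptions):

    #include "interface/mmal/mmal.h"
    #include "interface/mmal/util/mmal_util_params.h"

    /* Sketch: disable concealment so every picture comes back, flagged when the
       stream state is corrupt, and filter on the ARM side instead. */
    static MMAL_STATUS_T disable_concealment(MMAL_COMPONENT_T *decoder)
    {
       return mmal_port_parameter_set_boolean(decoder->input[0],
                 MMAL_PARAMETER_VIDEO_DECODE_ERROR_CONCEALMENT, MMAL_FALSE);
    }

    /* Output-port callback: decide per picture whether to display it. */
    static void output_cb(MMAL_PORT_T *port, MMAL_BUFFER_HEADER_T *buffer)
    {
       if (!(buffer->flags & MMAL_BUFFER_HEADER_FLAG_CORRUPTED))
       {
          /* ...hand the picture off for display... */
       }
       mmal_buffer_header_release(buffer);   /* recycle the header */
    }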

ghost commented 8 years ago

What happens with packets that do not contain a frame at all (e.g. vop_coded=0)?

popcornmix commented 8 years ago

Not certain. @deborah-c or @6by9 may be able to say.

6by9 commented 8 years ago

Sorry, too far down the stack for my knowledge.

ghost commented 8 years ago

MMAL actually does repeat the previous frame for packets with no frames. But sometimes it still drops output. I didn't verify what's actually in the packets in this case, though, or what's different about packets which skip output.

Here is a sample file: https://0x0.st/X18.avi (plays with omxplayer but shows A/V desync; plays fine in Raspbian's packaged kodi).

ghost commented 8 years ago

Also, on the subject of excessive buffering and avoiding timeouts, I think it would be ideal if input buffers were returned only after the corresponding packet was decoded. (Apparently this is not the case.)

Then an API user could actually know whether a frame just takes longer to decode, or if it was dropped.

I think currently there is no way of detecting this situation.

ghost commented 8 years ago

And in addition, silently discarding packets makes it impossible to reassociate timestamps for AVI with B-frames (AVI has no proper PTS).

6by9 commented 8 years ago

Also, on the subject of excessive buffering and avoiding timeouts, I think it would be ideal if input buffers were returned only after the corresponding packet was decoded. (Apparently this is not the case.)

You've now increased the minimum input buffer count to be some variable value dependent on the stream and reference frames. How do you set that ahead of time without making it arbitrarily high? And it needs to factor in the size of those frames too as they may get fragmented into multiple buffers. So potentially double or treble the number of reference frames to be presented, "just in case".

OpenMAX IL and MMAL are both designed around streaming data down pipelines, not breaking it down into constituent elements.

Sorry, not going on the list of things I'll look at (video decode isn't my area of knowledge anyway). Up to @popcornmix if he wants to look at it.

ghost commented 8 years ago

You've now increased the minimum input buffer count to be some variable value dependent on the stream and reference frames.

I guess the input buffers work more like a unix pipe, rather than a packet queue? (The difference is that the former is just a FIFO that shuffles the data across, rather than having some inherent logic about its contents.)

It doesn't have to work like I suggested. I just want a way to know when a packet was discarded, instead of resulting in an output frame.

In any case, something is needed to get bounded buffering without running into deadlocks in corner cases. I also need something to determine AVI timestamps.

OpenMAX IL and MMAL are both designed around streaming data down pipelines, not breaking it down into constituent elements.

That's completely orthogonal. The fixed-size output ring buffer doesn't really match with what you describe, by the way. Output buffers are discrete elements, and also the user has to take care to recycle them, so they're not just "streamed down" a pipeline.

6by9 commented 8 years ago

I guess the input buffers work more like a unix pipe, rather than a packet queue? (The difference is that the former is just a FIFO that shuffles the data across, rather than having some inherent logic about its contents.)

Sounds like a reasonable comparison, though with hints available as to where frame boundaries may be.

It doesn't have to work like I suggested. I just want a way to know when a packet was discarded, instead of resulting in an output frame.

Others would be able to correct me on this, but I believe internally there is a callback when the codec has finished with a frame. If so, then it may be possible to produce an empty IL/MMAL buffer if the codec has discarded a frame. As long as you pay attention to nFilledLen / length before processing a frame, that would give you your signalling for AVI timestamps without upsetting contents. The encoder certainly works in that manner if rate control decides not to use any bits on a frame.

ghost commented 8 years ago

Others would be able to correct me on this, but I believe internally there is a callback when the codec has finished with a frame. If so, then it may be possible to produce an empty IL/MMAL buffer if the codec has discarded a frame.

That would be reasonable. The idea is similar to how MMAL_PARAMETER_VIDEO_DECODE_ERROR_CONCEALMENT works.

As long as you pay attention to nFilledLen / length before processing a frame, that would give you your signalling for AVI timestamps without upsetting contents.

Not sure if I understand. What is nFilledLen?

6by9 commented 8 years ago

That would be reasonable. The idea is similar to how MMAL_PARAMETER_VIDEO_DECODE_ERROR_CONCEALMENT works.

Video_decode is not my area, but reading back I see popcornmix commented https://github.com/raspberrypi/firmware/issues/496#issuecomment-179434227

what do you set MMAL_PARAMETER_VIDEO_DECODE_ERROR_CONCEALMENT to? I set it to false, and discard them on arm side when (buffer->flags & MMAL_BUFFER_HEADER_FLAG_CORRUPTED)

So have we already got that mechanism in place, delivering a buffer with the flag set?

Not sure if I understand. What is nFilledLen?

https://github.com/raspberrypi/userland/blob/master/interface/vmcs_host/khronos/IL/OMX_Core.h#L421 IL equivalent to https://github.com/raspberrypi/userland/blob/master/interface/mmal/mmal_buffer.h#L81 length in MMAL. The Videocore components are almost all common between IL and MMAL, so I was considering both options.
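
If the firmware did return an empty buffer for a discarded frame (a suggestion above, not an existing feature as far as I know), the ARM-side check could be as simple as this sketch:

    #include "interface/mmal/mmal.h"

    /* Hypothetical: assumes the decoder returned an empty output buffer for each
       discarded packet, which is only a proposal in this thread. */
    static void output_cb(MMAL_PORT_T *port, MMAL_BUFFER_HEADER_T *buffer)
    {
       if (buffer->length == 0)
       {
          /* packet consumed, no picture produced: advance the application's
             AVI timestamp queue without emitting a frame */
       }
       else
       {
          /* normal decoded picture */
       }
       mmal_buffer_header_release(buffer);
    }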

ghost commented 8 years ago

So have we already got that mechanism in place, delivering a buffer with the flag set?

Yes, but it might be different because a corrupted frame is still a frame, which might contain a partially decoded picture, and which users might even want to see. In this case, the frame would be just empty. Not a problem in my opinion, though.

6by9 commented 8 years ago

Too far into codec detail for me again. I'll have a quick look tonight at the video_decode component code to see if there is a path that results in a codec callback not producing an output buffer (assuming it is working in the manner I vaguely recall).

6by9 commented 8 years ago

Too involved for a quick solution from me. There are times when a callback structure isn't assigned, and that includes when the timestamp is presented as TIME_UNKNOWN. So an unknown timestamp on a frame that is to be discarded may result in no output buffer. Given time to take a bundle of logs and analyse them I might understand some of the code paths here, but I'm working on other things at the moment.

ghost commented 8 years ago

Thanks for taking a look anyway.

I still hope some mechanism can be provided in the future for determining whether a packet was discarded. (Or something similar.)

popcornmix commented 8 years ago

Have you tried using MMAL_PARAMETER_VIDEO_MAX_NUM_CALLBACKS ? See: https://github.com/popcornmix/xbmc/commit/12aa2a809545846c83bb9fbb8d34fba90860b411

It's working well in latest kodi builds and ensures minimal frames are buffered in codec. Make sure a flag is set on every input packet (e.g. buffer->flags = MMAL_BUFFER_HEADER_FLAG_USER0) which ensures every frame gets a userdata callback structure (which is what is being limited).
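
A sketch of the input side with the flag applied (using |= so any other flags already on the buffer are kept):

    #include "interface/mmal/mmal.h"

    /* Sketch: ensure every input packet carries a user flag so video_decode
       allocates a userdata callback structure for it, making the
       MMAL_PARAMETER_VIDEO_MAX_NUM_CALLBACKS limit apply to every packet. */
    static MMAL_STATUS_T submit_packet(MMAL_PORT_T *input, MMAL_BUFFER_HEADER_T *buffer)
    {
       buffer->flags |= MMAL_BUFFER_HEADER_FLAG_USER0;
       return mmal_port_send_buffer(input, buffer);
    }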

ghost commented 8 years ago

Have you tried using MMAL_PARAMETER_VIDEO_MAX_NUM_CALLBACKS ?

Yes. It works, but doesn't quite achieve what I want. The subtle difference is that while it limits the maximum, I still can't ensure that it stays close to the desired buffering (and without that, performance would suffer). My problem with silently discarded packets/frames is that they will decrease the buffering each time. While it's easy to estimate the number of frames buffered in the decoder, you can't get a definite number due to the asynchronous behavior. In the end, distinguishing a dropped frame from a frame that takes longer to decode seems impossible (at least if you want to be certain).

Make sure a flag is set on every input packet (e.g. buffer->flags = MMAL_BUFFER_HEADER_FLAG_USER0) which ensures every frame gets a userdata callback structure (which is what is being limited).

I wasn't aware of such a behavior. Currently I'm not setting MMAL_BUFFER_HEADER_FLAG_USER0 on any buffers. What exactly would it change? As far as I'm aware, I do get callbacks for all buffers sent to the ports.

popcornmix commented 8 years ago

The userdata callback structures are used internally by video_decode for returning metadata with the decoded pictures. MMAL_PARAMETER_VIDEO_MAX_NUM_CALLBACKS limits the number of userdata callback structures in the pool, which causes blocking (and hence a bound on the number of frames inside the decoder), which is desirable in some cases.

However under some circumstances the metadata is not required.

      // list of flags that we'll not pass along to the next component
      useless_flags = OMX_BUFFERFLAG_ENDOFFRAME|
         OMX_BUFFERFLAG_STARTTIME|OMX_BUFFERFLAG_CODECCONFIG|OMX_BUFFERFLAG_EXTRADATA|
         OMX_BUFFERFLAG_FRAGMENTLIST;

      // Check if we need to add a callback structure to the data being decoded
      // There basically isn't any need to add one if it won't contain anything useful
      if (in->hMarkTargetComponent || // buffer has a mark
          !(in->nFlags & OMX_BUFFERFLAG_TIME_UNKNOWN) || // buffer has a timestamp
          (in->nFlags & ~useless_flags)) // flags we need to pass along

So if you have a timestamp or a non-useless flag, all is good. If not, the limiting of buffers won't apply. Always setting MMAL_BUFFER_HEADER_FLAG_USER0 does guarantee the allocation of a callback structure and so ensures the blocking behaviour.

ghost commented 8 years ago

On what will it block? If the number of output callbacks reaches the configured max limit? (I don't think this will help me in the situation I've described.)

I'm not sure if I fully grasp the consequences of setting the user flag. I'll play with it a bit next week or so.

popcornmix commented 8 years ago

The excessive buffering issue reported is because the video decode hardware has a large buffer (1-2MB) of input data. That means it can swallow many dozens of frames when bitrate is low. The decoded frames returned may be a few seconds older than the current encoded frames being submitted.

MMAL_PARAMETER_VIDEO_MAX_NUM_CALLBACKS provides a means of limiting the number of frames inside the decoder to avoid this issue. However it only applies to frames that require user metadata to be returned. Setting MMAL_BUFFER_HEADER_FLAG_USER0 is a simple way of ensuring every frame has user metadata.

The MMAL_BUFFER_HEADER_FLAG_USERx flags allow flags to be attached to frames submitted that can be seen in the returned pictures. Here is an example where it is used for dropping decoded frames when requested to support inexact seeking. https://github.com/popcornmix/xbmc/commit/70a0bda8200ed7c8c154e953b1a33c52c0cae0db
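
A sketch of that flag round trip (USER1 is just an example choice; the kodi commit linked above differs in detail):

    #include "interface/mmal/mmal.h"

    /* Input side: mark a packet whose decoded picture we intend to discard
       (e.g. while seeking to an inexact position). */
    static MMAL_STATUS_T submit_drop_requested(MMAL_PORT_T *input, MMAL_BUFFER_HEADER_T *in)
    {
       in->flags |= MMAL_BUFFER_HEADER_FLAG_USER1;
       return mmal_port_send_buffer(input, in);
    }

    /* Output side: the same flag reappears on the decoded picture. */
    static void handle_picture(MMAL_BUFFER_HEADER_T *out)
    {
       if (out->flags & MMAL_BUFFER_HEADER_FLAG_USER1)
       {
          /* decoded, but marked for dropping: skip display */
       }
       mmal_buffer_header_release(out);
    }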

ghost commented 8 years ago

OK, I suppose I was observing mostly correct buffering because I usually set a timestamp on the buffer.

Here is an example where it is used for dropping decoded frames when requested to support inexact seeking.

Completely unrelated, but couldn't the decoder skip decoding of flagged frames if they're non-ref frames?

popcornmix commented 8 years ago

Completely unrelated, but couldn't the decoder skip decoding of flagged frames if they're non-ref frames?

Yes. MMAL_BUFFER_HEADER_FLAG_DECODEONLY could be used in this situation, and then the GPU will not return the frame at all. However Kodi seems happier when a frame is returned with DVP_FLAG_DROPPED compared to no frame at all.

If I wasn't fitting in with Kodi's expected behaviour MMAL_BUFFER_HEADER_FLAG_DECODEONLY would be a more efficient solution (marginally with opaque pointers or zero copy, significantly if YUV data is being returned).
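
For the more efficient variant, the flag goes on the input packet instead (a sketch):

    #include "interface/mmal/mmal.h"

    /* Sketch: the packet is decoded (so reference frames stay valid) but the
       resulting picture is never returned by the GPU. */
    static MMAL_STATUS_T submit_decode_only(MMAL_PORT_T *input, MMAL_BUFFER_HEADER_T *in)
    {
       in->flags |= MMAL_BUFFER_HEADER_FLAG_DECODEONLY;
       return mmal_port_send_buffer(input, in);
    }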

ghost commented 8 years ago

MMAL_PARAMETER_VIDEO_MAX_NUM_CALLBACKS provides a means of limiting the number of frames inside the decoder to avoid this issue. However it only applies to frames that require user metadata to be returned. Setting MMAL_BUFFER_HEADER_FLAG_USER0 is a simple way of ensuring every frame has user metadata.

Tried it - this didn't help in this specific case. It still skips some frames in the sample file I have. (But maybe it's not meant to fix this specific case.)