raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

video_decode: Internally queues thousands of frames #164

Closed · sdroege closed this 11 years ago

sdroege commented 11 years ago

Hi,

video_decode is currently able to internally queue thousands of frames. What I mean by this is that you could have thousands of frames passed to the input port (which are immediately copied to the videocore memory) before even outputting a single frame.

This is a bit unfortunate, especially if the feeding of the input port and the grabbing of frames from the output port happen on different threads. Ideally there would be a limit of a few frames (20 or so) that could be internally queued at maximum; otherwise it's too easy to exhaust the videocore memory and then not be able to use it for other things.

popcornmix commented 11 years ago

VideoCore has an exact copy of the buffers allocated on the host through AllocateBuffer/UseBuffer. It is up to the host how large they are and how many are created. VideoCore will not allocate any more buffers at the input port.
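For context, a minimal sketch (not from the thread; function name, the choice of 20 buffers, and the error-free flow are assumptions) of how the host side sets the count and size of the buffers that VideoCore then mirrors, using video_decode's input port 130 and the standard IL calls:

    #include <string.h>
    #include "ilclient.h"   /* hello_pi helper; pulls in the IL/OMX_* headers */

    /* Sketch: query the input port definition, pick the buffer count, then
     * allocate. VideoCore keeps a copy of exactly these buffers, no more.
     * (OMX_AllocateBuffer is normally called during the Loaded->Idle transition.) */
    static void allocate_input_buffers(OMX_HANDLETYPE decoder)
    {
       OMX_PARAM_PORTDEFINITIONTYPE def;
       memset(&def, 0, sizeof(def));
       def.nSize = sizeof(def);
       def.nVersion.nVersion = OMX_VERSION;
       def.nPortIndex = 130;                      /* video_decode input port */

       OMX_GetParameter(decoder, OMX_IndexParamPortDefinition, &def);

       def.nBufferCountActual = 20;               /* the host chooses the count... */
       OMX_SetParameter(decoder, OMX_IndexParamPortDefinition, &def);

       for (OMX_U32 i = 0; i < def.nBufferCountActual; i++) {
          OMX_BUFFERHEADERTYPE *buf = NULL;       /* ...and nBufferSize sets the size */
          OMX_AllocateBuffer(decoder, &buf, 130, NULL, def.nBufferSize);
       }
    }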

sdroege commented 11 years ago

Ok, let me get you a testcase based on hello_video :)

sdroege commented 11 years ago

Indeed I can't reproduce it with hello_video... weird. Let me try to find the bug in my own code then... (hello_video, by the way, accepts 37 input buffers in the default configuration if no output buffer is grabbed; 20 input buffers and 1 output buffer are allocated).

sdroege commented 11 years ago

Or maybe not: I can't reliably reproduce it with hello_video, but it seems to be something timing-related: http://pastebin.com/6xQ6fULL

First of all, at the beginning there's often a burst of input buffers accepted until some kind of limiting kicks in. Then sometimes, for a single FillBufferDone, multiple EmptyBufferDone callbacks are emitted, and the overall number of queued frames increases.

When egl_render is tunneled to video_decode this behaviour is far worse (that's where the multiple thousands of frames come from), but with a bit of luck it can be reproduced with only video_decode too.

sdroege commented 11 years ago

http://pastebin.com/858zURu6

Replace video.c from hello_videocube with that. As you'll see, in the beginning it usually has around 30 frames pending... and one "got buffer" roughly corresponds to one "passed buffer". But sometimes there are multiple "passed buffer", and especially after some time there are usually two or more "passed buffer" per "got buffer", leaving multiple hundreds of frames internally queued.

As video.c just passes 80k per buffer and not a single frame per buffer, this is not really realistic, but in my application, where I actually pass exactly one encoded frame per input buffer, exactly the same can be observed.

This also has the side effect that output is produced too slowly in my application, because I actually have to pass multiple frames to the decoder for every output frame. Something seems to be wrong with the algorithm on the firmware side that should keep the number of internally queued frames more or less constant.

sdroege commented 11 years ago

Note that in test.h264 only very few encoded frames are larger than 80k; most are around 30k-40k, so on average there would even be multiple encoded frames passed to the component in one buffer but only one returned.

sdroege commented 11 years ago

Or even better this one: http://pastebin.com/eVvSGN70

Note the #if 1 in line 101. If that code is there (i.e. the EGLImage buffer is never returned to the renderer) video_decode will happily accept all frames until EOS (i.e. never in this example). If that code is not there (#if 0) you'll see that the number of pending/queued frames increases over time.

sdroege commented 11 years ago

Similar with video_decode: http://pastebin.com/8BW7nWm6

Here it blocks if the #if 0 line is changed to #if 1, after about 40 buffers passed to the component (which is already a bit too high). With the #if 0 kept it will gradually queue more and more frames over time.

In any case, it's far worse with egl_render but I think there's a general problem with this logic in the firmware.

julianscheel commented 11 years ago

Again, I am highly interested in this as well. It seems to be the last showstopper for getting proper EGLImage/omx support for the Raspberry Pi into gstreamer, which would be a huge improvement for the media-framework situation on the Raspberry Pi.

popcornmix commented 11 years ago

Some comments from OpenMax guy 2:

Regarding http://pastebin.com/8BW7nWm6:

• Slight error: the buffer is not actually emptied between the call to ilclient_get_output_buffer and the call to OMX_FillThisBuffer. The client code should zero nFilledLen and nFlags, so the buffer is actually empty before being refilled (see the sketch after this list).
• Also a slight error in assuming the input packet size is 80kb; this should be a GetParameter call to find out the value rather than a hardcoded number, but I appreciate this is debug code.
• Is the complaint that the variable queued increases without bound? This value is the number of input packets sent minus the number of decoded frames received. Unless we can guarantee that we have one encoded frame per input packet there is no reason why queued should stay around zero. As it is, we just feed in 80kb per input buffer and ignore any framing.
• The comment that says that if we stop pulling output buffers out of the decoded picture queue then the application blocks: this sounds exactly right, since the decoder has nowhere to put the decoded pictures. So this is expected behaviour.
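A minimal sketch of the refill pattern from the first point, with placeholder names (the component variable and the wrapping function are ours, not from the pastebin; port 131 is video_decode's output port):

    /* Sketch: hand a consumed output buffer back to the component truly empty. */
    static void refill_output_buffer(COMPONENT_T *video_decode)
    {
       OMX_BUFFERHEADERTYPE *out = ilclient_get_output_buffer(video_decode, 131, 1);

       /* ...consume out->pBuffer / out->nFilledLen here... */

       out->nFilledLen = 0;   /* mark the buffer as empty again */
       out->nFlags     = 0;
       OMX_FillThisBuffer(ILC_GET_HANDLE(video_decode), out);
    }

Similarly, the hardcoded 80k would be replaced by the nBufferSize value returned by OMX_GetParameter with OMX_IndexParamPortDefinition on the input port.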

On the comment "As video.c is just passing 80k per buffer and not a single frame per buffer this is not really realistic, but in my application where I actually pass exactly one encoded frame per input buffer exactly the same can be observed": I'm not convinced, since there really is a fixed number of input buffers, and there will be a fixed number of output buffers.

At the start it says "video_decode is currently able to internally queue thousands of frames. What I mean by this is that you could have thousands of frames passed to the input port (which are immediately copied to the videocore memory) before even outputting a single frame." Again, I'm not convinced. If the frames don't emit an output buffer then they will be discarded, giving the appearance of queuing lots of frames internally. If this is the situation, then they are being passed to the decoder but aren't resulting in a decoded frame – typically the stream is corrupt, or (more likely) the start code/length code has been misconfigured (or the SPS/PPS is inaccurate or missing).

At one point in the past (unsure if this is still true) frames could be swallowed by the decoder into internal codec FIFO memory, so the input buffer could be returned before the output was produced, effectively increasing the amount of buffering available for encoded bitstreams. The size used to be around 1MB, so if frames are small then many can be buffered internally. Is this true on the version of the codec that the RPi has?

And from OpenMax guy 1:

Yes, the problem here is that the codec does still have a big fixed-size internal buffer into which it copies the input OMX buffers after doing some lightweight processing on them (i.e. emulation prevention removal). So the input OMX buffers get copied and released almost straight away to the client.

There actually is some code in video_decode.c which provides some rate limiting but only as a side effect of the timestamp management. We need to keep timestamps (and other metadata) associated with coded frames up until frames get decoded and fed back to us and we do that by having a fixed size array of timestamp slots (we currently have 96 of them). So if your input buffers have timestamps associated with them, you will only be able to queue up to 96 of them inside the codec.

So I would suggest that sdroege should really pass timestamps with his frames for that reason and also because timestamps are actually there for a reason… to keep AV sync going properly.

We could do rate limiting even when no timestamps are fed, but that's a bit dangerous because we don't necessarily know how much of a full frame an input OMX buffer contains. For instance, in the case of MPEG TS you could decide to feed TS packets (180-ish bytes) to the decoder one at a time, and then you suddenly do want to be able to queue loads of input buffers inside the codec.
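To illustrate the timestamp suggestion, a hedged sketch (the to_omx_ticks helper, send_frame wrapper, frame_index and the 40 ms frame duration are ours, not part of the IL API or the thread) of tagging each coded frame so the 96 timestamp slots provide the rate limiting described above:

    #include <string.h>
    #include <stdint.h>
    #include "ilclient.h"   /* from hello_pi/libs/ilclient */

    /* With OMX_SKIP64BIT (as in the hello_pi builds) OMX_TICKS is a struct,
     * so fill both halves; this helper is ours, not part of the IL API. */
    static OMX_TICKS to_omx_ticks(int64_t us)
    {
       OMX_TICKS t;
       t.nLowPart  = (OMX_U32)us;
       t.nHighPart = (OMX_U32)(us >> 32);
       return t;
    }

    /* Sketch: send exactly one encoded frame per input buffer, tagged with a
     * PTS in microseconds (len is assumed to fit in the allocated buffer). */
    static void send_frame(COMPONENT_T *video_decode, const uint8_t *frame,
                           size_t len, int64_t frame_index)
    {
       OMX_BUFFERHEADERTYPE *in = ilclient_get_input_buffer(video_decode, 130, 1);
       memcpy(in->pBuffer, frame, len);
       in->nFilledLen = len;
       in->nFlags     = OMX_BUFFERFLAG_ENDOFFRAME;          /* one frame per buffer */
       in->nTimeStamp = to_omx_ticks(frame_index * 40000);  /* e.g. 25 fps -> 40 ms */
       OMX_EmptyThisBuffer(ILC_GET_HANDLE(video_decode), in);
    }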

And from me:

So, you cannot exhaust GPU memory by feeding it too many frames. There is a fixed-size input buffer used by the decoder hardware, in addition to the allocated openmax input buffers, but once those are full things will block.

sdroege commented 11 years ago

In my code I'm passing exactly one frame per input buffer and also putting correct timestamps on all of them. Adding all that to the hello_video examples would be quite some work though.

I agree that there seems to be some limiting in the case of only using video_decode (it seems to top out at around 60 frames), but when using tunneling to egl_render this limiting is completely gone: http://pastebin.com/eVvSGN70 (especially see line 101)

The comments OpenMAX guy 2 had are valid, but that's how hello_video did it; in my code it is exactly as he says (nFilledLen set to 0, one frame per input buffer, querying of the input port sizes). For everything else, the internal FIFO that exists in the firmware is exactly what I've experienced, yes. It runs full after some time when using only video_decode, but it won't when egl_render is tunneled to video_decode as in http://pastebin.com/eVvSGN70 (just check line 101 and see what happens).

The case with egl_render is currently the main problem; the ~60 frames that are queued with video_decode alone are acceptable, and the increase isn't as big there.

sdroege commented 11 years ago

For video_decode-only I seem to get up to 125 queued frames (not buffers), but it seems to depend a bit on the movie. It's not that much of a problem, though; it seems clearly limited to some number. The real problem is with egl_render, as mentioned above.

sdroege commented 11 years ago

Could it be that egl_render only ever has the very latest frame available, and if there happens to be a free output buffer it puts the frame in there... otherwise the frame disappears when the next one comes in?

Below is some log output from the egl_render / hello_videocube test that only gets a single buffer from egl_render but continues to push in new buffers. What is interesting here is the "dropping input image" part, which also happens when everything works fine, but then an OMX_IndexConfigBrcmEGLImageMemHandle set and an EGLImage output happen for every single "dropping input image".

    2916705.414: video_decode:54:EmptyThisBuffer(f473040,81920,0->130)
    2916705.442: video_decode:54: queueBuffer (input : 81920)
    2916705.464: video_decode:54:qb RIL:20
    2916705.527: video_decode:54:RIL: extract from fifo: len:81920 flags:100 time:0
    2916705.561: video_decode:54:RIL: decoding 81920 bytes, 0us, flags 100, (cb:0x0)
    2916705.590: video_decode:54:RIL: asking for more
    2916717.579: video_decode:54:RIL:cb:entered with buffer = 0xf474788, frame_info = 0xf481da8, status = 0, cb= 0x0
    2916717.663: video_decode:54:RIL:cb:B time:0, cb:0x0, ar:0/0, crop:0x0 1920x1080, interval:0, interlace:0,0,0
    2916717.726: video_decode:54:RIL:cb:entered with buffer = 0xf474708, frame_info = 0xf481f70, status = 0, cb= 0x0
    2916717.753: video_decode:54:RIL:output dec_frames_full:f470cac out:0
    2916717.829: video_decode:54:RIL:cb:P time:0, cb:0x0, ar:0/0, crop:0x0 1920x1080, interval:0, interlace:0,0,0
    2916717.863: video_decode:54:RIL:unknown time to 8100000
    2916717.971: decode: display frame 0f474788 T=8100000 sts=0
    2916717.998: egl_renderRIL:display image(f474788) state 4
    2916718.017: egl_renderRIL:display image, dropping input image(f474688)
    2916718.066: video_decode:54:RIL: disp_cb passed f474788/8100000/0x0
    2916718.102: video_decode:54:RIL:output dec_frames_full:f470d10 out:0
    2916718.158: video_decode:54:RIL:unknown time to 8133333
    2916718.215: decode: display frame 0f474708 T=8133333 sts=0
    2916718.250: egl_renderRIL:display image(f474708) state 4
    2916718.269: egl_renderRIL:display image, dropping input image(f474788)
    2916718.311: video_decode:54:RIL: disp_cb passed f474708/8133333/0x0
    2916718.746: video_decode:54:cb:consumed_input_buffer(f4730c0,0b,0f) RIL:20
    2916719.133: egl_renderRIL:get_output_buffer returning NULL
    2916719.168: egl_renderRIL:return omx buffers:free 0
    2916719.203: egl_renderRIL:get_output_buffer returning NULL
    2916719.222: egl_renderRIL:return omx buffers:free 0

cybin commented 11 years ago

Please get this issue solved. I wouldn't mind spending some time to support anyone, even though I'm far from familiar with that GL stuff.

I'm waiting for full gstreamer playback on RPi too, so please keep up the pace on this.

julianscheel commented 11 years ago

@popcornmix: Can you confirm whether you can follow sdroege's explanation? Any further thoughts? Or do you need another testcase or similar?

popcornmix commented 11 years ago

New bug logged in our bugtracker:

egl_render needs a non-discard mode for input buffers

egl_render currently handles input buffers using the function er_display_image. It has one slot for the next input buffer to use (pending_image); if a second image arrives when that slot is full, it will release the pending image and store the new one.

The aim of this is that egl_render is designed to be used after video_scheduler, so that it is presented frames at display time. If the client/GL rendering is not keeping up with the presentation rate of the video, then we want to discard frames rather than delay them arbitrarily, since delaying would destroy AV sync.

However, if we tunnel video_decode directly to egl_render and expect to extract each frame as an egl image, then we will be disappointed. Since the decoder probably works much faster than the client is supplying images, we will end up discarding a large number of frames.

egl_render needs to have a client-selectable behaviour, where it either discards frames (as now) or buffers up frames for conversion. The client would set some parameter to change which mode the component acts in.

Since this component accepts OpenMAX IL buffers and VC_IMAGE_T types directly, the buffering scheme needs to work with both. Given that these are decoded buffers, and VC_IMAGE_T doesn't have a next pointer to construct a linked list, it is probably acceptable to have a fixed-size queue of image handles, and then to discard frames if this queue is full. Potentially the parameter to select this behaviour could specify the length of the queue, with a sensible default value (32?).
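For illustration only (this is not firmware source; the names and the opaque void* handle type are ours), the proposed scheme might look like a fixed-size ring of image handles that drops new frames when full:

    /* Sketch: bounded queue of decoded image handles with drop-on-full,
     * mirroring the proposed default length of 32. */
    #define EGL_IMAGE_QUEUE_LEN 32

    typedef struct {
       void    *slots[EGL_IMAGE_QUEUE_LEN];   /* OMX buffers or VC_IMAGE_T handles */
       unsigned head, count;
    } image_queue_t;

    /* Returns 0 if the frame was queued, -1 if the queue was full and it was dropped. */
    static int image_queue_push(image_queue_t *q, void *image)
    {
       if (q->count == EGL_IMAGE_QUEUE_LEN)
          return -1;                           /* full: discard the new frame */
       q->slots[(q->head + q->count) % EGL_IMAGE_QUEUE_LEN] = image;
       q->count++;
       return 0;
    }

    static void *image_queue_pop(image_queue_t *q)
    {
       if (q->count == 0)
          return NULL;
       void *image = q->slots[q->head];
       q->head = (q->head + 1) % EGL_IMAGE_QUEUE_LEN;
       q->count--;
       return image;
    }

Storing opaque pointers sidesteps the lack of a next pointer in VC_IMAGE_T, since the queue itself owns the linkage.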

julianscheel commented 11 years ago

@popcornmix Thanks for the update. Can you give a rough timeframe in which an update can be expected? We have a major project (with a whole lot of Raspberry Pis) pending on this issue and are under major time pressure now.

sdroege commented 11 years ago

Thanks. I don't think having a fixed-size queue here and dropping again once this queue is full is a real solution; it only delays the problem.

Ideally egl_render would block if this queue is full and before it accepts any new frames.

popcornmix commented 11 years ago

video_decode has a fixed number of output frames (something like 3 + the number of frames in the DPB) and blocks when they are all allocated.

Setting the queue length larger than that should give the blocking behaviour you require.

popcornmix commented 11 years ago

@julianscheel It's out of my hands. It is classified as "Major" and has someone assigned to it, so it's probably days rather than weeks, but depends on the priority of other bugs in the system.

julianscheel commented 11 years ago

@popcornmix Any news on this yet? In case there is any experimental firmware you could provide for our internal testing and development we would be more than happy.

popcornmix commented 11 years ago

The Broadcom OpenMax guy was testing a fix on the Pi just now. It's not working yet, but a test version may be possible tomorrow.

cybin commented 11 years ago

Great news.

popcornmix commented 11 years ago

https://dl.dropbox.com/u/3669512/temp/start_egl_block.elf

This is a fixed 128M GPU / 128M ARM memory split (of the start_x variety). Remove /boot/fixup.dat before running.

To disable discard mode, the following lines will need to be added to video.c in the hello_videocube test:

#define OMX_IndexParamBrcmVideoEGLRenderDiscardMode 0x7f0000dc
                OMX_CONFIG_PORTBOOLEANTYPE discardMode;
                memset(&discardMode, 0, sizeof(discardMode));
                discardMode.nSize = sizeof(discardMode);
                discardMode.nPortIndex = 220;
                discardMode.nVersion.nVersion = OMX_VERSION;
                discardMode.bEnabled = OMX_FALSE;
                OMX_SetParameter(ILC_GET_HANDLE(video_render), OMX_IndexParamBrcmVideoEGLRenderDiscardMode, &discardMode);

It's not fully tested, but is provided on a "it could work" basis.
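One possible addition to the snippet above (our suggestion, not part of the provided change): check the OMX_SetParameter return value, since firmware without this fix won't recognise the new index. stdio.h is already included in the hello_pi examples' video.c.

    OMX_ERRORTYPE err = OMX_SetParameter(ILC_GET_HANDLE(video_render),
                                         OMX_IndexParamBrcmVideoEGLRenderDiscardMode,
                                         &discardMode);
    if (err != OMX_ErrorNone)
       printf("EGL render discard mode not supported by this firmware (0x%x)\n", err);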

sdroege commented 11 years ago

Seems to work much better; at least no frames are lost anymore. But it seems that frames come in bursts: for some time nothing is output at all, then suddenly a lot of frames... and sometimes it stops completely. It's definitely an improvement :)

julianscheel commented 11 years ago

@popcornmix Can you reproduce the bursty behaviour sdroege describes? Is there anything more we can do to assist or speed this up?

popcornmix commented 11 years ago

The bursty behaviour wasn't obvious when running hello_videocube on Friday. If you can demonstrate it using hello_videocube that would help.

sdroege commented 11 years ago

Nevermind, that was just caused by my network being too slow to transfer my test file in real time. Everything's fine there :)

But I noticed a real problem now that I've spent more than 10 minutes on it. It seems that video_decode still accepts data "as fast as possible", and egl_render outputs frames with timestamps corresponding to all input frames... but the content of the EGLImages does not correspond to the timestamp; instead it's the content for a much later time. For example, I can cause a situation here where the current input frame to video_decode has a timestamp of about 1 minute, the current output timestamp of egl_render is at about 10 seconds... and the content of that EGLImage corresponds to some frame shortly before 1 minute.

Will try to somehow reproduce this with hello_videocube

popcornmix commented 11 years ago

https://dl.dropbox.com/u/3669512/temp/start_egl_block.elf has been updated with a fix for memory corruption. Don't think it will affect the timestamp issue.

sdroege commented 11 years ago

@popcornmix No it doesn't fix the timestamp/frame-content inconsistencies. Here's a testcase based on hello_videocube to reproduce that: http://pastebin.com/SyRiGBVw

It slows down the calls to OMX_FillThisBuffer() to one call per second, which should just make the video play in slow motion. What actually happens is that the video plays at normal speed but only one frame is displayed every second.
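For reference, a rough sketch of that kind of throttling, assuming the fill-buffer callback and the egl_render/eglBuffer globals from hello_videocube's video.c (sleeping inside an IL callback is only acceptable in a testcase like this):

    #include <stdio.h>
    #include <unistd.h>
    #include "ilclient.h"

    /* Sketch: only request the next decoded frame from egl_render once per second.
     * Registered via ilclient_set_fill_buffer_done_callback in the example. */
    static void my_fill_buffer_done(void *data, COMPONENT_T *comp)
    {
       sleep(1);   /* throttle refills of the single EGLImage output buffer */
       if (OMX_FillThisBuffer(ilclient_get_handle(egl_render), eglBuffer) != OMX_ErrorNone)
          printf("OMX_FillThisBuffer failed in callback\n");
    }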

julianscheel commented 11 years ago

@popcornmix Can you reproduce the timestamp issue with the modified hello_videocube?

popcornmix commented 11 years ago

Yes. It's my mistake. Can you try with:

#define OMX_IndexParamBrcmVideoEGLRenderDiscardMode 0x7f0000db

julianscheel commented 11 years ago

@popcornmix So far, in our gstreamer plugin the behaviour does not seem to change, no matter whether Discard is set to OMX_TRUE or OMX_FALSE. Is it required to do this setting at a certain point in time? Before creating the egl_render element? Or between creation and tunnel setup?

popcornmix commented 11 years ago

Does it fix the hello_videocube example? It's a property of egl_render, so it must be sent after egl_render is created, but before it is put into the executing state, I imagine. In the example code it is set immediately after creation of egl_render.
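A hedged sketch of that ordering with the ilclient helpers (variable names follow hello_videocube's video.c; discardMode is set up as in the earlier snippet, and port setup, OMX_UseEGLImage and error handling are omitted):

    /* Sketch: create egl_render, set the discard parameter, then tunnel and run. */
    ilclient_create_component(client, &egl_render, "egl_render",
                              ILCLIENT_DISABLE_ALL_PORTS | ILCLIENT_ENABLE_OUTPUT_BUFFERS);

    /* Set the parameter here: after creation, before the component executes. */
    OMX_SetParameter(ILC_GET_HANDLE(egl_render),
                     OMX_IndexParamBrcmVideoEGLRenderDiscardMode, &discardMode);

    set_tunnel(tunnel, video_decode, 131, egl_render, 220);
    ilclient_setup_tunnel(tunnel, 0, 0);
    ilclient_change_component_state(egl_render, OMX_StateExecuting);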

julianscheel commented 11 years ago

@popcornmix I will check that tomorrow morning. I haven't built the videocube example yet and sdroege is out of office for a few days. I will give you feedback as soon as I have built it and tested with old vs. new firmware.

One more question: Setting Discard mode to OMX_TRUE would make the firmware behave like the old one did?

popcornmix commented 11 years ago

Yes

julianscheel commented 11 years ago

I have tested the videocube testcase now and that one indeed seems fine. So it seems we are facing another problem in the gstreamer implementation. I will discuss with @sdroege what else could cause issues there.

julianscheel commented 11 years ago

Ok, we got the gstreamer part sorted, so we get smooth playback for files now. @popcornmix Would you expect memory bandwidth limitations when using egl rendering compared to omx rendering? When playing back network streams at around 10Mbit/s and 720p resolution we see almost entirely garbled streams. Playing the same content from a file is fine. Playing the content from a file while the file is being fed from a network stream causes the same impact as direct stream playback, which makes me think we might be hitting some bandwidth limitation. With omxplayer the same streams can be played, though.

popcornmix commented 11 years ago

Does it look like the display output is breaking up (e.g. through fifo underrun)? Or packets being lost, causing decoder artifacts?

If the former, then the HVS by default does format conversion/resize/composition on the fly, and when memory bandwidth is very high it can underrun. Adding "dispmanx_offline=1" to config.txt makes the HVS work in an offscreen buffer mode, which may fix this.
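For reference, that is a one-line change to the boot configuration:

    # /boot/config.txt
    dispmanx_offline=1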

The latter sounds like a software bug somewhere. Is the ARM 100% busy? Does overclocking help?

julianscheel commented 11 years ago

I would say it's actually packets being dropped. Audio is affected as well. But I will try the dispmanx_offline option and see if it improves things.

julianscheel commented 11 years ago

CPU is not at 100%, but when the artefacts start it rises to ~95%, so it's quite high then. Playing the same video from a file is at about 60%.

popcornmix commented 11 years ago

I would expect lost packets to be a host (gstreamer) issue. I don't believe the GPU drops (input) packets under any circumstances. When running behind, it may drop output frames (with presentation time earlier than the current time) and it may adjust the clock if audio packets are significantly late (resulting in slowed-down video).

What's the network connection? NFS is significantly more performant than SAMBA. Carefully chosen NFS mount options can help. E.g. I use:

sudo mount 192.168.4.9:/Public -o _netdev,nfsvers=3,rw,intr,noatime,rsize=32768,wsize=32768,nolock,async,proto=udp /home/pi/dell

julianscheel commented 11 years ago

The data was not read from a file share but from an rtsp/udp stream. But I will try with an nfs share tomorrow to see if the general throughput is fine. I will give you an update as soon as I have tested this.

popcornmix commented 11 years ago

rtsp/udp will drop packets when the ARM is too busy to keep up. TCP/nfs will also drop packets when busy, but handles retransmission, so throughput may be reduced but data won't be lost in the network stack and you shouldn't get corrupt video (although it may not keep up).

sdroege commented 11 years ago

Yeah, that's actually a good point. From what I saw, the new firmware and the discard=false mode work perfectly fine now; there's just a performance problem somewhere, and that seems to be limited by the CPU.

julianscheel commented 11 years ago

Actually we are forced to use UDP in this case. But as omxplayer is able to receive and play back the same stream, I assume that in general the system is able to cope with it. What's interesting is that even when only decoding audio with our gstreamer solution it starts to drop out after 1 or 2 seconds, which makes me doubt that there is a general CPU load issue; it seems more like some other limitation. I will do some more tests now and let you know how it goes.

sdroege commented 11 years ago

As this seems to work now with the new firmware, when is it expected to be included in the official firmware (or at least the start_x.elf variant)?

popcornmix commented 11 years ago

It's in the "rpi-update" firmware. There's been some useful fixes to USB and audio, so I'd imagine there'll be a new official image and "apt-get" firmware update soon. Certainly within the next month.

sdroege commented 11 years ago

Ah great, didn't notice it was in rpi-update already :) Thanks!