Decoder external buffer group optimizations

hbiyik commented 11 months ago

https://github.com/rockchip-linux/mpp/blob/d127b5c78b96beb57baca76758834e12d17dfe7c/readme.txt#L391

I am using external buffers for my decoder, and it really boosts performance a lot. The readme line linked above states that 20+ buffers are necessary to satisfy the H.264/H.265 decoder. Is this statement up to date?

When decoding 8K HDR files, this requirement causes OOMs from time to time. Is there a way to optimize this? Here is a simple calculation:

7680x4320 NV15 frame is approx. 75MB
7680x4320 NV15 uncompacted frame is approx. 100MB (I convert to the uncompacted layout with RGA, so I need to allocate P010)
100MB * 20 = 2GB!

That is for only one HDR file, which is too much.
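The arithmetic above can be reproduced with a small helper. This is a minimal sketch, not MPP code: `yuv420_frame_bytes` and `pool_bytes` are hypothetical names, and no stride or height alignment is applied, so a real decoder buffer will be somewhat larger. Under that assumption the packed 10-bit frame comes out near 62MB rather than the roughly 75MB quoted, with the gap presumably due to alignment padding, while the 16-bit (P010-style) frame lands close to the 100MB figure.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes for one 4:2:0 frame when every sample occupies bits_per_sample
 * bits in memory (10 for a packed 10-bit layout, 16 for P010-style).
 * Assumption: no stride/height alignment, so real buffers are larger. */
uint64_t yuv420_frame_bytes(uint32_t width, uint32_t height,
                            uint32_t bits_per_sample)
{
    assert(bits_per_sample > 0 && bits_per_sample <= 16);
    /* one luma sample per pixel plus half as many chroma samples */
    uint64_t samples = (uint64_t)width * height * 3 / 2;
    return (samples * bits_per_sample + 7) / 8;
}

/* Total memory for a pool of identical buffers. */
uint64_t pool_bytes(uint64_t frame_bytes, uint32_t buffer_count)
{
    return frame_bytes * buffer_count;
}
```

Calling `pool_bytes(yuv420_frame_bytes(7680, 4320, 16), 20)` gives just under 2GB, and halving the buffer count to 10 halves that total, which is why the buffer count matters so much at 8K.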

Also, when not enough buffers are committed, I get a deadlock when requesting a buffer. Is this expected, or should an error be returned?

HermanChen commented 11 months ago

FBC mode should be enabled for 8K decoding; otherwise the DDR bandwidth will not be enough. On the RK3588 with two cores enabled, performance can reach 8K@60fps.

It seems that slow decoding is requesting more buffers, or is there a memory leak at a certain stage?

hbiyik commented 11 months ago

@HermanChen thanks for the explanations. Maybe I was not specific and did not explain it well enough.

I use the RK3588 vdpu381 HEVC dual core decoder.

My flow is as below (pseudo code):

MINIMUM_REQUIRED_BUFFER_COUNT = 20
mpp_buffer_group_get_external(&buffer_group, MPP_BUFFER_TYPE_DMA_HEAP);
for (i = 0; i < MINIMUM_REQUIRED_BUFFER_COUNT; i++) {
     /* commit one externally allocated P010 buffer of about 100MB */
     mpp_buffer_commit(buffer_group, P010_buffer_size(100MB));
}
mpi->control(ctx, MPP_DEC_SET_EXT_BUF_GROUP, buffer_group);

for each frame{
     mpi->decode_get_frame(ctx, &mppframe);
}

With that flow, the HEVC decoder can do 8K@60, or even more than 60 depending on the file, so there is no problem here.

The problem for me is that 20 * 100MB of buffers requires too much memory, and when I also run a GNOME desktop environment plus a browser like Firefox, I can easily run out of memory.

What I kindly ask is whether there is a way to reduce MINIMUM_REQUIRED_BUFFER_COUNT=20 to something lower, like 10.

If I make it 10, decode_get_frame deadlocks. I think that is another problem.

HermanChen commented 11 months ago

The 20 buffers of 100MB can be reduced to 10 buffers of 100MB. The DPB size of 8K video can only support a small number of frames. 10 or even fewer buffers is possible, but it depends on the stream syntax.

You can use the MPP_SET_OUTPUT_TIMEOUT control to select non-blocking mode or a proper timeout; decode_get_frame will then return if there is no frame to output.

hbiyik commented 11 months ago

@HermanChen I moved to FBC output and a lot of my problems are solved. The dual core H.264/H.265 decoders are really amazing with FBC; they can hit numbers like 120fps on 8K 10-bit HDR files. So that is really nice.

I can manage the y offset on the output frame, but the AV1 decoder output is weird. I did not dig deeper to analyze what is wrong with it. Do you have any tips on how to handle AFBC output from the AV1 decoder (vdpu981)?

Also, I was expecting the FBC buffer size to be around half of the original yuv420 size (approximately hstride * (vstride + offsety) * 1.5). I checked the mppframeimp_t* of the actual frame: there are 2 sizes, the actual size and fbc_size. They are both the same, and around the actual yuv420 size. So is the information that the compressed buffer would be half the size wrong? Or are the values you provide very safe because you do not know exactly how much compression there will be?

example nv12 afbc frame size for 1920*1080

I am asking because I am trying to optimize the memory use of the external buffers; if I can reduce the buffer size, that is much better.

Thanks for your support.

HermanChen commented 11 months ago

The FBC buffer is composed of two parts: the header and the body. In the worst case the body part will be equal to w * h * 3/2 (the yuv420 size). With the extra header buffer, the total buffer should be larger than the yuv420 size. FBC mode can only save DDR bandwidth; it cannot reduce the buffer size, because the allocation must take the worst case into consideration.
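The sizing rule described here can be sketched as follows. The thread only states that the buffer is a header plus a body whose worst case is w * h * 3/2; the header estimate below (16 bytes per 16x16 pixel block) is an assumption made for illustration and is not confirmed for the Rockchip decoder, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case body size: compression may save nothing, so the body must
 * be able to hold a full yuv420 frame (w * h * 3/2), as stated above. */
uint64_t fbc_body_worst_case(uint32_t w, uint32_t h)
{
    return (uint64_t)w * h * 3 / 2;
}

/* ASSUMPTION: one 16-byte header entry per 16x16 block of pixels.
 * The real header layout is hardware specific; adjust as needed. */
uint64_t fbc_header_estimate(uint32_t w, uint32_t h)
{
    assert(w > 0 && h > 0);
    uint64_t blocks_x = ((uint64_t)w + 15) / 16;
    uint64_t blocks_y = ((uint64_t)h + 15) / 16;
    return blocks_x * blocks_y * 16;
}

/* Allocation size = header + worst-case body, so it is always larger
 * than the plain yuv420 size, never smaller. */
uint64_t fbc_alloc_bytes(uint32_t w, uint32_t h)
{
    return fbc_header_estimate(w, h) + fbc_body_worst_case(w, h);
}
```

Whatever the exact header size is, the total can only grow relative to the uncompressed frame, which matches the observation that the reported FBC size is about the same as the yuv420 size.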

hbiyik commented 11 months ago

> FBC mode can only save DDR bandwidth; it cannot reduce the buffer size, because the allocation must take the worst case into consideration.

So this means that the buffer is allocated at the full frame size, but the hardware will use less of it depending on the compression, and this reduces the bandwidth. Did I get it correct?

hbiyik commented 11 months ago

@HermanChen also any tips on AV1?

hbiyik commented 11 months ago

@HermanChen I think I understand what is going on with the AV1 decoder. It smells like a bug.

The mppframe output for infochange does not contain the y offset and hstride. When the first actual frame is decoded, I get a different hstride and y offset for AFBC. I allocate the external buffers according to the output of infochange, so I get a wrong plane size and offset.

Do you need anything to reproduce this?
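A minimal sketch of why this mismatch matters, using the approximate size formula quoted earlier in the thread (hstride * (vstride + offsety) * 1.5). The function names and the stride values used in the usage note are hypothetical, chosen only to illustrate the effect of a y offset that appears after the buffers were sized.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate bytes needed for one frame, following the formula from
 * the thread: hstride * (vstride + offset_y) * 1.5. */
uint64_t frame_bytes_needed(uint32_t hstride, uint32_t vstride,
                            uint32_t offset_y)
{
    assert(hstride > 0 && vstride > 0);
    return (uint64_t)hstride * ((uint64_t)vstride + offset_y) * 3 / 2;
}

/* Returns 1 when a buffer sized from the infochange frame is still
 * large enough for the geometry reported by an actual decoded frame. */
int allocation_fits(uint64_t allocated_bytes, uint32_t hstride,
                    uint32_t vstride, uint32_t offset_y)
{
    return allocated_bytes >= frame_bytes_needed(hstride, vstride, offset_y);
}
```

For example, a buffer sized with `frame_bytes_needed(1920, 832, 0)` no longer passes `allocation_fits` once a frame reports a y offset of 8 with the same strides, so sizing from infochange alone under-allocates in that case.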

here is an example for 1920x818 AV1 frame infochange frame

actual first frame

HermanChen commented 11 months ago

> > FBC mode can only save DDR bandwidth; it cannot reduce the buffer size, because the allocation must take the worst case into consideration.

> So this means that the buffer is allocated at the full frame size, but the hardware will use less of it depending on the compression, and this reduces the bandwidth. Did I get it correct?

Yes, correct.

> @HermanChen also any tips on AV1?

As for AV1, it seems that there is a bug in the AV1 FBC mode, so it is better to disable AV1 FBC output.

HermanChen commented 11 months ago

We will check whether it is a software issue or a hardware issue.

hbiyik commented 10 months ago

@HermanChen After further investigation, the y offset of the AV1 AFBC frame is actually flip-flopping between 0 and 8 px.

If this were hardware behaviour, that would be very weird hardware.

FumasterLin commented 10 months ago

Yeah, I think it is weird too.

rockchip-linux / mpp

Decoder external buffer group optimizations #453