rockchip-linux / mpp

Media Process Platform (MPP) module

rk_vcodec: reg_init:1248: error: translate reg address failed, dumping regs #632

Open wn2000 opened 2 months ago

wn2000 commented 2 months ago

Platform: RK3328 running a buildroot based Linux. Kernel: 4.4.159

I'm working on a game selection frontend app that plays two simultaneous videos: a background video and a game preview. Both cycle through a list of video files.

Any insights or suggestions to fix this behavior or gather more information would be greatly appreciated!

HermanChen commented 2 months ago

Please check whether there is an fd leak in the app: check whether the number of open fds in the app keeps increasing. The most likely cause is that fds are leaking while the app runs. If the fd count rises above 1024, the kernel driver will generate a translation error.
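One way to watch the fd count from outside the app is to count the entries under /proc/<pid>/fd at regular intervals. The following is a minimal Python sketch, assuming a Linux /proc filesystem. The names `count_open_fds`, `fd_trend` and the `proc_root` parameter are illustrative and not part of mpp, and 1024 is simply the threshold quoted above, not a value read from the driver.

```python
import os

# Threshold quoted in the comment above; an assumption, not queried from the kernel.
FD_WARN_LIMIT = 1024


def count_open_fds(pid="self", proc_root="/proc"):
    """Return the number of entries in <proc_root>/<pid>/fd."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_trend(samples):
    """Return the change between each pair of successive fd counts."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def is_near_limit(count, limit=FD_WARN_LIMIT):
    """True once the fd count has reached the warning threshold."""
    return count >= limit
```

Calling `count_open_fds(<pid of the app>)` every few seconds and passing the collected counts to `fd_trend` shows whether the count only ever grows across file switches. Reading another process's fd directory typically requires running as the same user or as root.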

wn2000 commented 2 months ago

Thanks for the suggestion. I will keep an eye on the fd count.

One more test I did was to switch the game preview video to software decoding and use mpp only on the background video.

That seems to have stabilized the app. It's been running for over 24 hours without an issue.

Do I need to apply some thread locking when two threads are accessing mpp at the same time? They have separate mpp contexts.

HermanChen commented 2 months ago

No need to lock between different mpp contexts. All mpp contexts are independent of each other.

wn2000 commented 1 month ago

I re-enabled mpp on both videos and monitored the opened fd count.

Within an hour the error appeared again. But the fd count is not high (<100).

I did notice that before the translate reg address failed error, there were these errors in dmesg:

...
[142488.097327] rk-vcodec ff360000.rkvdec: can not find 3348 buffer in list
[142488.157297] rk-vcodec ff360000.rkvdec: can not find 3345 buffer in list
[142488.157912] rk-vcodec ff360000.rkvdec: can not find 3345 buffer in list
[142488.158523] rk-vcodec ff360000.rkvdec: can not find 3289 buffer in list
[142488.159121] rk-vcodec ff360000.rkvdec: can not find 3289 buffer in list
[142488.159749] rk-vcodec ff360000.rkvdec: can not find 3296 buffer in list
[142488.160346] rk-vcodec ff360000.rkvdec: can not find 3296 buffer in list
[142488.160974] rk-vcodec ff360000.rkvdec: can not find 3345 buffer in list
[142488.161578] rk-vcodec ff360000.rkvdec: can not find 3345 buffer in list
[142488.226703] rk-vcodec ff360000.rkvdec: can not find 3364 buffer in list
[142488.227305] rk-vcodec ff360000.rkvdec: can not find 3364 buffer in list
...

But when those errors pop up, the videos both play fine.

At some point, when the rk_vcodec: reg_init:1248: error: translate reg address failed, dumping regs error happens, one of the videos chokes while the other still plays fine. Then, when it's time for the other video to open a new file, that video chokes too. At that point, only a reboot fixes the problem.

It is strange, though: if I only use one video, there's no issue at all (tested for 2 days nonstop).

HermanChen commented 1 month ago

It is obviously a buffer leak. There are too many buffers in the list. There may be a buffer leak on file switch. When a file reaches its end, the decoder should be given an EOS packet, and the app should wait for the EOS output frame and then exit.
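The shutdown order described above (send an EOS packet, keep pulling frames, release each one, stop only at the EOS frame) can be modelled with a small Python sketch. `FakeDecoder` and `Frame` are hypothetical stand-ins written for this illustration and are not the MPP API. In MPP's C API the corresponding calls would be `mpp_packet_set_eos`, `decode_put_packet`, `decode_get_frame`, `mpp_frame_get_eos` and `mpp_frame_deinit`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Frame:
    index: int
    eos: bool = False


class FakeDecoder:
    """Hypothetical stand-in for a decoder context; not the MPP API."""

    def __init__(self):
        self._pending = deque()
        self._count = 0
        self.live_frames = 0  # frames handed out but not yet released

    def put_packet(self, data, eos=False):
        # Each data packet yields one frame in this toy model.
        if data is not None:
            self._pending.append(Frame(self._count))
            self._count += 1
        if eos:
            self._pending.append(Frame(self._count, eos=True))

    def get_frame(self):
        if not self._pending:
            return None
        self.live_frames += 1
        return self._pending.popleft()

    def release(self, frame):
        self.live_frames -= 1


def play_to_end(dec, packets):
    """Feed all packets, send EOS, then drain until the EOS frame arrives."""
    for data in packets:
        dec.put_packet(data)
    dec.put_packet(None, eos=True)  # signal end of stream

    shown = 0
    while True:
        frame = dec.get_frame()
        if frame is None:
            # A real decoder would need a short wait and retry here.
            raise RuntimeError("decoder ran dry before the EOS frame")
        is_eos = frame.eos
        if not is_eos:
            shown += 1  # display the frame here
        dec.release(frame)  # every frame is released, including the EOS frame
        if is_eos:
            return shown
```

In this model, `live_frames` returning to zero after `play_to_end` is the property to check before closing a context and opening the next file.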

wn2000 commented 1 month ago

Ahhh ok. That makes sense. I will check the code. Thanks!

wn2000 commented 1 month ago

One more question: When this happens, is there a way to "reset" the rkvdec device to a good state?

Currently, even if I kill the app and relaunch, it still gets stuck.

HermanChen commented 1 month ago

Close all decoder instances and free all buffers. Then cat /sys/kernel/debug/dma_buf/bufinfo to check whether any buffers remain in the list.

wn2000 commented 1 month ago

Awesome thanks! So I killed the app, which is the only app that uses the rkvdec device. But when I do cat /sys/kernel/debug/dma_buf/bufinfo, I still get numerous buffers that look like the following:

...
00004096        00000002        00000007        00000008        drm
        Attached Devices:
        ff360000.rkvdec
        ff360000.rkvdec
        ff360000.rkvdec
        ff360000.rkvdec
        ff360000.rkvdec
        ff360000.rkvdec
        ff360000.rkvdec
        ff360000.rkvdec
Total 8 devices attached

00012288        00000002        00000007        00000004        drm
        Attached Devices:
        ff360000.rkvdec
        ff360000.rkvdec
        ff360000.rkvdec
        ff360000.rkvdec
Total 4 devices attached
...

How do I "free" them?

HermanChen commented 1 month ago

Check the buffer's holder. It may be held by both the display process and rkvdec. Then the decoder workflow needs to be checked for unreleased MppFrames.
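To see which devices are still attached to each buffer, the bufinfo text can be parsed into one record per buffer. The following is a minimal Python sketch based only on the output pasted earlier in this thread. It assumes the five columns of each entry line are size, flags, mode, count and exporter name, which may differ on other kernels; `parse_bufinfo` and `held_by` are illustrative names.

```python
import re

# Entry line as seen in the pasted output, e.g.
# "00012288  00000002  00000007  00000004  drm"
# Column meaning (size, flags, mode, count, exporter) is an assumption.
ENTRY = re.compile(
    r"^(\d+)\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s+(\d+)\s+(\S+)$"
)


def parse_bufinfo(text):
    """Return a list of dicts with size, exporter and attached device names."""
    buffers = []
    current = None
    in_devices = False
    for line in text.splitlines():
        stripped = line.strip()
        match = ENTRY.match(stripped)
        if match:
            current = {
                "size": int(match.group(1)),
                "exporter": match.group(5),
                "devices": [],
            }
            buffers.append(current)
            in_devices = False
        elif stripped == "Attached Devices:":
            in_devices = True
        elif stripped.startswith("Total "):
            in_devices = False
        elif in_devices and stripped and current is not None:
            current["devices"].append(stripped)
    return buffers


def held_by(buffers, device):
    """Return the buffers that list `device` among their attachments."""
    return [buf for buf in buffers if device in buf["devices"]]
```

Feeding it the contents of /sys/kernel/debug/dma_buf/bufinfo and calling `held_by(buffers, "ff360000.rkvdec")` lists the buffers the decoder is still attached to. Summing their `size` fields gives a rough figure for how much memory is pinned.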

wn2000 commented 1 month ago

https://github.com/user-attachments/assets/0133709e-89ad-408a-ad8a-07df74480f74

I just realized that the issue is not due to playing two videos simultaneously. It's actually caused by some particular videos.

Attached is one such problematic video.

For other "good" videos, when I do cat /sys/kernel/debug/dma_buf/bufinfo |grep 'Total .* devices', I get

Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
...

which looks reasonable. And after exiting the app those dmabufs are gone.

However, when playing the attached video, I got:

Total 4 devices attached
Total 1 devices attached
Total 3 devices attached
Total 6 devices attached
Total 2 devices attached
Total 3 devices attached
Total 4 devices attached
Total 6 devices attached
Total 4 devices attached
Total 4 devices attached
Total 3 devices attached
Total 2 devices attached
Total 3 devices attached
Total 4 devices attached
Total 2 devices attached
Total 5 devices attached
Total 5 devices attached
Total 3 devices attached
Total 5 devices attached
Total 2 devices attached
Total 4 devices attached
Total 2 devices attached
Total 2 devices attached
Total 26 devices attached
Total 3 devices attached
Total 16 devices attached
Total 7 devices attached
Total 35 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 9 devices attached
Total 1 devices attached
Total 6 devices attached
Total 15 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 1 devices attached
Total 0 devices attached

And after exiting the app those dmabufs are still there.
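The two listings above can be compared mechanically by reducing each bufinfo snapshot to a few numbers. The following is a minimal Python sketch; `attachment_counts` and `summarize` are illustrative names. Treating more than one attachment per buffer as suspicious is an assumption drawn only from the "good" output above, where every buffer shows exactly one.

```python
import re
from collections import Counter

# Matches the summary line that closes each buffer entry in bufinfo.
TOTAL_LINE = re.compile(r"^Total (\d+) devices attached\s*$", re.MULTILINE)


def attachment_counts(text):
    """Return the attachment count of every buffer, in file order."""
    return [int(n) for n in TOTAL_LINE.findall(text)]


def summarize(text):
    """Reduce a bufinfo snapshot to a few comparable numbers."""
    counts = attachment_counts(text)
    return {
        "buffers": len(counts),
        # Assumption: a healthy decoder attaches to each buffer once.
        "multi_attached": sum(1 for n in counts if n > 1),
        "max_attachments": max(counts, default=0),
        "histogram": dict(Counter(counts)),
    }
```

Taking one snapshot while a file plays and another after the app exits makes the difference easy to see: for the "good" videos the second summary should report zero buffers.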

Can you see what's unique about the attached video causing those dmabufs to leak? The video plays fine otherwise.

wn2000 commented 1 month ago

I just checked the above video with mpi_dec_test, and got the same leaking result. Here is the raw h264 stream: derbyoc2.zip

Does that mean the issue is actually in mpp?

wn2000 commented 1 month ago

@HermanChen Hi, just wondering whether you could reproduce the resource leak when playing the above video clip, or whether it's just my setup. Thanks!

HermanChen commented 1 month ago

Are you using the MPP develop branch?

wn2000 commented 1 month ago

Yea, using the develop branch. I tried with mpi_dec_test and observed the dma_buf leak with that particular h264 stream. Other streams work just fine, so I don't know what's unique about that one.

wenyue7 commented 1 month ago

You can upload the following files so that I can confirm the version information of mpp on your platform: libmpp.so, libvpu.so

kernel: kernel/drivers/video/rockchip/vpu, kernel/drivers/video/rockchip/vcodec

wn2000 commented 1 month ago

Hi. Here is the mpp lib. It was built from this repository's develop branch. librockchip_mpp.so.0.zip

I do not use the librockchip_vpu.so library (the application works fine without that library).

For the kernel, I do not have the source or development package. I'm using the stock system provided by the device manufacturer, and only run my own user-space application on top of it.

The uname -a output is:

4.4.159 #4 SMP Mon Jun 12 09:45:25 CST 2023 aarch64 GNU/Linux

The board is RK3328.

Is there any other info I could provide to help troubleshoot the problem? I guess you were not able to reproduce the resource leak using the h264 file I uploaded?

wenyue7 commented 1 month ago

Yes, I cannot reproduce the issue using the file you uploaded.