raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/
Other
11.03k stars 4.95k forks source link

kernel oops in v4l - playing video in kodi twitch add-on #4831

Open greatquux opened 2 years ago

greatquux commented 2 years ago

Describe the bug

While playing video in the Twitch add-in, if the add-in needs to play an ad mid-stream, it has been reliably crashing and causing a kernel Oops. This is the latest kodi from raspbian bullseye archives, and regular 32-bit armhf raspbian bullseye, on pi 400.

Steps to reproduce the behaviour

install kodi and twitch add-in, use hardware acceleration (which works great for all other videos and the youtube add-in, and certainly doesn't cause kernel crashes!); play any channel you are not subscribed to which should eventually cause ads, I got the crash documented in the logs while viewing the mst3k channel https://www.twitch.tv/mst3k

Device (s)

Raspberry Pi 400

System

pi@raspberrypi ~> cat /etc/rpi-issue Raspberry Pi reference 2020-08-20 Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 9a3a10bf1019ebb2d59053564dc6b90068bad27d, stage5

pi@raspberrypi ~> vcgencmd version Nov 18 2021 16:16:49 Copyright (c) 2012 Broadcom version d9b293558b4cef6aabedcc53c178e7604de90788 (clean) (release) (start)

pi@raspberrypi ~> uname -a Linux raspberrypi 5.10.63-v7l+ #1488 SMP Thu Nov 18 16:15:28 GMT 2021 armv7l GNU/Linux

Logs

crash logs were retrieved with command journalctl -o short-precise -k -b -2 dmesg_logs_twitch_crash.log

the Oops info was displayed on an SSH section I had connected at time of crash:

pi@raspberrypi ~> 

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.822613] Internal error: Oops: 207 [#1] SMP ARM

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.881009] Process kworker/3:2 (pid: 6879, stack limit = 0xb4db5dde)

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.883769] Stack: (0xc641bef8 to 0xc641c000)

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.886455] bee0:                                                       c032f30c c020cb1c

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.889198] bf00: efecd6c0 efecd6c0 efecd6c0 c6764380 c6764394 efecd6c0 00000008 efecd6d8

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.891940] bf20: c1203d00 efecd6c0 c641bf74 c641bf38 c023df18 c023d924 c0b93864 c0e221a0

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.894770] bf40: c133d1de ffffe000 c641bf74 c665a280 c5084400 00000000 c641a000 c023deb8

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.897470] bf60: c6764380 c5137e74 c641bfac c641bf78 c02459f8 c023dec4 c665a2a4 c665a2a4

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.900288] bf80: ffffe000 c5084400 c0245888 00000000 00000000 00000000 00000000 00000000

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.903020] bfa0: 00000000 c641bfb0 c02000ec c0245894 00000000 00000000 00000000 00000000

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.905723] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.908433] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.911183] Backtrace: 

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.913868] [<c023d918>] (process_one_work) from [<c023df18>] (worker_thread+0x60/0x5c4)

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.916562]  r10:efecd6c0 r9:c1203d00 r8:efecd6d8 r7:00000008 r6:efecd6c0 r5:c6764394

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.919456]  r4:c6764380

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.922178] [<c023deb8>] (worker_thread) from [<c02459f8>] (kthread+0x170/0x174)

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.924972]  r10:c5137e74 r9:c6764380 r8:c023deb8 r7:c641a000 r6:00000000 r5:c5084400

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.927702]  r4:c665a280

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.930399] [<c0245888>] (kthread) from [<c02000ec>] (ret_from_fork+0x14/0x28)

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.933193] Exception stack(0xc641bfb0 to 0xc641bff8)

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.935979] bfa0:                                     00000000 00000000 00000000 00000000

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.938741] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.941508] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.944312]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c0245888

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.946986]  r4:c5084400

Message from syslogd@raspberrypi at Jan 13 21:06:04 ...
 kernel:[82830.949601] Code: e1a05000 e2137004 13c370ff e5963010 (e5972004) 
Timeout, server raspberrypi.local not responding.

Additional context

No response

popcornmix commented 2 years ago

The oops backtrace doesn't provide any information but the dmsg log indicates there has been some non-fatal unhappiness (WARNs) in video driver ending with a more fatal null dereference.

If you disable hw acceleration I assume it plays okay?

I've tried installing twitch addon, and it is playing mst3k channel. How long is it usually before crashing? It's been playing about 25 mins so far without a crash.

greatquux commented 2 years ago

I haven't tried to disable h/w acceleration just yet, and I do want to confirm this continues to occur in the latest kernel just released a few days ago. I just picked the mst3k channel because it had mid-roll ads (which were what was crashing). I will try to do some more tests tonight.

popcornmix commented 2 years ago

just picked the mst3k channel because it had mid-roll ads

How frequent were the ads? I think I stopped it after about 40 minutes (where it was still running).

greatquux commented 2 years ago

I had an issue with the Twitch plugin still displaying ads on channels I'd subscribed to (including that one) but I've since fixed that by putting the right info in my addon's XML file. So I was jumping around last night (from around 22:10 to 22:17) trying to find a channel to display ads. I finally left it on loltyler1 (at 22:17) and it seems like it took about 40 minutes (coincidentally) to give me the Oops again (at 22:56) and then become completely unresponsive and require a hardware reboot at 23:03. So sometimes it needs to play for a while to get to one of these ads. But interestingly in the included dmesg log we start seeing WARNINGs before I settle on the one channel. twitch_prob_dmesg.log

I will try to turn off acceleration today and see if I can just leave it on a channel for a long time without crashing to definitely say it's related to hardware acceleration.

popcornmix commented 2 years ago

I will try to turn off acceleration today and see if I can just leave it on a channel for a long time without crashing to definitely say it's related to hardware acceleration.

I expect it will be related to hw acceleration, but always good to be sure. I'll leave the twitch test running longer when convenient.

greatquux commented 2 years ago

Actually I turned off h/w acceleration (allow prime) and then went into a Twitch channel, which properly displayed a pre-roll ad that looked fine (and I usually don't even see the ads), but then went the ad ended and it displayed the channel video it was all screwed up, colors wrong and video corrupted. I stopped it and turned it back on and went back into the channel (no ad this time, probably because it just displayed one) and video was normal, so I'm leaving it on there since it seems to be unusable without acceleration.

popcornmix commented 2 years ago

I've just tried all combinations of "Allow using DRM PRIME decoder" and "PRIME Render Method" and all seemed to start playing video okay. I didn't seem to start with an ad though. Can you grab a debug enabled kodi log? If it's failing without hardware acceleration, then it's probably a kodi issue. Also might be worth testing the same version of kodi (is this 19.3?) on another platform (e.g. windows or linux on PC).

greatquux commented 2 years ago

Do you want me to get the debug log with acceleration on or off? I didn't run it long enough to see if it crashes the kernel without acceleration because it just wasn't displaying correctly for me (if you want me to take a picture let me know).

popcornmix commented 2 years ago

If it's going wrong without acceleration then a log without acceleration should be fine. A picture could be useful too.

greatquux commented 2 years ago

Ok, I have attached a debug log playing Twitch for a while (hour?) without acceleration. No hangs, but corrupted video display. kodi_debug_log_no_hwaccel.log Accel settings: 2022-01-25 12 20 16 Example of corrupted video display (pre-roll ad): 2022-01-25 12 21 00 Example of corrupted video display (channel): 2022-01-25 12 21 23

I can try to get a debug log later on with acceleration on (at least, before it dies) but a kid is coming home from school now and will need to watch his favorite movie so this box is going to go into production mode (haha) until he goes to bed.

popcornmix commented 2 years ago

Hmm photos don't look like a decode error, but more an alignment error of luma/chroma. I wonder if stream changes resolution and that isn't handled properly. Can you test with "PRIME Render method" set to "EGL"?

greatquux commented 2 years ago

I left "Allow using DRM PRIME Decoder" off and changed PRIME Render Method to EGL but there was no change in the display, it showed the same corruption. I turned Allow using DRM to on, and left render method at EGL, and the video was displayed properly then (I quit the stream before I could run into the crashing issue and reset render method back to Direct to Plane).

popcornmix commented 2 years ago

Also note there are two options for acceleration "Allow using DRM decoder" and "Allow hw accel with DRM PRIME" which may behave differently. The "DRM decoder" means we use DRM buffers so can decode and display with zero copy. In addition to that the decoder can be SW (ffmpeg) or HW (V4L2).

greatquux commented 2 years ago

Ah I get it now. Later tonight I will test with "Allow using DRM decoder" ON, but "Allow hw accel" OFF and see what happens (and with direct and EGL). In terms of the crash though it's likely some V4L2 bug I guess?

greatquux commented 2 years ago

With "allow hw accel" OFF but "allow using DRM" ON, Twitch looked fine, of course it used a lot of CPU, but did not give me any oops or crash kernel, left it running all night. EGL would work for a little while then run into problems with stuttering (with other videos too, even if hwaccel was on), Direct to Plane works just fine (with hwaccel on or off). So the kernel crashing issue does appear to be related to V4L2 hardware acceleration.

popcornmix commented 2 years ago

I suspect the colour being wrong and kernel crashing are different symptoms of the same problem (not handing a change in dimensions correctly). So I'm interested in which setting are: 1) perfect 2) no crash but colours corrupted 3) kernel crash

Also when colour go wrong, can you check the OSD info ('o' key) for video stream resolution. Does it change when it goes wrong.

greatquux commented 2 years ago

Ok sorry it took so long but I was able to run a bunch of tests finally. Combinations of settings are described below. "Only Allow h/w accel w/ HEVC" is OFF for all settings below.

  1. perfect - the perfect settings that result in colors rendered correctly, low CPU, and no crashes (except for the crash reported above which only seems to occur in the Twitch plugin after a period of time, probably related to mid-roll ads) are: Allow using DRM Prime ON; Allow h/w acceleration with DRM PRIME ON; Prime Method Direct to Plane I confirmed this again this morning just leaving Twitch on some random channel that I figured would play ads and in about 10-15 minutes the kernel crash occurred (although I wasn't in debug mode or connected to get any log or oops).

If willing to accept more CPU usage, then Allow using DRM Prime ON; Allow h/w acceleration with DRM PRIME OFF; Prime Method Direct to Plane will also render colors correctly and not crash. (ran a Twitch channel for about 45 min)

  1. no crash but colors corrupted - anytime I set Allow using DRM Prime OFF I will get corrupted colors, no matter the other settings. And this is with a regular H.264 movie, images are shown below.

  2. kernel crash - I have only gotten the kernel to crash as described in the initial report when using the otherwise "perfect" settings. I have gotten kodi to be unusable within a few minutes whenever I use Prime Method EGL Kodi just stops rendering, becomes almost unusable until I stop the video, and then doesn't work correctly until I reboot. No matter the other settings, EGL always gets into that state within a few minutes of playing any video (I tested with just a regular H.264 movie).

IN regards to your question about the resolutions, there are two images below, and the video stream resolution is shown differently in both of them.

Images: Allow using DRM Prime OFF - corrupted colors: 2022-02-09 11 30 53

Allow using DRM Prime ON - colors are good: 2022-02-09 11 30 11