raspberrypi / picamera2

New libcamera-based Python library
BSD 2-Clause "Simplified" License

[BUG] Potential kernel issue relating to v4l2, which I believe affects picamera2 #985

Closed chrisruk closed 7 months ago

chrisruk commented 8 months ago

Apologies, I've just edited the following bug report: I realised I'd only grepped recent journalctl entries, and it looks like the issue has occurred sporadically since at least last December.

Describe the bug

I noticed I didn't seem to be getting many videos from my Raspberry Pi camera running capture_circular_stream.py.

After looking in my journal, I found interesting kernel messages (listed later).

To Reproduce

Leave capture_circular_stream.py running over a long time period and monitor journalctl.
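
For anyone wanting to automate the monitoring side, here is a minimal sketch of a watcher that follows the kernel log and flags the messages shown below; it assumes journalctl is available and that the user is allowed to read the journal.

```python
# Hypothetical watcher: follow kernel messages and flag hung-task reports.
import subprocess

proc = subprocess.Popen(
    ["journalctl", "-kf", "--no-pager"],  # -k: kernel only, -f: follow
    stdout=subprocess.PIPE,
    text=True,
)
for line in proc.stdout:
    if "blocked for more than" in line or "vc_sm_cma_vchi_rx_ack" in line:
        print("possible hang detected:", line.strip())
```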

Expected behaviour

No kernel issue.

Console Output, Screenshots

Mar 15 23:42:31 pi2 kernel: vc_sm_cma_vchi_rx_ack: received response 2476813, throw away...
Mar 15 23:42:31 pi2 kernel: INFO: task capture_circula:817 blocked for more than 120 seconds.
Mar 15 23:42:31 pi2 kernel:       Tainted: G         C         6.6.20+rpt-rpi-v8 #1 Debian 1:6.6.20-1+rpt1
Mar 15 23:42:31 pi2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 23:42:31 pi2 kernel: task:capture_circula state:D stack:0     pid:817   ppid:1      flags:0x00000004
Mar 15 23:42:31 pi2 kernel: Call trace:
Mar 15 23:42:31 pi2 kernel:  __switch_to+0xe8/0x168
Mar 15 23:42:31 pi2 kernel:  __schedule+0x380/0xd70
Mar 15 23:42:31 pi2 kernel:  schedule+0x64/0x108
Mar 15 23:42:31 pi2 kernel:  schedule_preempt_disabled+0x2c/0x50
Mar 15 23:42:31 pi2 kernel:  __mutex_lock.constprop.0+0x2ac/0x608
Mar 15 23:42:31 pi2 kernel:  __mutex_lock_slowpath+0x1c/0x30
Mar 15 23:42:31 pi2 kernel:  mutex_lock+0x50/0x68
Mar 15 23:42:31 pi2 kernel:  v4l2_m2m_fop_poll+0x38/0x80 [v4l2_mem2mem]
Mar 15 23:42:31 pi2 kernel:  v4l2_poll+0x54/0xc8 [videodev]
Mar 15 23:42:31 pi2 kernel:  do_sys_poll+0x2a4/0x5b8
Mar 15 23:42:31 pi2 kernel:  __arm64_sys_ppoll+0xb0/0x140
Mar 15 23:42:31 pi2 kernel:  invoke_syscall+0x50/0x128
Mar 15 23:42:31 pi2 kernel:  el0_svc_common.constprop.0+0x48/0xf0
Mar 15 23:42:31 pi2 kernel:  do_el0_svc+0x24/0x38
Mar 15 23:42:31 pi2 kernel:  el0_svc+0x40/0xe8
Mar 15 23:42:31 pi2 kernel:  el0t_64_sync_handler+0x100/0x130
Mar 15 23:42:31 pi2 kernel:  el0t_64_sync+0x190/0x198

Hardware: Pi 4 and Raspberry Pi Auto Focus Camera

Many thanks!

davidplowman commented 8 months ago

Hi Chris, thanks for the report. Any idea how long it needs to be running? Is several days enough, do you think? And do you think it's still an issue in the latest 6.6 kernel, which was generally released at the end of last week?

@lurch Andrew, I don't suppose I could tempt you into giving this one a try (pretty please)?

lurch commented 8 months ago

> @lurch Andrew, I don't suppose I could tempt you into giving this one a try (pretty please)?

Sure, I'll set up a test system this afternoon.

ajkelsey commented 8 months ago

@chrisruk I'm having a similar issue (#986). How many days did your script run? Mine ran for about 250 hours. Do you know how many files were written before it started to write 0 bytes? I'm wondering if it's a number of files or a length of time. I've started a counter in my script to collect that information.
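
(For what it's worth, a minimal sketch of such a counter, run alongside the capture script; the output directory and file extension here are assumptions about where the recordings land.)

```python
# Hypothetical check: count recordings and flag any zero-byte files.
from pathlib import Path

VIDEO_DIR = Path("/home/pi/videos")  # hypothetical output directory

files = sorted(VIDEO_DIR.glob("*.h264"))
empty = [f for f in files if f.stat().st_size == 0]
print(f"{len(files)} files written, {len(empty)} of them zero bytes")
for f in empty:
    print("zero-byte file:", f)
```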

chrisruk commented 8 months ago

Hi David,

I'm fairly sure the first time I noticed this was with the latest kernel (6.6.20+rpt-rpi-v8).

I've grepped the logs for 'blocked for'; alas, it seems to occur very infrequently for me, so it may be extremely hard to replicate.

It seems to occur a number of times in a row, but it looks like I've only seen it on Dec 09, Feb 14, Mar 15, and Mar 16.

Note that my script starts/stops each day via cron, running for around 13 hours a day.

@ajkelsey I'm not running the same process continuously, I'm afraid, and I don't think I've seen 0-byte files, unless I'm mistaken.

davidplowman commented 8 months ago

Well, let's put this on the list of things to try to reproduce - that's always the first thing to do!

lurch commented 7 months ago

I've now been running the capture_circular_stream.py script continuously for a whole week, and haven't seen any errors. @chrisruk Is there still plenty of free disk-space on your SD card? (I'm just guessing, but I wonder if a lack of disk-space might cause a hung_task_timeout? :shrug: )
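
(A quick way to rule disk space in or out from within the script itself, using only the standard library; the mount point here is an assumption.)

```python
# Report free space on the filesystem holding the recordings.
import shutil

total, used, free = shutil.disk_usage("/")  # assumes recordings live on the root filesystem
print(f"free: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB "
      f"({100 * free / total:.0f}%)")
```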

davidplowman commented 7 months ago

It seems quite suspicious to me that the vchi error message has now also been observed in issue https://github.com/raspberrypi/picamera2/issues/986. So I do believe there's something going on here; it's just proving very difficult to provoke at our end.

On that other issue, the following suggestion has been made:

> Running with start_debug=1 in config.txt, and using sudo vclog -m and sudo vclog -a after it goes wrong would be useful.

Chris, would you be able to give that a try, then maybe we can learn something more if it happens to you again?

chrisruk commented 7 months ago

I seem to have plenty of disk space @lurch -

/dev/root        58G  7.8G   47G  15% /

David, sure, I'll enable that debug option. Does sudo vclog -a need to be run very quickly after the hung task timeout? If so, I could probably script it.
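
For reference, one way such a script might look; this is a hypothetical sketch, assuming it runs as root (so vclog needs no sudo) and that a kernel-journal line containing "blocked for more than" is the trigger.

```python
# Hypothetical watcher: when the kernel reports a hung task, snapshot the
# VideoCore assert log (vclog -a) and message log (vclog -m) to files.
import subprocess
import time

proc = subprocess.Popen(
    ["journalctl", "-kf", "--no-pager"],
    stdout=subprocess.PIPE,
    text=True,
)
for line in proc.stdout:
    if "blocked for more than" in line:
        stamp = time.strftime("%Y%m%d-%H%M%S")
        for flag in ("-a", "-m"):
            result = subprocess.run(["vclog", flag], capture_output=True, text=True)
            with open(f"/var/log/vclog{flag}-{stamp}.txt", "w") as f:
                f.write(result.stdout)
```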

davidplowman commented 7 months ago

@6by9 Could you perhaps answer Chris's question, re. whether sudo vclog -a should be run immediately? Thanks!

6by9 commented 7 months ago

There should be relatively few asserts in the firmware on a normal system, so timing shouldn't be critical. (If you're getting lots of asserts logged in normal situations, then let us know, as they ought to be fixed.)

By default there is relatively little logging enabled in the firmware, so whilst entries will churn through the vclog -m buffer, it again shouldn't be too verbose.

chrisruk commented 7 months ago

Thanks for that information! Will post an update if the issue occurs again, with more log info.

chrisruk commented 7 months ago

I've just been looking at my latest 'dmesg' output and noticed this -

[Thu Apr  4 22:45:32 2024] mmc0: Card stuck being busy! __mmc_poll_for_busy
[Thu Apr  4 22:57:17 2024] mmc0: Card stuck being busy! __mmc_poll_for_busy

Maybe my SD card is on its way out; I'll replace it. I also added a print message on each request, and noticed that for some reason it stops early in the morning each day. I'm hoping this is just related to the SD card being iffy.

Apr 05 02:57:15 pi2 capture_circular_stream.py[15569]: --call handle request 1712282235.357921
Apr 05 02:57:15 pi2 capture_circular_stream.py[15569]: --call handle request 1712282235.3908792
Apr 05 02:57:15 pi2 capture_circular_stream.py[15569]: --call handle request 1712282235.4242787

Will post an update after switching.
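
For anyone wanting to reproduce that kind of per-request trace, here is a minimal sketch using Picamera2's post_callback hook; this is an assumption about how the print could be wired in, not the actual script.

```python
# Hypothetical sketch: log a timestamp for every completed camera request.
import time

from picamera2 import Picamera2

picam2 = Picamera2()
picam2.configure(picam2.create_video_configuration())

def log_request(request):
    # Runs for each completed request before it is delivered.
    print("--call handle request", time.time())

picam2.post_callback = log_request
picam2.start()
time.sleep(30)  # the real script runs for ~13 hours
picam2.stop()
```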

davidplowman commented 7 months ago

Hi Chris, and thanks for the update. I've been running the other case we have showing these vc_sm_cma_vchi_rx_ack errors (https://github.com/raspberrypi/picamera2/issues/986) for about 10 days now without any problems. Given my failure to provoke any errors, and the fact that I'm on the latest OS and kernel (which is 6.6), I wonder if it might be worth everyone making sure they're fully up to date (usual caveats about backing stuff up apply!), just to be sure that I'm chasing something that is still there. But the SD card tip is probably a good one; it's always something to try (like dodgy power supplies or long cables!).

chrisruk commented 7 months ago

Hi David, I've been using the 6.6.22-v8+ kernel for a little while now, along with keeping userspace programs up to date.

I just got the original dmesg messages today -

[Sat Apr  6 08:21:45 2024] mmc0: Card stuck being busy! __mmc_poll_for_busy
[Sat Apr  6 08:43:01 2024] vc_sm_cma_vchi_rx_ack: received response 22492819, throw away...
[Sat Apr  6 08:46:03 2024] INFO: task capture_circula:21524 blocked for more than 120 seconds.
[Sat Apr  6 08:46:03 2024]       Tainted: G         C         6.6.22-v8+ #1747
[Sat Apr  6 08:46:03 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Apr  6 08:46:03 2024] task:capture_circula state:D stack:0     pid:21524 ppid:1      flags:0x0000000c
[Sat Apr  6 08:46:03 2024] Call trace:
[Sat Apr  6 08:46:03 2024]  __switch_to+0xec/0x1d8
[Sat Apr  6 08:46:03 2024]  __schedule+0x2f0/0xac0
[Sat Apr  6 08:46:03 2024]  schedule+0x68/0x110
[Sat Apr  6 08:46:03 2024]  schedule_preempt_disabled+0x2c/0x50
[Sat Apr  6 08:46:03 2024]  __mutex_lock.isra.17+0x3e4/0xa38
[Sat Apr  6 08:46:03 2024]  __mutex_lock_slowpath+0x20/0x38
[Sat Apr  6 08:46:03 2024]  mutex_lock+0x2c/0x60
[Sat Apr  6 08:46:03 2024]  v4l2_m2m_fop_poll+0x38/0x78 [v4l2_mem2mem]
[Sat Apr  6 08:46:03 2024]  v4l2_poll+0x54/0xc8 [videodev]
[Sat Apr  6 08:46:03 2024]  do_sys_poll+0x264/0x548
[Sat Apr  6 08:46:03 2024]  __arm64_sys_ppoll+0xbc/0xf8
[Sat Apr  6 08:46:03 2024]  invoke_syscall+0x4c/0x118
[Sat Apr  6 08:46:03 2024]  el0_svc_common.constprop.1+0x98/0xf8
[Sat Apr  6 08:46:03 2024]  do_el0_svc+0x24/0x38
[Sat Apr  6 08:46:03 2024]  el0_svc+0x48/0xe8
[Sat Apr  6 08:46:03 2024]  el0t_64_sync_handler+0xa0/0xc8
[Sat Apr  6 08:46:03 2024]  el0t_64_sync+0x190/0x198

I did sudo vclog -a:

3025667.988: assert( timeout != 0 ) failed; ../../../../../vcfw/drivers/chip/vciv/2708/clock_2711.c::clock_wait_busy line 2152
----------------
1094501.493: assert( timeout != 0 ) failed; ../../../../../vcfw/drivers/chip/vciv/2708/clock_2711.c::clock_wait_busy line 2152

I'm not sure whether those messages are related to the dmesg output, though.

It looks like my camera stopped capturing video at:

Apr 06 08:42:51 pi2 capture_circular_stream.py[21510]: --call handle request 1712389371.0460396

I'm hopefully getting the new SD card today, so will replace soon.

Edit: Fitted the new SD card; I'll wait a week, see what happens, and post an update then.

chrisruk commented 7 months ago

Hi David, replacing the SD card means videos are now recorded throughout the whole time period my script runs :) (Before, it seemed to stop receiving new frames after a while.)

Feel free to close this issue now. I'm thinking the hung-task error might have been related to issues with the SD card.

davidplowman commented 7 months ago

Cool, thanks for the update! If it comes back... you know where we are :)