Can confirm this is happening to me as well. I'm running a large fleet of Pi 4s on a complex application. Completely at random I stop receiving buffers and it hangs indefinitely until the camera is closed and restarted or the system is rebooted. I can now handle it gracefully via a watchdog thread, but it is quite concerning. I am not seeing any dmesg logs (likely because a ramdisk is in use for logging to save SD cards) and I get no errors in the program when this happens.
I can also confirm that I'm encountering the same issue: a Pi 4 with 4GB RAM running an up-to-date Bookworm OS. In my application I periodically capture video footage when sensor input is detected on a GPIO line. When the bug occurs, I do not get any error message and the program appears to be running as it should, but the recorded video files are zero-length. After forcing the program to quit, I get the same dmesg output that others have posted...
[ 9970.424030] vc_sm_cma_vchi_rx_ack: received response 850337, throw away...
[59407.677412] vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512
[59407.677432] bcm2835_mmal_vchiq: vchiq_mmal_submit_buffer: vc_sm_import_dmabuf_fd failed, ret -512
[59407.677437] bcm2835-codec bcm2835-codec: device_run: Failed submitting ip buffer
[59407.696712] ------------[ cut here ]------------
[59407.696731] WARNING: CPU: 1 PID: 3746 at drivers/media/common/videobuf2/videobuf2-core.c:2024 __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.696775] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device cmac algif_hash aes_arm64 aes_generic algif_skcipher af_alg bnep hci_uart btbcm brcmfmac_wcc bluetooth brcmfmac brcmutil cfg80211 ov5647 ecdh_generic ecc bcm2835_unicam libaes rfkill v4l2_dv_timings v4l2_fwnode raspberrypi_hwmon v4l2_async binfmt_misc bcm2835_codec(C) bcm2835_v4l2(C) rpivid_hevc(C) bcm2835_isp(C) v4l2_mem2mem bcm2835_mmal_vchiq(C) videobuf2_vmalloc videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 raspberrypi_gpiomem videodev vc_sm_cma(C) videobuf2_common snd_bcm2835(C) mc nvmem_rmem uio_pdrv_genirq uio i2c_dev fuse dm_mod ip_tables x_tables ipv6 rtc_ds1307 spidev regmap_i2c vc4 snd_soc_hdmi_codec drm_display_helper cec v3d drm_dma_helper gpu_sched i2c_mux_pinctrl drm_shmem_helper i2c_mux drm_kms_helper i2c_brcmstb spi_bcm2835 drm i2c_bcm2835 drm_panel_orientation_quirks snd_soc_core snd_compress snd_pcm_dmaengine snd_pcm snd_timer snd backlight
[59407.696947] CPU: 1 PID: 3746 Comm: python3.11 Tainted: G C 6.6.31+rpt-rpi-v8 #1 Debian 1:6.6.31-1+rpt1
[59407.696955] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
[59407.696958] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[59407.696964] pc : __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.696980] lr : __vb2_queue_cancel+0x38/0x2a0 [videobuf2_common]
[59407.696994] sp : ffffffc0831f3a70
[59407.696997] x29: ffffffc0831f3a70 x28: ffffff80ba3887f8 x27: 0000000000000009
[59407.697006] x26: 0000000000000001 x25: 00000000f7b98094 x24: ffffff80ba388600
[59407.697015] x23: ffffff80401c92e0 x22: ffffff80bae57a98 x21: ffffff8041e3cd30
[59407.697024] x20: ffffff80bae57b40 x19: ffffff80bae57a98 x18: 0000000000000003
[59407.697032] x17: 0000000000000000 x16: ffffffd7102d4348 x15: ffffffc0831f35c0
[59407.697039] x14: 0000000000000004 x13: ffffff8043740028 x12: 0000000000000000
[59407.697047] x11: ffffff80b57462f8 x10: ffffff80b5746238 x9 : ffffffd71032baf8
[59407.697055] x8 : ffffffc0831f3970 x7 : 0000000000000000 x6 : 0000000000000228
[59407.697063] x5 : ffffff80459f4e40 x4 : fffffffe01167d20 x3 : 0000000080150013
[59407.697071] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000001
[59407.697079] Call trace:
[59407.697085] __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.697101] vb2_core_queue_release+0x2c/0x60 [videobuf2_common]
[59407.697115] vb2_queue_release+0x18/0x30 [videobuf2_v4l2]
[59407.697136] v4l2_m2m_ctx_release+0x30/0x50 [v4l2_mem2mem]
[59407.697164] bcm2835_codec_release+0x64/0x110 [bcm2835_codec]
[59407.697178] v4l2_release+0xec/0x100 [videodev]
[59407.697282] __fput+0xbc/0x288
[59407.697292] ____fput+0x18/0x30
[59407.697296] task_work_run+0x80/0xe0
[59407.697306] do_exit+0x30c/0x988
[59407.697311] do_group_exit+0x3c/0xa0
[59407.697315] get_signal+0x980/0x9b0
[59407.697320] do_notify_resume+0x318/0x1370
[59407.697325] el0_svc_compat+0x78/0x88
[59407.697335] el0t_32_sync_handler+0x98/0x140
[59407.697341] el0t_32_sync+0x194/0x198
[59407.697346] ---[ end trace 0000000000000000 ]---
[59407.697353] videobuf2_common: driver bug: stop_streaming operation is leaving buf 00000000680707d2 in active state
So far, I have not been able to trigger the bug with a minimal version that features just the video capture. But the bug is intermittent, so maybe I just haven't run the test code long enough for it to crop up. I will keep testing.
@djhanove, are you able to share how you handle it using the watchdog thread?
@dmunnet something like this, I'll let you figure out the rest
import time
from picamera2 import Picamera2
import threading
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class CameraManager:
    def __init__(self):
        self.picam2 = Picamera2()
        capture_config = self.picam2.create_video_configuration(
            main={"size": (640, 480), "format": "RGB888"},
            raw={"size": self.picam2.sensor_resolution},
            buffer_count=6,
            queue=True
        )
        self.picam2.configure(capture_config)
        self.picam2.start(show_preview=False)

        self.heartbeat_lock = threading.Lock()
        self.last_heartbeat = time.time()

        self.watchdog_thread = threading.Thread(target=self._watchdog)
        self.watchdog_thread.start()
        self.camera_thread = threading.Thread(target=self._capture_continuous)
        self.camera_thread.start()

    def _watchdog(self, timeout=10):  # Timeout in seconds
        while True:
            with self.heartbeat_lock:
                last_heartbeat = self.last_heartbeat
            if time.time() - last_heartbeat > timeout:
                logger.error("Camera capture thread unresponsive, attempting to reset.")
                # Actual recovery (close and re-create the camera) left as an exercise.
            time.sleep(timeout)

    def _capture_continuous(self):
        while True:
            try:
                frame = self.picam2.capture_array()
                self.signal_heartbeat()  # Signal that the thread is alive
            except Exception as e:
                logger.error(f"Failed to capture frame: {e}")
                break
            time.sleep(1 / 25)

    def signal_heartbeat(self):
        with self.heartbeat_lock:
            self.last_heartbeat = time.time()


if __name__ == "__main__":
    camera_manager = CameraManager()
    while True:
        time.sleep(1)
@djhanove, thanks for this! If I learn more about this issue in the course of my testing, I'll post an update.
I can also confirm I've seen what appears to be the same problem. While developing updated Python code using Picamera2 to support the Pi Camera Module 3 as part of the "My Naturewatch Camera Server" project, I ran into this issue. I'm using a Pi Zero 2W and ended up testing on two separate Pi Zero 2Ws, both running identical code, and both have the problem. Interestingly, I tested the same code using an older Pi Camera module and couldn't reproduce the problem.
I tried to find the simplest code that would reproduce the problem, but given that the failure can take anywhere between 1-7 days on average to occur, it's not been easy to say for sure whether the problem is present in test code or not. The app I've been developing also uses the H.264 encoder with a circular buffer to write out to a file once motion has been detected. Usually when the problem occurs I see "imported vc_sm_cma_get_buffer failed -512" written in the dmesg log, but often only after I've restarted the Python process (the date stamp still shows the time the application stopped functioning). If I run "libcamera-hello --list-devices" once the system has entered the failed state, it hangs indefinitely. I ended up creating a watchdog to restart the Python process as a workaround. There's clearly a problem, and I'd come across the same issues that others have linked to above, but in each case those seemed to be either related to a few kernel versions with a separate problem, or caused by faulty SD cards.
I shall keep an eye on this thread with interest as it would be good to resolve this issue properly.
@caracoluk I have around 55 3B+'s with V1 camera modules where this doesn't happen (obviously older kernel and firmware). I didn't think it could be the camera module version, but you may be onto something. We swapped to Pi 4's and the V3 camera module at the same time, and it is only happening on our newer builds.
I did try lowering the frame rate to 15fps (from 25fps), thinking that the issue might be related to the increased resources required handling the larger frames from the Camera module 3. That didn't seem to make any difference. I enabled debugging using the "vclog -m" and "vclog -a" commands and nothing at all was shown in these logs when the problem occurs. The only logging I've found at all is in dmesg and from some of the other threads it seems like these are just symptoms of the problem and offer little insight into the underlying cause. I realise it's not going to be easy for the Pi developers to resolve without them having a means of easily/quickly reproducing the problem.
@caracoluk I ran a max stress test with 4 threads for days at a time to try to induce it and also did not see increased frequency.
Agreed, not an easy one to debug.
I can confirm that. I also tried maximum CPU stress with no difference. As for memory, I have 2/3 of CMA and half of system memory free while the application is running.
Looking back at my alert logs, I did have some Pi 4Bs with V1 camera modules that also exhibited this, so it is not exclusive to the V3 module.
@djhanove that's useful to know. In that case this problem is likely to be affecting quite a few people. Did you notice whether the pi's with the V1 camera modules had been running for a lot longer than those with the V3 modules before they failed?
Didn't seem to make a difference for me. I just swapped some back to the V1 camera yesterday and had 2 lockups in 24 hours on one device.
What I have noticed is that I get the lockups more frequently if the Pi's are running at a higher temperature. Over the warmer weeks (when the CPU temperature was showing temperatures of around 72-80C) I'd see the lockups on average once every 2 days. Now it's a bit cooler, the CPU temperature is showing 65-70C, and haven't had a lockup on either of my Pi's for the last 5 days so far. That might explain why when I was trying to create the shortest segment of code to reproduce the problem I was failing to do so. Not because the problem was no longer present, but because it would take longer to occur. I see that some of you have run CPU stress tests to try and cause the problem, and that should have pushed the temperature up a fair bit, so it's difficult to say if this is just a coincidence or not.
@caracoluk all of my devices have heat sinks on them which keeps the temps in the 60-65C range even under full load.
I am experiencing the same issue on a CM4 running up to date bookworm. It never showed these errors while running the same picamera2 apps on bullseye. I have a raspberry pi camera module 3 on one port and an ov5647 on the other. The same fault occurs when running with only the camera module 3, although less frequently.
Headless CM4 running 6.6.31-1+rpt1 (2024-05-29) aarch64. Fresh install then sudo apt update, sudo apt upgrade.
Official Raspberry pi 5 power supply 25W. vcgencmd get_throttled always gives 0x0.
Both camera cables short - about 15cm. Possible source of radio interference from WiFi antenna connected to the CM4 running hostapd for local WiFi hotspot.
I'm working on a minimal program that exhibits the fault. I am also trying to see if the WiFi hotspot makes any difference (so far no obvious difference). I will also try more cooling on the CM4.
The program is based on the very wonderful picamera2 examples. Combination of mjpeg_server, capture_circular, capture_video_multiple together with opencv object detection. Most of the time each instance uses less than 30% of 1 cpu, so top rarely shows less than 80% idle. In normal use no swap is used. 1400MBytes available on 2GByte CM4. Normally there are no clients for the http mjpeg servers - and no correlation with the occurrence of the fault.
Every few days one or other of the picamera2 processes hangs - no frames received. I use systemd notify with a 10sec watchdog to kill and relaunch. Most of the time after a hang, systemd managed to relaunch successfully. Sometimes it throws kernel errors as reported above.
vc_sm_cma_vchi_rx_ack: received response 78529294, throw away...
Once or twice both processes have stalled at the same time (both receiving no frames) and that seems to lead to a disaster. Systemd can't kill either process, so stale zombies pile up and eventually the 2GByte swapfile is all used up. The only way out is to reboot.
I'll keep you posted.
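For anyone wanting to replicate that workaround, here is a minimal sketch of the systemd watchdog arrangement described above; it is illustrative, not the actual service, and the unit directives it assumes are shown in the comments. The key point is that WATCHDOG=1 is only sent while frames are actually arriving, so a hung capture causes systemd to kill and relaunch the process.

# Minimal sketch of the systemd sd_notify heartbeat described above (illustrative).
# Assumes a service unit containing roughly:
#   [Service]
#   Type=notify
#   WatchdogSec=10
#   Restart=on-failure
import os
import socket
import time

from picamera2 import Picamera2


def sd_notify(message):
    # Send a notification to systemd via the NOTIFY_SOCKET datagram socket.
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return  # not running under systemd; heartbeat becomes a no-op
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract socket namespace
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(path)
        sock.send(message.encode())


picam2 = Picamera2()
picam2.configure(picam2.create_video_configuration())
picam2.start()
sd_notify("READY=1")

while True:
    picam2.capture_array()   # blocks waiting for a frame; never returns when the bug hits
    sd_notify("WATCHDOG=1")  # only pet the watchdog while frames are flowing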
I have the same issue.
I am using two cameras for MJPEG live streaming. The problem did not occur with Bullseye and has been happening since the migration to Bookworm.
The lockups seem to happen within half a day to a day. I restart the script every hour as a workaround.
I have obtained the kernel error logs and have uploaded them to the following gist URL. https://gist.github.com/Akkiesoft/4400a32e4488bf58b7ce891781019399 https://gist.github.com/Akkiesoft/95e9cfcd24023dd9b3b8009fc07a472c
Actually I am finding it hard to narrow this down.
I can rule out the possible interference of the WiFi hotspot - the errors turn up the same with it switched off.
The frame rate makes a big difference. With both processes running at 20 frames per second, the errors come every few hours on both cameras, and the disaster (both processes hanging and failing to be killed, so the system runs out of memory) happens after a few days. With one process running at 10 frames per second and the other at 20, the errors are less frequent, and with both processes at 10 frames per second the errors are much less frequent (and so far no disasters).
Still working on it...
@lowflyerUK I find that my app runs between 2-10 days without failing with the frame rate set to 20fps and a resolution of 1920x1080 using the camera module 3 on a Pi Zero 2W. I have two duplicate sets of hardware and both seem to exhibit the problem to the same degree. I did notice that on hot days the lock ups occur more frequently as I'm only using a heatsink with no active cooling. The reported CPU temperature has gone above 80C on hot days which seems to reduce the mean time between failures for me.
@caracoluk Thanks for your comment. The cpu in this CM4 does run rather warm - between 70-80deg, although it runs normally with 90% idle. Last night I added a permanent stress process that uses 100% of one core. This pushed the temperature to around 83deg so the cpu was occasionally throttled, but no errors for 4 hours - so no dramatic difference (but only for 4 hours). I'll see if I can think of a way to improve the cooling.
Update... I tried with 2 cores stressed. It failed within 2 hours on both camera processes, leading to my previously mentioned disaster, where defunct processes fill up the memory. So temperature could be a contributory factor for me, and/or possibly my picamera2 process fails if the cpu is busy.
@lowflyerUK if we could find a way to reliably reproduce the problem within as short a time frame as possible it would make it easier for the Pi developers to investigate. Ideally we'd need the sample code to be as simple as possible as I see from other threads that is usually their first request when looking into a problem.
@caracoluk Thanks for your encouragement! Yes, that is exactly what I am trying to do.
Hi,
It looks like we are dealing with a similar issue in our custom app (https://forums.raspberrypi.com/viewtopic.php?t=359992 ), and we are still debugging. For us it seems that the response received is a repetition of an old message received 128 messages earlier. This corresponds to the depth of the vchiq_slot_info[] array.
@naroin Many thanks for pointing out that thread. My errors do indeed seem remarkably similar. Typically I get:
Aug 30 02:01:53 nestboxcam3 systemd[1]: gardencam.service: Watchdog timeout (limit 10s)!
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_vchi_rx_ack: received response 9759796, throw away...
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512
after which systemd is able to relaunch the app.
Fully up to date 64 bit Raspberry Pi OS on CM4 with 2GBytes RAM and 2GBytes swap. 2 separate picamera2 processes: One (ov5647) with 2 streams at 10 frames per sec: 320x240 with MJPEGEncoder and 1296x972 with H264Encoder. The other (imx708) with 2 streams at 10 frames per sec: 384x216 with MJPEGEncoder and 1536x864 with H264Encoder.
Most of the time top shows around 80% idle.
If my sums are right, that adds up to around 27 MPix/sec across all 4 encodes. So less than half of 1080p30, and about an eighth of the 220 MPix/sec that @6by9 told us was the rated maximum for the ISP.
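For what it's worth, the arithmetic checks out (a quick sanity check using the stream sizes quoted above):

# Sanity check of the ~27 MPix/s figure for the four 10 fps encodes listed above.
streams = [(320, 240), (1296, 972), (384, 216), (1536, 864)]
fps = 10
total = sum(w * h for w, h in streams) * fps
print(total / 1e6)  # ~27.5 MPix/s, vs ~62 for 1080p30 and the 220 MPix/s ISP rating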
Maybe the CM4 can't reliably encode 4 outputs at once, even at 10 frames per sec? Should I try a Raspberry Pi 5?
I haven't found simple sample code that replicates my issue, so I am inclined to feel that the ISP rate limit is the cause. In my case I think I can make a workaround by only issuing frames to the MJPEGEncoder when a client is actually watching the stream. As I am the only client and I hardly ever watch the realtime stream, the probability of failure will be a lot lower. This obviously won't be a solution for everybody.
Interestingly, my two Pi Zero 2Ws haven't had a lockup in the last 3 weeks, whereas they would previously run for 2-5 days before it happened. There have been no software updates on either of them and I've not changed the code in any way, just left them running in the same positions. The only thing I'm aware of that has changed is the temperature, as it has been quite a bit cooler recently. I remember reading somewhere that the Pi starts to throttle the CPU if the temperature rises above 80C, and I was seeing temperatures reach this level. Perhaps the CPU throttling makes it more likely for this issue to occur?
I don't think this issue is isolated to picamera2. I'm having the exact same issue with rpicam-vid that I've detailed here. https://forums.raspberrypi.com/viewtopic.php?t=376766
I'd like to try and reproduce this. Does anyone have reasonably simple code that I can run on the latest Raspberry Pi OS on a Pi 4 (and preferably avoiding mediamtx because I know nothing about it) that is likely to provoke the problem? Thanks.
I think no one here has been able to reliably provoke the problem. The issue is still present for me across a massive fleet of Pi 4s, with no way to predictably reproduce it. I have to believe that others are facing the same challenges based on the dialogue here. I have a thread running picam to just capture frames, but my program is also running several other tasks in parallel.
Hi @davidplowman! Thanks!! I have only ever seen the issue on a CM4 with 2 cameras running. It might be quicker for you to try on a CM4?
For me, it is triggered after a few hours with 2 of these running, with sizes adjusted for the cameras you have: capTest1-0.txt
Hope this helps!
All the best!
OK, thanks. Had to mess about with some of the file names, but hopefully not with any material effect. It's running now so I'll leave it and see if anything happens. Is there any obvious sign if things go wrong, or do I need to keep an eye on dmesg?
I've been keeping a close eye on my two Pi Zero 2Ws which were both exhibiting the problem every few days between June and August. I've updated the firmware on both, and I haven't had a problem on either since I did this. One was upgraded a month ago and is running 6.6.51-v8+. The other was upgraded mid-August and is running 6.6.44-v8+. Neither has failed at all since the upgrade and I have not made any code changes in that time. However, I did notice that my Pi Zeros would fail more regularly on the hot summer days when the CPU was in the high 70Cs. It has been cooler over the last couple of months with the CPUs running in the high 60Cs, so I do wonder if this is making a difference. Nothing else has changed at all in this time.
Hi, we've activated dynamic debug traces with
echo 'file vc_sm.c +p' > /sys/kernel/debug/dynamic_debug/control
in order to follow buffer messaging.
It seems that a buffer is lost sometimes between the videocore and the kernel.
Pinging @6by9 if he has any possible insights into this.
@naroin can you post some logs (with the vcsm debug enabled) when you hit the problem?
@lowflyerUK I'm running the latest 6.6.51 kernel, have you tried the script you sent me with this kernel? (Just wanting to check the bug is still present!)
I have tried my script with 6.6.56-v8+ kernel (installed by rpi-update), bug still there.
Ok, hands up! The last time I tried it was on 6.6.47+rpt-rpi-v8. I can't update it for a couple of weeks I am afraid. Sorry!
Yes, we can share one example but we have modified the kernel logs to try to investigate the issue...
7,920587,527773748,-;vc_sm_cma_free: dmabuf 3af4fc43, buffer c3ea9700
7,920588,527773939,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf a5075a10/fd -1
7,920589,527774099,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d9d00000, size 3133440, id 39.
7,920590,527774222,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102259)
7,920591,527774240,-;[vc_sm_add_resource]: added buffer 6dfe0219 (name sm-host-resource, size 3133440)
7,920592,527774248,-;[vc_sm_cma_import_dmabuf_internal]: buffer 6dfe0219 (name sm-host-resource, size 3133440), imported 4294967295 vc_handle c000002e
7,920593,527774258,-;vc_sm_cma_import_dmabuf: dma_buf a5075a10/fd -1 imported to ptr 9a7fd358
7,920594,527774312,-;vc_sm_dma_buf_release dmabuf 3af4fc43, buffer 4d404806
7,920595,527774322,-;vc_sm_vpu_free: mem_handle c000002d
7,920596,527774335,-;vc_sm_dma_buf_release vpu_free done
7,920597,527774341,-;vc_sm_clean_up_dmabuf: buffer 4d404806, imported 4294967295
7,920598,527774563,-;vc_sm_dma_buf_release clean_up dmabuf done
7,920599,527774570,-;[vc_sm_release_resource]: buffer 4d404806 (name sm-host-resource, size 3133440), imported 4294967295
7,920600,527774578,-;[vc_sm_release_resource]: Waiting for VPU unmap response for buffer 4d404806 vc_handle c000002d
7,920601,527774585,-;vc_sm_dma_buf_release done
7,920602,527774586,-;vc_sm_vpu_event: Released addr d7600000, size 3133440, id 00000026, mem_handle c000002d
7,920603,527774598,-;[vc_sm_release_resource]: buffer 4d404806 (name sm-host-resource, size 3133440), imported 4294967295
7,920604,527774607,-;vc_sm_release_resource: Release our allocation - done
7,920605,527778251,-;vc_sm_cma_free: dmabuf b0254adc, buffer c3f1de80
7,920606,527778580,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf 35385238/fd -1
7,920607,527778838,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d5f00000, size 2088960, id 38.
7,920608,527779293,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102261)
7,920609,527779393,-;[vc_sm_add_resource]: added buffer 2ab20279 (name sm-host-resource, size 2088960)
7,920610,527779405,-;[vc_sm_cma_import_dmabuf_internal]: buffer 2ab20279 (name sm-host-resource, size 2088960), imported 4294967295 vc_handle c000002d
7,920611,527779415,-;vc_sm_cma_import_dmabuf: dma_buf 35385238/fd -1 imported to ptr 6db0a007
7,920612,527779704,-;vc_sm_dma_buf_release dmabuf b0254adc, buffer 5df0e4a8
7,920613,527779716,-;vc_sm_vpu_free: mem_handle c0000023
7,920614,527779728,-;vc_sm_dma_buf_release vpu_free done
7,920615,527779740,-;vc_sm_clean_up_dmabuf: buffer 5df0e4a8, imported 4294967295
7,920616,527780187,-;vc_sm_dma_buf_release clean_up dmabuf done
7,920617,527780207,-;[vc_sm_release_resource]: buffer 5df0e4a8 (name sm-host-resource, size 2088960), imported 4294967295
7,920618,527780221,-;[vc_sm_release_resource]: Waiting for VPU unmap response for buffer 5df0e4a8 vc_handle c0000023
7,920619,527780235,-;vc_sm_dma_buf_release done
...
7,921721,528330015,-;vc_sm_release_resource: Release our allocation - done
7,921722,528330052,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf 7270b242/fd -1
7,921723,528330259,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d3600000, size 3133440, id 38.
7,921724,528330457,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102385)
7,921725,528330510,-;[vc_sm_add_resource]: added buffer 05d2ba0b (name sm-host-resource, size 3133440)
7,921726,528330519,-;[vc_sm_cma_import_dmabuf_internal]: buffer 05d2ba0b (name sm-host-resource, size 3133440), imported 4294967295 vc_handle c000002d
7,921727,528330528,-;vc_sm_cma_import_dmabuf: dma_buf 7270b242/fd -1 imported to ptr 0a6c9f91
7,921728,528330663,-;vc_sm_dma_buf_release dmabuf 9a7fd358, buffer 9f498656
7,921729,528330678,-;vc_sm_vpu_free: mem_handle c000002b
7,921730,528330693,-;vc_sm_dma_buf_release vpu_free done
7,921731,528330700,-;vc_sm_clean_up_dmabuf: buffer 9f498656, imported 4294967295
7,921732,528331129,-;vc_sm_dma_buf_release clean_up dmabuf done
7,921733,528331136,-;[vc_sm_release_resource]: buffer 9f498656 (name sm-host-resource, size 3133440), imported 4294967295
7,921734,528331145,-;[vc_sm_release_resource]: Waiting for VPU unmap response for buffer 9f498656 vc_handle c000002b
7,921735,528331155,-;vc_sm_dma_buf_release done
7,921736,528331175,-;vc_sm_vpu_event: Released addr d2a00000, size 3133440, id 00000025, mem_handle c000002b
7,921737,528331191,-;[vc_sm_release_resource]: buffer 9f498656 (name sm-host-resource, size 3133440), imported 4294967295
7,921738,528331203,-;vc_sm_release_resource: Release our allocation - done
7,921739,528343688,-;vc_sm_cma_free: dmabuf 3af4fc43, buffer c3ea9a80
7,921740,528343885,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf d756d9a7/fd -1
7,921741,528344062,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d6d00000, size 3133440, id 37.
7,921742,528345597,-;vc_sm_cma_free: dmabuf b0254adc, buffer c3f1df00
7,921743,528346155,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf 56e2da0e/fd -1
7,921744,528346400,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d5b00000, size 2088960, id 40.
3,921745,528346680,-;vc_sm_cma_vchi_rx_ack: received response 102259, throw away...
7,921746,528346829,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102388)
7,921747,528353952,-;[vc_sm_add_resource]: added buffer 5df0e4a8 (name sm-host-resource, size 2088960)
7,921748,528353972,-;[vc_sm_cma_import_dmabuf_internal]: buffer 5df0e4a8 (name sm-host-resource, size 2088960), imported 4294967295 vc_handle c0000023
7,921749,528353987,-;vc_sm_cma_import_dmabuf: dma_buf 56e2da0e/fd -1 imported to ptr 6db0a007
7,921750,528354112,-;vc_sm_dma_buf_release dmabuf b0254adc, buffer e456977f
7,921751,528354126,-;vc_sm_vpu_free: mem_handle c000002e
7,921752,528354138,-;vc_sm_dma_buf_release vpu_free done
7,921753,528354146,-;vc_sm_clean_up_dmabuf: buffer e456977f, imported 4294967295
In this example the duplicated response is: vc_sm_cma_vchi_rx_ack: received response 102259, throw away... It seems to appear when two buffers are requested within a short period of time.
I think I've managed to reproduce this now, and I see the same vc_sm_cma_vchi_rx_ack: received response 127885, throw away...
I have no idea if this message is meant to be fatal, but things seem to die very soon after.
[ 1161.598289] vc_sm_cma_free: handle 00000000363ec871/dmabuf 00000000363ec871
[ 1161.598572] vc_sm_cma_import_dmabuf_internal: importing dma_buf 000000006f5ccf39/fd -1
[ 1161.598732] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d0200000, size 3112960.
[ 1161.598791] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.598810] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.598832] vc_sm_cma_vchi_rx_ack: received response 127885, throw away...
[ 1161.598914] [vc_sm_add_resource]: added buffer 0000000009e567f0 (name sm-host-resource, size 32768)
[ 1161.598924] vc_sm_cma_import_dmabuf: imported to ptr 00000000b0f09765
[ 1161.599015] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001a, mem_handle c0000020
[ 1161.599029] [vc_sm_release_resource]: buffer 0000000072e4c532 (name sm-host-resource, size 32768), imported 1
[ 1161.599036] vc_sm_release_resource: Release our allocation - done
[ 1161.599047] vc_sm_cma_free: handle 00000000b0f09765/dmabuf 00000000b0f09765
[ 1161.599088] vc_sm_dma_buf_release dmabuf 00000000b0f09765, buffer 0000000009e567f0
[ 1161.599097] vc_sm_dma_buf_release vpu_free done
[ 1161.599107] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.599110] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.599115] [vc_sm_release_resource]: Waiting for VPU unmap response on 0000000009e567f0
[ 1161.599119] vc_sm_dma_buf_release done
[ 1161.631760] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.631828] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.631977] [vc_sm_add_resource]: added buffer 000000003a9099b5 (name sm-host-resource, size 32768)
[ 1161.631989] vc_sm_cma_import_dmabuf: imported to ptr 00000000c5b2d219
[ 1161.632083] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001d, mem_handle c0000010
[ 1161.632097] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.632105] vc_sm_release_resource: Release our allocation - done
[ 1161.632118] vc_sm_cma_free: handle 00000000c5b2d219/dmabuf 00000000c5b2d219
[ 1161.632165] vc_sm_dma_buf_release dmabuf 00000000c5b2d219, buffer 000000003a9099b5
[ 1161.632176] vc_sm_dma_buf_release vpu_free done
[ 1161.632190] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.632193] [vc_sm_release_resource]: buffer 000000003a9099b5 (name sm-host-resource, size 32768), imported 1
[ 1161.632197] [vc_sm_release_resource]: Waiting for VPU unmap response on 000000003a9099b5
[ 1161.632203] vc_sm_dma_buf_release done
[ 1161.663490] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.663534] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.663649] [vc_sm_add_resource]: added buffer 0000000072e4c532 (name sm-host-resource, size 32768)
[ 1161.663657] vc_sm_cma_import_dmabuf: imported to ptr 00000000b0f09765
[ 1161.663733] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001a, mem_handle c0000020
[ 1161.663744] [vc_sm_release_resource]: buffer 000000003a9099b5 (name sm-host-resource, size 32768), imported 1
[ 1161.663750] vc_sm_release_resource: Release our allocation - done
[ 1161.663761] vc_sm_cma_free: handle 00000000b0f09765/dmabuf 00000000b0f09765
[ 1161.663794] vc_sm_dma_buf_release dmabuf 00000000b0f09765, buffer 0000000072e4c532
[ 1161.663803] vc_sm_dma_buf_release vpu_free done
[ 1161.663812] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.663816] [vc_sm_release_resource]: buffer 0000000072e4c532 (name sm-host-resource, size 32768), imported 1
[ 1161.663821] [vc_sm_release_resource]: Waiting for VPU unmap response on 0000000072e4c532
[ 1161.663827] vc_sm_dma_buf_release done
[ 1161.696589] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.696636] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.696749] [vc_sm_add_resource]: added buffer 0000000009e567f0 (name sm-host-resource, size 32768)
[ 1161.696756] vc_sm_cma_import_dmabuf: imported to ptr 00000000c5b2d219
[ 1161.696832] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001d, mem_handle c0000010
[ 1161.696843] [vc_sm_release_resource]: buffer 0000000072e4c532 (name sm-host-resource, size 32768), imported 1
[ 1161.696850] vc_sm_release_resource: Release our allocation - done
[ 1161.696860] vc_sm_cma_free: handle 00000000c5b2d219/dmabuf 00000000c5b2d219
[ 1161.696900] vc_sm_dma_buf_release dmabuf 00000000c5b2d219, buffer 0000000009e567f0
[ 1161.696915] vc_sm_dma_buf_release vpu_free done
[ 1161.696934] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.696936] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.696942] [vc_sm_release_resource]: Waiting for VPU unmap response on 0000000009e567f0
[ 1161.696947] vc_sm_dma_buf_release done
[ 1161.729894] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.729936] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.730103] [vc_sm_add_resource]: added buffer 000000003a9099b5 (name sm-host-resource, size 32768)
[ 1161.730115] vc_sm_cma_import_dmabuf: imported to ptr 00000000b0f09765
[ 1161.730618] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001a, mem_handle c0000020
[ 1161.730631] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.730638] vc_sm_release_resource: Release our allocation - done
[ 1161.730654] vc_sm_cma_free: handle 00000000b0f09765/dmabuf 00000000b0f09765
[ 1161.730694] vc_sm_dma_buf_release dmabuf 00000000b0f09765, buffer 000000003a9099b5
[ 1161.730708] vc_sm_dma_buf_release vpu_free done
[ 1161.730735] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.730738] [vc_sm_release_resource]: buffer 000000003a9099b5 (name sm-host-resource, size 32768), imported 1
[ 1161.730744] [vc_sm_release_resource]: Waiting for VPU unmap response on 000000003a9099b5
[ 1161.730749] vc_sm_dma_buf_release done
I'm slightly surprised that you're mapping and unmapping as many buffers as that. Ideally you want the same number of buffers on the output of libcamera as being fed to the encoder, so the buffers get mapped once and then the mappings retained for the duration. Mapping and unmapping every time will have a performance impact.
The dmabuf is meant to be held on to until the VPU has responded saying that it has unmapped it. The log you're seeing means that we've got a response for a message that isn't expected, and I don't know how you've managed that.
Actually the size of the allocation is only 32kB. That's not an image buffer, so what is it? Lens shading table?
const uint32_t MaxLsGridSize = 0x8000;
The LS table is indeed 32k.
By mapping/unmapping do you mean calling mmap/munmap from userland? That only happens once per configure() cycle in the IPA. Perhaps the 32k LS table size is a coincidence and it's actually another buffer? If it was the LS table, I expect we can reproduce this without running the encoder.
However, this happens on every frame in the kernel driver ctrl handler:
case V4L2_CID_USER_BCM2835_ISP_LENS_SHADING:
{
    struct bcm2835_isp_lens_shading *v4l2_ls;
    struct mmal_parameter_lens_shading_v2 ls;
    struct dma_buf *dmabuf;
    void *vcsm_handle;

    v4l2_ls = (struct bcm2835_isp_lens_shading *)ctrl->p_new.p_u8;
    /*
     * struct bcm2835_isp_lens_shading and struct
     * mmal_parameter_lens_shading_v2 match so that we can do a
     * simple memcpy here.
     * Only the dmabuf to the actual table needs any manipulation.
     */
    memcpy(&ls, v4l2_ls, sizeof(ls));
    dmabuf = dma_buf_get(v4l2_ls->dmabuf);
    if (IS_ERR_OR_NULL(dmabuf))
        return -EINVAL;

    ret = vc_sm_cma_import_dmabuf(dmabuf, &vcsm_handle);
    if (ret) {
        dma_buf_put(dmabuf);
        return -EINVAL;
    }

    ls.mem_handle_table = vc_sm_cma_int_handle(vcsm_handle);
    if (ls.mem_handle_table)
        /* The VPU will take a reference on the vcsm handle,
         * which in turn will retain a reference on the dmabuf.
         * This code can therefore safely release all
         * references to the buffer.
         */
        ret = set_isp_param(node,
                            MMAL_PARAMETER_LENS_SHADING_OVERRIDE,
                            &ls,
                            sizeof(ls));
    else
        ret = -EINVAL;

    vc_sm_cma_free(vcsm_handle);
    dma_buf_put(dmabuf);
    break;
}
Perhaps this is not right?
An experiment might be to change "rpi.alsc" in the tuning file to "x.rpi.alsc" (i.e. comment it out), and see if the problem goes away? (BTW, the Python script I was running overnight seems to be OK still.)
Perhaps it's better to do something like this in the isp driver: https://github.com/naushir/linux/commit/de2c0b348998fee9b8ce76c09c574b0c87fb6c92?
Completely untested!
The log you're seeing means that we've got a response for a message that isn't expected, and I don't know how you've managed that.
We do not know either how we've done that, but that unexpected message has a trans_id index equal to the expected (lost) message's trans_id index minus 128, and 128 matches the width of the vchiq_slot_info[] circular buffer. Please note that we added a local variable to get the vc_sm_cma_vchi_import() trans_id, as in our case and in naushir's case two vc_sm_cma_import_dmabuf_internal() calls occur at the same time:
@@ -721,6 +731,7 @@ vc_sm_cma_import_dmabuf_internal(struct vc_sm_privdata_t *private,
         struct sg_table *sgt = NULL;
         dma_addr_t dma_addr;
         u32 cache_alias;
+        u32 trans_id;
         int ret = 0;
         int status;
@@ -783,21 +794,23 @@ vc_sm_cma_import_dmabuf_internal(struct vc_sm_privdata_t *private,
                  __func__, import.name, import.type, &dma_addr, import.size);
         /* Allocate the videocore buffer. */
-        status = vc_sm_cma_vchi_import(sm_state->sm_handle, &import, &result,
-                                       &sm_state->int_trans_id);
+        status = vc_sm_cma_vchi_import(sm_state->sm_handle, &import, &result, &trans_id);
         if (status == -EINTR) {
                 pr_debug("[%s]: requesting import memory action restart (trans_id: %u)\n",
-                         __func__, sm_state->int_trans_id);
+                         __func__, trans_id);
                 ret = -ERESTARTSYS;
                 private->restart_sys = -EINTR;
                 private->int_action = VC_SM_MSG_TYPE_IMPORT;
+                private->int_trans_id = trans_id;
                 goto error;
         } else if (status || !result.res_handle) {
                 pr_debug("[%s]: failed to import memory on videocore (status: %u, trans_id: %u)\n",
-                         __func__, status, sm_state->int_trans_id);
+                         __func__, status, trans_id);
                 ret = -ENOMEM;
                 goto error;
         }
+        pr_debug("[%s]: requesting import memory (trans_id: %u)\n",
+                 __func__, trans_id);
         mutex_init(&buffer->lock);
         INIT_LIST_HEAD(&buffer->attachments);
Perhaps it's better to do something like this in the isp driver: naushir/linux@de2c0b3?
Completely untested!
Is libcamera resubmitting the same dmabuf every frame? The VPU is mapping the dmabuf, but I suspect that the set call makes it reprocess the table.
Your patch is close, but we actually want to store/compare the struct dma_buf * from dma_buf_get, otherwise a dup on the fd will give us a new fd for the same dmabuf.
I did wonder if we were running out of buffering in VCHI. With the ISP and encoding going on it is shovelling a fair number of commands around the place.
Yes, libcamera uses a single dmabuf handle for the ls table. I'll rework the change on Monday and give it some testing.
OK, thanks. Had to mess about with some of the file names, but hopefully not with any material effect. It's running now so I'll leave it and see if anything happens. Is there any obvious sign if things go wrong, or do I need to keep an eye on dmesg?
@davidplowman Sorry I missed this. Mine throws this, then one of the threads never gets any more frames.
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_vchi_rx_ack: received response 9759796, throw away...
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512
I think this script is even quicker at causing it. Yes, sorry about my filenames: capTest0.txt
I've got an updated fix for the LS dmabuf caching. I seem to have improved behavior on my system so far, but it would be nice to have it tested by other folks.
The change is here: https://github.com/raspberrypi/linux/pull/6429. Once the CI has run on the commit, you can pull the kernel with
sudo rpi-update pulls/6429
Running this currently on a unit that has frequently been rebooting -- will keep you posted after a few days. I have some stats on number of lockups prior to the kernel patch.
During the development of an application that streams continuously using picamera2, I cannot get rid of a bug that leads to a lockup: picamera2 stops receiving buffers after multiple hours of runtime.
I tried to reproduce this with a minimal example on a Pi 4 with a current Raspberry Pi OS, but without success yet. I therefore hesitated for a long time to file a bug report, but I have several clues and found multiple similar issues, so I decided to collect my findings here, hoping to receive some help or to help others seeing the same symptoms.
I know I cannot really ask for help until I have a working minimal example on a current RPi OS, but I am happy to collect more debugging information that could help track this down, or to find other impacted users who can share relevant findings.
To Reproduce
Minimal example
I created a minimal example which encodes the main and lores streams to H.264 and forwards them to ffmpeg outputs. ffmpeg sends them via RTSP to a mediamtx instance (https://github.com/bluenviron/mediamtx), which includes a webserver for opening the stream in a browser using WebRTC. After multiple hours, usually in the range of 10-20 h, the mediamtx logs show that the RTSP streams have stopped. While debugging, I could see that the picamera2 event loops only contained empty events from that point on.
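For context, a rough sketch of what this kind of minimal example can look like is below; the stream sizes, bitrates and RTSP paths are placeholders, the exact picamera2 calls for attaching two encoders may differ slightly between versions, and this is not the exact script used here.

# Rough sketch of the dual-stream setup described above (sizes, bitrates and
# RTSP paths are placeholders; assumes a mediamtx instance listening on :8554).
import time

from picamera2 import Picamera2
from picamera2.encoders import H264Encoder
from picamera2.outputs import FfmpegOutput

picam2 = Picamera2()
config = picam2.create_video_configuration(
    main={"size": (1920, 1080), "format": "YUV420"},
    lores={"size": (640, 360), "format": "YUV420"},
)
picam2.configure(config)

# One H.264 encoder per stream, each pushed to mediamtx by ffmpeg over RTSP.
picam2.start_encoder(H264Encoder(bitrate=4_000_000),
                     FfmpegOutput("-f rtsp rtsp://127.0.0.1:8554/main"),
                     name="main")
picam2.start_encoder(H264Encoder(bitrate=1_000_000),
                     FfmpegOutput("-f rtsp rtsp://127.0.0.1:8554/lores"),
                     name="lores")
picam2.start()

while True:
    time.sleep(1)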
Context
This minimal example is a reduced version of my more complex application, which is running on a custom yocto build on a cm4 (currently on the cm4io board) with pi camera v3.
Currently, I can not reproduce the bug on a pi4 with this minimal example. But I can reproduce the bug with
Although I cannot prove it, I think the bug is present in Raspberry Pi OS too, but subtle differences lead to a much longer time to failure than on my custom build. During the last months, I observed that multiple factors changed the time to failure:
On my custom build, I try to follow Raspberry Pi OS versions of related software as close as possible, currently:
Symptoms
As described, when the failure happens, the outputs stop outputting frames. picamera2 stops supplying raw frames to the encoders.
dmesg
In most cases, dmesg shows that some kind of V4L2 frame polling method seems to block forever. The same message appears twice at the same time for two different Python tasks, presumably one for each stream.
Sometimes, but more rarely, there is nothing in dmesg.
Sometimes I see a
load
Looking at top, I saw a load of 2.00, but no processes showing more than a few % of CPU. A quick search pointed to IO workload as a possible cause, perhaps the never-returning V4L2 poll.
Resolving the situation
After killing and restarting the application, it works again for a few hours. A reboot does not change anything.
Related issues
Lots of research led me to related issues, which perhaps have their origin in the same hard-to-track problem.
#1086 finally made me open this issue.