raspberrypi / picamera2

New libcamera based python library

[BUG] picamera2 locks up after hours, receiving no more buffers #1090

Open nzottmann opened 1 month ago

nzottmann commented 1 month ago

During the development of an application that streams continuously using picamera2, I have not been able to get rid of a bug that locks up picamera2 so that it stops receiving buffers after several hours of runtime.

I tried to reproduce this with a minimal example on a Pi 4 with a current Raspberry Pi OS, but without success so far. I therefore hesitated for a long time to file a bug report, but I have several indications and have found multiple similar issues, so I decided to collect my findings here, perhaps to receive some help or to help others seeing the same symptoms.

I know I cannot really ask for help until I have a working minimal example on a current RPi OS, but I am happy to collect more debugging information that could help track this down or find other affected users who can share relevant findings.

To Reproduce

Minimal example

I created a minimal example which encodes the main and lores streams to h264 and forwards them to ffmpeg outputs. ffmpeg sends them via rtsp to a https://github.com/bluenviron/mediamtx instance, which includes a webserver to open the stream in a browser using webrtc. After multiple hours, usually in the range of 10-20h, the mediamtx logs show that the rtsp streams stopped. While debugging, I could see that the picamera2 event loops only contained empty events from this point on.

from picamera2.encoders import H264Encoder
from picamera2.outputs import FfmpegOutput
from picamera2 import Picamera2
import time

picam2 = Picamera2()

video_config = picam2.create_video_configuration(
    main={"size": (1920, 1080), "format": "RGB888"},
    lores={"size": (1280, 720), "format": "YUV420"},
    raw={"size": (2304, 1296)},
    controls={"FrameDurationLimits": (int(1e6/24), int(1e6/24))},
    display=None,
    use_case="video"
)

picam2.align_configuration(video_config)
picam2.configure(video_config)

encoder1 = H264Encoder(bitrate=10000000, framerate=24, enable_sps_framerate=True)
output1 = FfmpegOutput("-r 24 -f rtsp rtsp://127.0.0.1:8554/stream1")
picam2.start_recording(encoder1, output1, name="main")

encoder2 = H264Encoder(bitrate=10000000, framerate=24, enable_sps_framerate=True)
output2 = FfmpegOutput("-r 24 -f rtsp rtsp://127.0.0.1:8554/stream2")
picam2.start_recording(encoder2, output2, name="lores")

while True:
    time.sleep(60)

picam2.stop_recording()

Context

This minimal example is a reduced version of my more complex application, which is running on a custom yocto build on a cm4 (currently on the cm4io board) with pi camera v3.

Currently, I cannot reproduce the bug on a Pi 4 with this minimal example. But I can reproduce the bug with:

Although I cannot prove it, I think the bug is present in Raspberry Pi OS too, but subtle differences lead to a much longer time to failure than on my custom build. During the last months, I observed that multiple factors changed the time to fail:

On my custom build, I try to follow the Raspberry Pi OS versions of related software as closely as possible, currently:

Symptoms

As described, when the failure happens, the outputs stop delivering frames: picamera2 stops supplying raw frames to the encoders.

dmesg

In most cases, dmesg shows that some kind of v4l2 frame polling method seems to block forever. The same message appears twice at the same time for two different python tasks, presumably one per stream.
Sometimes, but more rarely, there is nothing in dmesg.

[Fri Aug  9 08:32:04 2024] vc_sm_cma_vchi_rx_ack: received response 18856330, throw away...
[Fri Aug  9 08:35:29 2024] INFO: task python3:265886 blocked for more than 120 seconds.
[Fri Aug  9 08:35:29 2024]       Tainted: G         C         6.6.40-v8 #1
[Fri Aug  9 08:35:29 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Aug  9 08:35:29 2024] task:python3         state:D stack:0     pid:265886 ppid:1      flags:0x0000020c
[Fri Aug  9 08:35:29 2024] Call trace:
[Fri Aug  9 08:35:29 2024]  __switch_to+0xe0/0x160
[Fri Aug  9 08:35:29 2024]  __schedule+0x37c/0xd68
[Fri Aug  9 08:35:29 2024]  schedule+0x64/0x108
[Fri Aug  9 08:35:29 2024]  schedule_preempt_disabled+0x2c/0x50
[Fri Aug  9 08:35:29 2024]  __mutex_lock.constprop.0+0x2f4/0x5a8
[Fri Aug  9 08:35:29 2024]  __mutex_lock_slowpath+0x1c/0x30
[Fri Aug  9 08:35:29 2024]  mutex_lock+0x50/0x68
[Fri Aug  9 08:35:29 2024]  v4l2_m2m_fop_poll+0x38/0x80 [v4l2_mem2mem]
[Fri Aug  9 08:35:29 2024]  v4l2_poll+0x54/0xc8 [videodev]
[Fri Aug  9 08:35:29 2024]  do_sys_poll+0x2b0/0x5c8
[Fri Aug  9 08:35:29 2024]  __arm64_sys_ppoll+0xb4/0x148
[Fri Aug  9 08:35:29 2024]  invoke_syscall+0x50/0x128
[Fri Aug  9 08:35:29 2024]  el0_svc_common.constprop.0+0xc8/0xf0
[Fri Aug  9 08:35:29 2024]  do_el0_svc+0x24/0x38
[Fri Aug  9 08:35:29 2024]  el0_svc+0x40/0xe8
[Fri Aug  9 08:35:29 2024]  el0t_64_sync_handler+0x120/0x130
[Fri Aug  9 08:35:29 2024]  el0t_64_sync+0x190/0x198

Sometimes I see a

[Thu Aug  8 08:56:34 2024] vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512

load

Looking at top, I saw a load of 2.00, but no processes using more than a few percent of CPU. Some quick research pointed to blocked I/O as a possible cause of the load, presumably the never-returning v4l2 poll.
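
One way to confirm this is to look for tasks in uninterruptible sleep ("D" state), which count towards the load average without using any CPU; a minimal diagnostic sketch using the standard procfs layout:

import glob

# List tasks stuck in uninterruptible sleep ("D" state). Two python3 tasks
# blocked in the v4l2 poll would show up here and explain a load of 2.00
# on an otherwise idle system.
for status_path in glob.glob("/proc/[0-9]*/task/[0-9]*/status"):
    try:
        with open(status_path) as f:
            fields = dict(line.split(":\t", 1) for line in f if ":\t" in line)
    except OSError:
        continue  # task exited while scanning
    if fields.get("State", "").startswith("D"):
        print(status_path, fields["Name"].strip(), fields["State"].strip())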

Resolving the situation

After killing and restarting the application, it works again for a few hours. A reboot does not change anything.

Related issues

A lot of research led me to related issues, which perhaps have their origin in the same hard-to-track problem.

djhanove commented 1 month ago

I can confirm this is happening to me as well. I am running a large fleet of Pi 4s with a complex application. Completely at random I stop receiving buffers, and it hangs indefinitely until the camera is closed and restarted or the system is rebooted. I can now handle it gracefully via a watchdog thread, but it is quite concerning. I am not seeing any dmesg logs (likely because a ramdisk is used for logging to save the SD cards) and I get no errors in the program when this happens.

dmunnet commented 4 weeks ago

I can also confirm that I'm encountering the same issue: a Pi 4 with 4GB RAM running an up-to-date Bookworm OS. In my application I periodically capture video footage when sensor input is detected on a GPIO line. When the bug occurs, I do not get any error message and the program appears to be running as it should, but the recorded video files are zero-length. After forcing the program to quit, I get the same dmesg output that others have posted...

[ 9970.424030] vc_sm_cma_vchi_rx_ack: received response 850337, throw away...
[59407.677412] vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512
[59407.677432] bcm2835_mmal_vchiq: vchiq_mmal_submit_buffer: vc_sm_import_dmabuf_fd failed, ret -512
[59407.677437] bcm2835-codec bcm2835-codec: device_run: Failed submitting ip buffer
[59407.696712] ------------[ cut here ]------------
[59407.696731] WARNING: CPU: 1 PID: 3746 at drivers/media/common/videobuf2/videobuf2-core.c:2024 __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.696775] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device cmac algif_hash aes_arm64 aes_generic algif_skcipher af_alg bnep hci_uart btbcm brcmfmac_wcc bluetooth brcmfmac brcmutil cfg80211 ov5647 ecdh_generic ecc bcm2835_unicam libaes rfkill v4l2_dv_timings v4l2_fwnode raspberrypi_hwmon v4l2_async binfmt_misc bcm2835_codec(C) bcm2835_v4l2(C) rpivid_hevc(C) bcm2835_isp(C) v4l2_mem2mem bcm2835_mmal_vchiq(C) videobuf2_vmalloc videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 raspberrypi_gpiomem videodev vc_sm_cma(C) videobuf2_common snd_bcm2835(C) mc nvmem_rmem uio_pdrv_genirq uio i2c_dev fuse dm_mod ip_tables x_tables ipv6 rtc_ds1307 spidev regmap_i2c vc4 snd_soc_hdmi_codec drm_display_helper cec v3d drm_dma_helper gpu_sched i2c_mux_pinctrl drm_shmem_helper i2c_mux drm_kms_helper i2c_brcmstb spi_bcm2835 drm i2c_bcm2835 drm_panel_orientation_quirks snd_soc_core snd_compress snd_pcm_dmaengine snd_pcm snd_timer snd backlight
[59407.696947] CPU: 1 PID: 3746 Comm: python3.11 Tainted: G         C         6.6.31+rpt-rpi-v8 #1  Debian 1:6.6.31-1+rpt1
[59407.696955] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
[59407.696958] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[59407.696964] pc : __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.696980] lr : __vb2_queue_cancel+0x38/0x2a0 [videobuf2_common]
[59407.696994] sp : ffffffc0831f3a70
[59407.696997] x29: ffffffc0831f3a70 x28: ffffff80ba3887f8 x27: 0000000000000009
[59407.697006] x26: 0000000000000001 x25: 00000000f7b98094 x24: ffffff80ba388600
[59407.697015] x23: ffffff80401c92e0 x22: ffffff80bae57a98 x21: ffffff8041e3cd30
[59407.697024] x20: ffffff80bae57b40 x19: ffffff80bae57a98 x18: 0000000000000003
[59407.697032] x17: 0000000000000000 x16: ffffffd7102d4348 x15: ffffffc0831f35c0
[59407.697039] x14: 0000000000000004 x13: ffffff8043740028 x12: 0000000000000000
[59407.697047] x11: ffffff80b57462f8 x10: ffffff80b5746238 x9 : ffffffd71032baf8
[59407.697055] x8 : ffffffc0831f3970 x7 : 0000000000000000 x6 : 0000000000000228
[59407.697063] x5 : ffffff80459f4e40 x4 : fffffffe01167d20 x3 : 0000000080150013
[59407.697071] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000001
[59407.697079] Call trace:
[59407.697085]  __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.697101]  vb2_core_queue_release+0x2c/0x60 [videobuf2_common]
[59407.697115]  vb2_queue_release+0x18/0x30 [videobuf2_v4l2]
[59407.697136]  v4l2_m2m_ctx_release+0x30/0x50 [v4l2_mem2mem]
[59407.697164]  bcm2835_codec_release+0x64/0x110 [bcm2835_codec]
[59407.697178]  v4l2_release+0xec/0x100 [videodev]
[59407.697282]  __fput+0xbc/0x288
[59407.697292]  ____fput+0x18/0x30
[59407.697296]  task_work_run+0x80/0xe0
[59407.697306]  do_exit+0x30c/0x988
[59407.697311]  do_group_exit+0x3c/0xa0
[59407.697315]  get_signal+0x980/0x9b0
[59407.697320]  do_notify_resume+0x318/0x1370
[59407.697325]  el0_svc_compat+0x78/0x88
[59407.697335]  el0t_32_sync_handler+0x98/0x140
[59407.697341]  el0t_32_sync+0x194/0x198
[59407.697346] ---[ end trace 0000000000000000 ]---
[59407.697353] videobuf2_common: driver bug: stop_streaming operation is leaving buf 00000000680707d2 in active state

So far, I have not been able to trigger the bug with a minimal version that just does the video capture. But the bug is intermittent, so maybe I just haven't run the test code long enough for it to crop up. I will keep testing.

@djhanove, are you able to share how you handle it using the watchdog thread?

djhanove commented 4 weeks ago

@dmunnet something like this, I'll let you figure out the rest

import time
from picamera2 import Picamera2
import threading

import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class CameraManager:
    def __init__(self):

        self.picam2 = Picamera2()
        capture_config = self.picam2.create_video_configuration(
            main={"size": (640, 480), "format": "RGB888"},
            raw={"size": self.picam2.sensor_resolution},
            buffer_count=6,
            queue=True
        )
        self.picam2.configure(capture_config)
        self.picam2.start(show_preview=False)

        self.heartbeat_lock = threading.Lock()
        self.last_heartbeat = time.time()

        self.watchdog_thread = threading.Thread(target=self._watchdog)
        self.watchdog_thread.start()

        self.camera_thread = threading.Thread(target=self._capture_continuous)
        self.camera_thread.start()

    def _watchdog(self, timeout=10):  # Timeout in seconds
        while True:
            with self.heartbeat_lock:
                last_heartbeat = self.last_heartbeat
            if time.time() - last_heartbeat > timeout:
                logger.error("Camera capture thread unresponsive, attempting to reset.")
            time.sleep(timeout)

    def _capture_continuous(self):
        while True:
            try:
                frame = self.picam2.capture_array()
                self.signal_heartbeat()  # Signal that the thread is alive
            except Exception as e:
                logger.error(f"Failed to capture frame: {e}")
                break

            time.sleep(1 / 25)

    def signal_heartbeat(self):
        with self.heartbeat_lock:
            self.last_heartbeat = time.time()

if __name__ == "__main__":
    camera_manager = CameraManager()
    while True:
        time.sleep(1)
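
The reset step that the watchdog above only logs could, for example, tear the camera down and rebuild it. A minimal sketch of such an additional CameraManager method, assuming that closing the camera unblocks the stuck capture_array() call so the capture loop can be restarted (names and configuration mirror the snippet above):

    def _reset_camera(self):
        # Tear the stalled camera down completely and build it back up; the
        # pipeline does not recover on its own once buffers stop arriving.
        try:
            self.picam2.stop()
            self.picam2.close()
        except Exception as e:
            logger.warning(f"Stopping stalled camera failed: {e}")

        self.picam2 = Picamera2()
        capture_config = self.picam2.create_video_configuration(
            main={"size": (640, 480), "format": "RGB888"},
            buffer_count=6,
        )
        self.picam2.configure(capture_config)
        self.picam2.start(show_preview=False)
        self.signal_heartbeat()

        # Restart the capture loop in a fresh thread.
        self.camera_thread = threading.Thread(target=self._capture_continuous)
        self.camera_thread.start()

As several comments below note, if the hang is down in the kernel, close() itself can block, in which case killing and relaunching the whole process (for example via systemd) is the more robust recovery.
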
dmunnet commented 4 weeks ago

@djhanove, thanks for this! If I learn more about this issue in the course of my testing, I'll post an update.

caracoluk commented 3 weeks ago

I can also confirm I've seen what appears to be the same problem. While developing updated Python code using PiCamera2 to support the Pi Camera Module 3 as part of the "My Naturewatch Camera Server" I ran into this issue. I'm using a Pi Zero 2W and ended up testing on two separate Pi Zero 2Ws, both of which are running identical code and both have the problem. Interestingly I tested the same code using an older Pi Camera module and couldn't reproduce the problem.

I tried to find the simplest code that would reproduce the problem, but given that the failure can occur anywhere between 1-7 days on average, it's not been easy to say for sure whether the problem is present in a piece of test code or not. The app I've been developing also uses the H.264 encoder with a circular buffer, writing out to a file once motion has been detected. Usually when the problem occurs I see "imported vc_sm_cma_get_buffer failed -512" written in the dmesg log, but often only after I've restarted the Python process (the date stamp still shows the time the application stopped functioning). If I run "libcamera-hello --list-devices" once the system has entered the failed state, it hangs indefinitely. I ended up creating a watchdog to restart the Python process as a workaround. There's clearly a problem, and I'd come across the same issues that others have linked to above, but in each case those seemed to be either related to a few kernel versions that had a separate problem, or caused by faulty SD cards.

I shall keep an eye on this thread with interest as it would be good to resolve this issue properly.
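
The "libcamera-hello --list-devices" observation also suggests an external health check: if even that command hangs, the camera stack is wedged and the capture process should be restarted. A sketch of such a probe (the 10-second timeout is an arbitrary choice; on newer OS images the tool is called rpicam-hello):

import subprocess

def camera_stack_responsive(timeout=10):
    # Returns False if libcamera-hello hangs, which in the failed state
    # described above it does indefinitely; a watchdog can then restart
    # the capture process.
    try:
        subprocess.run(
            ["libcamera-hello", "--list-devices"],
            capture_output=True,
            timeout=timeout,
            check=False,
        )
        return True
    except subprocess.TimeoutExpired:
        return False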

djhanove commented 3 weeks ago

@caracoluk I have around 55 3B+'s with V1 camera modules where this doesn't happen (obviously older kernel and firmware). I didn't think it could be the camera module version, but you may be onto something. We swapped to Pi 4s and the v3 camera module at the same time, and it is only happening on our newer builds.

caracoluk commented 3 weeks ago


I did try lowering the frame rate to 15fps (from 25fps), thinking that the issue might be related to the increased resources required to handle the larger frames from the Camera Module 3. That didn't seem to make any difference. I enabled debugging using the "vclog -m" and "vclog -a" commands and nothing at all was shown in these logs when the problem occurred. The only logging I've found at all is in dmesg, and from some of the other threads it seems these messages are just symptoms of the problem and offer little insight into the underlying cause. I realise it's not going to be easy for the Pi developers to resolve without a means of easily and quickly reproducing the problem.

djhanove commented 3 weeks ago

@caracoluk I ran a maximum stress test with 4 threads for days at a time to try to induce it and also did not see an increased frequency.

agreed, not an easy one to debug

nzottmann commented 3 weeks ago

I can confirm that. I also tried maximum CPU stress without any difference. As for memory, I have 2/3 of CMA and half of system memory free while the application is running.

djhanove commented 3 weeks ago


Looking back at my alert logs, I did have some pi 4b's with V1 camera modules that also exhibited this so it is not exclusive to the v3 module.

caracoluk commented 3 weeks ago

@djhanove that's useful to know. In that case this problem is likely to be affecting quite a few people. Did you notice whether the pi's with the V1 camera modules had been running for a lot longer than those with the V3 modules before they failed?

djhanove commented 3 weeks ago


Didn't seem to make a difference for me. I just swapped some back to the V1 camera yesterday and had 2 lockups in 24 hours on one device.

caracoluk commented 3 weeks ago

What I have noticed is that I get the lockups more frequently if the Pis are running at a higher temperature. Over the warmer weeks (when the CPU temperature was around 72-80C) I'd see the lockups on average once every 2 days. Now it's a bit cooler, the CPU temperature is showing 65-70C, and neither of my Pis has had a lockup for the last 5 days so far. That might explain why I was failing to reproduce the problem with the shortest possible segment of code: not because the problem was no longer present, but because it was taking longer to occur. I see that some of you have run CPU stress tests to try to cause the problem, which should have pushed the temperature up a fair bit, so it's difficult to say whether this is just a coincidence or not.
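
Since temperature is only a suspicion at this point, it may help to log it alongside the application so that lockup times can later be correlated with how hot the board was running. A small sketch reading the standard thermal zone (vcgencmd measure_temp would work equally well):

import time

def log_cpu_temp(interval=60):
    # Append a timestamped CPU temperature reading once a minute; purely a
    # diagnostic aid for correlating lockups with temperature.
    while True:
        with open("/sys/class/thermal/thermal_zone0/temp") as f:
            temp_c = int(f.read().strip()) / 1000
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} cpu_temp={temp_c:.1f}C", flush=True)
        time.sleep(interval)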

djhanove commented 3 weeks ago

@caracoluk all of my devices have heat sinks on them which keeps the temps in the 60-65C range even under full load.

lowflyerUK commented 2 weeks ago

I am experiencing the same issue on a CM4 running up-to-date Bookworm. It never showed these errors while running the same picamera2 apps on Bullseye. I have a Raspberry Pi Camera Module 3 on one port and an ov5647 on the other. The same fault occurs when running with only the Camera Module 3, although less frequently.

Headless CM4 running 6.6.31-1+rpt1 (2024-05-29) aarch64. Fresh install then sudo apt update, sudo apt upgrade.

Official Raspberry pi 5 power supply 25W. vcgencmd get_throttled always gives 0x0.

Both camera cables are short - about 15cm. A possible source of radio interference is the WiFi antenna connected to the CM4, which runs hostapd for a local WiFi hotspot.

I'm working on a minimal program that exhibits the fault. I am also trying to see if the WiFI hotspot makes any difference (so far no obvious difference). I will also try more cooling on the CM4.

The program is based on the very wonderful picamera2 examples: a combination of mjpeg_server, capture_circular and capture_video_multiple, together with opencv object detection. Most of the time each instance uses less than 30% of one CPU, so top rarely shows less than 80% idle. In normal use no swap is used, and 1400MBytes are available on the 2GByte CM4. Normally there are no clients for the http mjpeg servers - and no correlation with the occurrence of the fault.

Every few days one or other of the picamera2 processes hangs - no frames received. I use systemd notify with a 10 sec watchdog to kill and relaunch. Most of the time after a hang, systemd manages to relaunch successfully. Sometimes it throws kernel errors as reported above.

vc_sm_cma_vchi_rx_ack: received response 78529294, throw away...

Once or twice both processes have stalled at the same time (both receiving no frames) and that seems to lead to a disaster. Systemd can't kill either process, so stale zombies pile up and eventually the 2GByte swapfile is all used up. The only way out is to reboot.

I'll keep you posted.
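
A minimal sketch of the systemd watchdog arrangement described above, assuming the service runs with Type=notify, WatchdogSec=10 and Restart=always; the heartbeat is only sent while frames keep arriving, so a stalled capture_array() makes systemd kill and relaunch the process:

import os
import socket

from picamera2 import Picamera2

def sd_notify(message):
    # Bare-bones sd_notify(3): send a datagram to the socket systemd passes
    # in NOTIFY_SOCKET (abstract addresses start with '@').
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return
    if addr.startswith("@"):
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)

picam2 = Picamera2()
picam2.configure(picam2.create_video_configuration())
picam2.start()

sd_notify("READY=1")
while True:
    picam2.capture_array()   # blocks indefinitely when the bug strikes
    sd_notify("WATCHDOG=1")  # only pet the watchdog after a frame arrived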

Akkiesoft commented 2 weeks ago

I have the same issue.

I am using two cameras for MJPEG live streaming. The problem did not occur with Bullseye and has only been a problem since the migration to Bookworm.

The lockups seem to happen within half a day to a day. I restart the script every hour as a workaround.

I have obtained the kernel error logs and have uploaded them to the following gists: https://gist.github.com/Akkiesoft/4400a32e4488bf58b7ce891781019399 https://gist.github.com/Akkiesoft/95e9cfcd24023dd9b3b8009fc07a472c

lowflyerUK commented 1 week ago

Actually I am finding it hard to narrow this down.

I can rule out the possible interference of the WiFi hotspot - the errors turn up the same with it switched off.

The frame rate makes a big difference. With both processes running at 20 frames per second, the errors come every few hours on both cameras, and the disaster (both processes hanging and failing to be killed, so runs out of memory) happens after a few days. With one process running at 10 frames per second and the other at 20, the errors are less frequent and with both processes at 10 frames per second the errors are much less frequent (and so far no disasters).

Still working on it...

caracoluk commented 1 week ago

@lowflyerUK I find that my app runs between 2-10 days without failing with the frame rate set to 20fps and a resolution of 1920x1080 using the camera module 3 on a Pi Zero 2W. I have two duplicate sets of hardware and both seem to exhibit the problem to the same degree. I did notice that on hot days the lock ups occur more frequently as I'm only using a heatsink with no active cooling. The reported CPU temperature has gone above 80C on hot days which seems to reduce the mean time between failures for me.

lowflyerUK commented 1 week ago

@caracoluk Thanks for your comment. The cpu in this CM4 does run rather warm - between 70-80deg, although it runs normally with 90% idle. Last night I added a permanent stress process that uses 100% of one core. This pushed the temperature to around 83deg so the cpu was occasionally throttled, but no errors for 4 hours - so no dramatic difference (but only for 4 hours). I'll see if I can think of a way to improve the cooling.

Update... I tried with 2 cores stressed. It failed within 2 hours on both camera processes, leading to my previously mentioned disaster, where defunct processes fill up the memory. So temperature could be a contributory factor for me, and/or possibly my picamera2 process fails if the cpu is busy.

caracoluk commented 1 week ago

@lowflyerUK if we could find a way to reliably reproduce the problem within as short a time frame as possible it would make it easier for the Pi developers to investigate. Ideally we'd need the sample code to be as simple as possible as I see from other threads that is usually their first request when looking into a problem.

lowflyerUK commented 1 week ago

@caracoluk Thanks for your encouragement! Yes, that is exactly what I am trying to do.

naroin commented 6 days ago

Hi,

It looks like we are dealing with a similar issue in our custom app (https://forums.raspberrypi.com/viewtopic.php?t=359992), and we are still debugging. For us it seems that the response received is a repetition of an old message received 128 messages earlier, which corresponds to the depth of the vchiq_slot_info[] array.

lowflyerUK commented 5 days ago

@naroin Many thanks for pointing out that thread. My errors do indeed seem remarkably similar. Typically I get:

Aug 30 02:01:53 nestboxcam3 systemd[1]: gardencam.service: Watchdog timeout (limit 10s)!
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_vchi_rx_ack: received response 9759796, throw away...
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512

after which systemd is able to relaunch the app.

Fully up to date 64 bit Raspberry Pi OS on CM4 with 2GBytes RAM and 2GBytes swap. 2 separate picamera2 processes: One (ov5647) with 2 streams at 10 frames per sec: 320x240 with MJPEGEncoder and 1296x972 with H264Encoder. The other (imx708) with 2 streams at 10 frames per sec: 384x216 with MJPEGEncoder and 1536x864 with H264Encoder.

Most of the time top shows around 80% idle.

If my sums are right, that adds up to around 27MPix/sec for all 4 encodes. So less than half of 1080p30 and about an eighth of the 220MPix/sec that @6by9 told us was the rated maximum for the ISP.

Maybe the CM4 can't reliably encode 4 outputs at once, even at 10 frames per sec? Should I try a Raspberry Pi 5?
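
For reference, the arithmetic behind that estimate:

# Pixel-rate estimate for the four encodes described above, all at 10 fps.
streams = [
    (320, 240, 10),    # ov5647, MJPEGEncoder
    (1296, 972, 10),   # ov5647, H264Encoder
    (384, 216, 10),    # imx708, MJPEGEncoder
    (1536, 864, 10),   # imx708, H264Encoder
]
total = sum(w * h * fps for w, h, fps in streams)
print(f"total: {total / 1e6:.1f} MPix/s")                  # ~27.5 MPix/s
print(f"1080p30: {1920 * 1080 * 30 / 1e6:.1f} MPix/s")     # ~62.2 MPix/s
print(f"fraction of 220 MPix/s: {total / 220e6:.3f}")      # ~0.125, about an eighth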

lowflyerUK commented 1 day ago

I haven't found simple sample code that replicates my issue, so I am inclined to feel that the ISP rate limit is the cause. In my case I think I can make a workaround by only issuing frames to the MJPEGEncoder when a client is actually watching the stream. As I am the only client and I hardly ever watch the realtime stream, the probability of failure will be a lot lower. This obviously won't be a solution for everybody.
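
A sketch of that workaround, assuming picamera2's start_encoder/stop_encoder can be used to run the MJPEG encoder only while a client is connected; the client-tracking hooks and stream_buffer are placeholders for the HTTP server's own bookkeeping, and the exact arguments may need adjusting for the picamera2 version in use:

from picamera2.encoders import MJPEGEncoder
from picamera2.outputs import FileOutput

mjpeg_encoder = MJPEGEncoder()
mjpeg_running = False

def on_first_client(picam2, stream_buffer):
    # Start the preview encoder only when someone is actually watching.
    global mjpeg_running
    if not mjpeg_running:
        picam2.start_encoder(mjpeg_encoder, FileOutput(stream_buffer))
        mjpeg_running = True

def on_last_client_gone(picam2):
    # Stop it again once the last viewer disconnects, reducing the load on
    # the ISP/encoder pipeline the rest of the time.
    global mjpeg_running
    if mjpeg_running:
        picam2.stop_encoder(mjpeg_encoder)
        mjpeg_running = False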

caracoluk commented 1 day ago

Interestingly, my two Pi Zero 2Ws haven't had a lockup in the last 3 weeks, whereas previously they would run for 2-5 days before it happened. There have been no software updates on either of them and I've not changed the code in any way, just left them running in the same positions. The only thing I'm aware of that has changed is the temperature, as it has been quite a bit cooler recently. I remember reading somewhere that the Pi starts to throttle the CPU if the temperature goes above 80C, and I was seeing temperatures reach this level. Perhaps the CPU throttling makes it more likely for this issue to occur?