Can confirm this is happening to me as well. I'm running a large fleet of Pi 4s on a complex application. Completely at random I stop receiving buffers and it hangs indefinitely until the camera is closed and restarted or the system is rebooted. I can now handle it gracefully via a watchdog thread, but it is quite concerning. I am not seeing any dmesg logs (likely because a ramdisk is in use for logging to save SD cards) and I get no errors in the program when this happens.
I can also confirm that I'm encountering the same issue: a Pi 4 with 4GB RAM running an up-to-date Bookworm OS. In my application I periodically capture video footage when sensor input is detected on a GPIO line. When the bug occurs, I do not get any error message and the program appears to be running as it should, but the recorded video files are zero-length. After forcing the program to quit, I get the same dmesg output that others have posted...
[ 9970.424030] vc_sm_cma_vchi_rx_ack: received response 850337, throw away...
[59407.677412] vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512
[59407.677432] bcm2835_mmal_vchiq: vchiq_mmal_submit_buffer: vc_sm_import_dmabuf_fd failed, ret -512
[59407.677437] bcm2835-codec bcm2835-codec: device_run: Failed submitting ip buffer
[59407.696712] ------------[ cut here ]------------
[59407.696731] WARNING: CPU: 1 PID: 3746 at drivers/media/common/videobuf2/videobuf2-core.c:2024 __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.696775] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device cmac algif_hash aes_arm64 aes_generic algif_skcipher af_alg bnep hci_uart btbcm brcmfmac_wcc bluetooth brcmfmac brcmutil cfg80211 ov5647 ecdh_generic ecc bcm2835_unicam libaes rfkill v4l2_dv_timings v4l2_fwnode raspberrypi_hwmon v4l2_async binfmt_misc bcm2835_codec(C) bcm2835_v4l2(C) rpivid_hevc(C) bcm2835_isp(C) v4l2_mem2mem bcm2835_mmal_vchiq(C) videobuf2_vmalloc videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 raspberrypi_gpiomem videodev vc_sm_cma(C) videobuf2_common snd_bcm2835(C) mc nvmem_rmem uio_pdrv_genirq uio i2c_dev fuse dm_mod ip_tables x_tables ipv6 rtc_ds1307 spidev regmap_i2c vc4 snd_soc_hdmi_codec drm_display_helper cec v3d drm_dma_helper gpu_sched i2c_mux_pinctrl drm_shmem_helper i2c_mux drm_kms_helper i2c_brcmstb spi_bcm2835 drm i2c_bcm2835 drm_panel_orientation_quirks snd_soc_core snd_compress snd_pcm_dmaengine snd_pcm snd_timer snd backlight
[59407.696947] CPU: 1 PID: 3746 Comm: python3.11 Tainted: G C 6.6.31+rpt-rpi-v8 #1 Debian 1:6.6.31-1+rpt1
[59407.696955] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
[59407.696958] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[59407.696964] pc : __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.696980] lr : __vb2_queue_cancel+0x38/0x2a0 [videobuf2_common]
[59407.696994] sp : ffffffc0831f3a70
[59407.696997] x29: ffffffc0831f3a70 x28: ffffff80ba3887f8 x27: 0000000000000009
[59407.697006] x26: 0000000000000001 x25: 00000000f7b98094 x24: ffffff80ba388600
[59407.697015] x23: ffffff80401c92e0 x22: ffffff80bae57a98 x21: ffffff8041e3cd30
[59407.697024] x20: ffffff80bae57b40 x19: ffffff80bae57a98 x18: 0000000000000003
[59407.697032] x17: 0000000000000000 x16: ffffffd7102d4348 x15: ffffffc0831f35c0
[59407.697039] x14: 0000000000000004 x13: ffffff8043740028 x12: 0000000000000000
[59407.697047] x11: ffffff80b57462f8 x10: ffffff80b5746238 x9 : ffffffd71032baf8
[59407.697055] x8 : ffffffc0831f3970 x7 : 0000000000000000 x6 : 0000000000000228
[59407.697063] x5 : ffffff80459f4e40 x4 : fffffffe01167d20 x3 : 0000000080150013
[59407.697071] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000001
[59407.697079] Call trace:
[59407.697085] __vb2_queue_cancel+0x220/0x2a0 [videobuf2_common]
[59407.697101] vb2_core_queue_release+0x2c/0x60 [videobuf2_common]
[59407.697115] vb2_queue_release+0x18/0x30 [videobuf2_v4l2]
[59407.697136] v4l2_m2m_ctx_release+0x30/0x50 [v4l2_mem2mem]
[59407.697164] bcm2835_codec_release+0x64/0x110 [bcm2835_codec]
[59407.697178] v4l2_release+0xec/0x100 [videodev]
[59407.697282] __fput+0xbc/0x288
[59407.697292] ____fput+0x18/0x30
[59407.697296] task_work_run+0x80/0xe0
[59407.697306] do_exit+0x30c/0x988
[59407.697311] do_group_exit+0x3c/0xa0
[59407.697315] get_signal+0x980/0x9b0
[59407.697320] do_notify_resume+0x318/0x1370
[59407.697325] el0_svc_compat+0x78/0x88
[59407.697335] el0t_32_sync_handler+0x98/0x140
[59407.697341] el0t_32_sync+0x194/0x198
[59407.697346] ---[ end trace 0000000000000000 ]---
[59407.697353] videobuf2_common: driver bug: stop_streaming operation is leaving buf 00000000680707d2 in active state
So far, I have not been able to trigger the bug with a minimal version that features just the video capture. But the bug is intermittent, so maybe I just haven't run the test code long enough for it to crop up. I will keep testing.
@djhanove, are you able to share how you handle it using the watchdog thread?
@dmunnet something like this, I'll let you figure out the rest
import time
from picamera2 import Picamera2
import threading
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class CameraManager:
    def __init__(self):
        self.picam2 = Picamera2()
        capture_config = self.picam2.create_video_configuration(
            main={"size": (640, 480), "format": "RGB888"},
            raw={"size": self.picam2.sensor_resolution},
            buffer_count=6,
            queue=True
        )
        self.picam2.configure(capture_config)
        self.picam2.start(show_preview=False)

        self.heartbeat_lock = threading.Lock()
        self.last_heartbeat = time.time()

        self.watchdog_thread = threading.Thread(target=self._watchdog)
        self.watchdog_thread.start()
        self.camera_thread = threading.Thread(target=self._capture_continuous)
        self.camera_thread.start()

    def _watchdog(self, timeout=10):  # Timeout in seconds
        while True:
            with self.heartbeat_lock:
                last_heartbeat = self.last_heartbeat
            if time.time() - last_heartbeat > timeout:
                logger.error("Camera capture thread unresponsive, attempting to reset.")
                # Actual recovery (close and re-create the camera) left as an exercise.
            time.sleep(timeout)

    def _capture_continuous(self):
        while True:
            try:
                frame = self.picam2.capture_array()
                self.signal_heartbeat()  # Signal that the thread is alive
            except Exception as e:
                logger.error(f"Failed to capture frame: {e}")
                break
            time.sleep(1 / 25)

    def signal_heartbeat(self):
        with self.heartbeat_lock:
            self.last_heartbeat = time.time()


if __name__ == "__main__":
    camera_manager = CameraManager()
    while True:
        time.sleep(1)
@djhanove, thanks for this! If I learn more about this issue in the course of my testing, I'll post an update.
I can also confirm I've seen what appears to be the same problem. While developing updated Python code using Picamera2 to support the Pi Camera Module 3 as part of the "My Naturewatch Camera Server" project, I ran into this issue. I'm using a Pi Zero 2W and ended up testing on two separate Pi Zero 2Ws, both running identical code, and both have the problem. Interestingly, I tested the same code using an older Pi Camera module and couldn't reproduce the problem.
I tried to find the simplest code that would reproduce the problem, but given that the failure can take anywhere between 1-7 days on average to occur, it's not been easy to say for sure whether the problem is present in test code or not. The app I've been developing also uses the H.264 encoder with a circular buffer to write out to a file once motion has been detected. Usually when the problem occurs I see "imported vc_sm_cma_get_buffer failed -512" written in the dmesg log, but often only after I've restarted the Python process (the date stamp still shows the time the application stopped functioning). If I run "libcamera-hello --list-devices" once the system has entered the failed state, it hangs indefinitely. I ended up creating a watchdog to restart the Python process as a workaround. There's clearly a problem, and I'd come across the same issues that others have linked to above, but in each case those seemed to be either related to a few kernel versions with a separate problem, or caused by faulty SD cards.
I shall keep an eye on this thread with interest as it would be good to resolve this issue properly.
@caracoluk I have around 55 3B+'s with V1 camera modules where this doesn't happen (obviously older kernel and firmware). I didn't think it could be the camera module version, but you may be onto something. We swapped to Pi 4's and the V3 camera module at the same time, and it is only happening on our newer builds.
I did try lowering the frame rate to 15fps (from 25fps), thinking that the issue might be related to the increased resources required handling the larger frames from the Camera module 3. That didn't seem to make any difference. I enabled debugging using the "vclog -m" and "vclog -a" commands and nothing at all was shown in these logs when the problem occurs. The only logging I've found at all is in dmesg and from some of the other threads it seems like these are just symptoms of the problem and offer little insight into the underlying cause. I realise it's not going to be easy for the Pi developers to resolve without them having a means of easily/quickly reproducing the problem.
@caracoluk I ran a max stress test with 4 threads for days at a time to try to induce it and also did not see increased frequency.
Agreed, not an easy one to debug.
I can confirm that. I also tried maximum CPU stress with no difference. As for memory, I have 2/3 of CMA and half of system memory free while the application is running.
Looking back at my alert logs, I did have some Pi 4Bs with V1 camera modules that also exhibited this, so it is not exclusive to the V3 module.
@djhanove that's useful to know. In that case this problem is likely to be affecting quite a few people. Did you notice whether the pi's with the V1 camera modules had been running for a lot longer than those with the V3 modules before they failed?
Didn't seem to make a difference for me. I just swapped some back to the V1 camera yesterday and had 2 lockups in 24 hours on one device.
What I have noticed is that I get the lockups more frequently if the Pi's are running at a higher temperature. Over the warmer weeks (when the CPU temperature was showing temperatures of around 72-80C) I'd see the lockups on average once every 2 days. Now it's a bit cooler, the CPU temperature is showing 65-70C, and haven't had a lockup on either of my Pi's for the last 5 days so far. That might explain why when I was trying to create the shortest segment of code to reproduce the problem I was failing to do so. Not because the problem was no longer present, but because it would take longer to occur. I see that some of you have run CPU stress tests to try and cause the problem, and that should have pushed the temperature up a fair bit, so it's difficult to say if this is just a coincidence or not.
@caracoluk all of my devices have heat sinks on them which keeps the temps in the 60-65C range even under full load.
I am experiencing the same issue on a CM4 running up to date bookworm. It never showed these errors while running the same picamera2 apps on bullseye. I have a raspberry pi camera module 3 on one port and an ov5647 on the other. The same fault occurs when running with only the camera module 3, although less frequently.
Headless CM4 running 6.6.31-1+rpt1 (2024-05-29) aarch64. Fresh install then sudo apt update, sudo apt upgrade.
Official Raspberry pi 5 power supply 25W. vcgencmd get_throttled always gives 0x0.
Both camera cables short - about 15cm. Possible source of radio interference from WiFi antenna connected to the CM4 running hostapd for local WiFi hotspot.
I'm working on a minimal program that exhibits the fault. I am also trying to see if the WiFi hotspot makes any difference (so far no obvious difference). I will also try more cooling on the CM4.
The program is based on the very wonderful picamera2 examples. Combination of mjpeg_server, capture_circular, capture_video_multiple together with opencv object detection. Most of the time each instance uses less than 30% of 1 cpu, so top rarely shows less than 80% idle. In normal use no swap is used. 1400MBytes available on 2GByte CM4. Normally there are no clients for the http mjpeg servers - and no correlation with the occurrence of the fault.
Every few days one or other of the picamera2 processes hangs - no frames received. I use systemd notify with a 10sec watchdog to kill and relaunch. Most of the time after a hang, systemd managed to relaunch successfully. Sometimes it throws kernel errors as reported above.
vc_sm_cma_vchi_rx_ack: received response 78529294, throw away...
Once or twice both processes have stalled at the same time (both receiving no frames) and that seems to lead to a disaster. Systemd can't kill either process, so stale zombies pile up and eventually the 2GByte swapfile is all used up. The only way out is to reboot.
I'll keep you posted.
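For anyone wanting to replicate that workaround, here is a minimal sketch of the systemd watchdog arrangement described above; it is illustrative, not the actual service, and the unit directives it assumes are shown in the comments. The key point is that WATCHDOG=1 is only sent while frames are actually arriving, so a hung capture causes systemd to kill and relaunch the process.

# Minimal sketch of the systemd sd_notify heartbeat described above (illustrative).
# Assumes a service unit containing roughly:
#   [Service]
#   Type=notify
#   WatchdogSec=10
#   Restart=on-failure
import os
import socket
import time

from picamera2 import Picamera2


def sd_notify(message):
    # Send a notification to systemd via the NOTIFY_SOCKET datagram socket.
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return  # not running under systemd; heartbeat becomes a no-op
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract socket namespace
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(path)
        sock.send(message.encode())


picam2 = Picamera2()
picam2.configure(picam2.create_video_configuration())
picam2.start()
sd_notify("READY=1")

while True:
    picam2.capture_array()   # blocks waiting for a frame; never returns when the bug hits
    sd_notify("WATCHDOG=1")  # only pet the watchdog while frames are flowing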
I have the same issue.
I am using two cameras for MJPEG live streaming. The problem did not occur with Bullseye and has been happening since the migration to Bookworm.
The lockups seem to happen within half a day to a day. I restart the script every hour as a workaround.
I have obtained the kernel error logs and have uploaded them to the following gist URL. https://gist.github.com/Akkiesoft/4400a32e4488bf58b7ce891781019399 https://gist.github.com/Akkiesoft/95e9cfcd24023dd9b3b8009fc07a472c
Actually I am finding it hard to narrow this down.
I can rule out the possible interference of the WiFi hotspot - the errors turn up the same with it switched off.
The frame rate makes a big difference. With both processes running at 20 frames per second, the errors come every few hours on both cameras, and the disaster (both processes hanging and failing to be killed, so the system runs out of memory) happens after a few days. With one process running at 10 frames per second and the other at 20, the errors are less frequent, and with both processes at 10 frames per second the errors are much less frequent (and so far no disasters).
Still working on it...
@lowflyerUK I find that my app runs between 2-10 days without failing with the frame rate set to 20fps and a resolution of 1920x1080 using the camera module 3 on a Pi Zero 2W. I have two duplicate sets of hardware and both seem to exhibit the problem to the same degree. I did notice that on hot days the lock ups occur more frequently as I'm only using a heatsink with no active cooling. The reported CPU temperature has gone above 80C on hot days which seems to reduce the mean time between failures for me.
@caracoluk Thanks for your comment. The cpu in this CM4 does run rather warm - between 70-80deg, although it runs normally with 90% idle. Last night I added a permanent stress process that uses 100% of one core. This pushed the temperature to around 83deg so the cpu was occasionally throttled, but no errors for 4 hours - so no dramatic difference (but only for 4 hours). I'll see if I can think of a way to improve the cooling.
Update... I tried with 2 cores stressed. It failed within 2 hours on both camera processes, leading to my previously mentioned disaster, where defunct processes fill up the memory. So temperature could be a contributory factor for me, and/or possibly my picamera2 process fails if the cpu is busy.
@lowflyerUK if we could find a way to reliably reproduce the problem within as short a time frame as possible it would make it easier for the Pi developers to investigate. Ideally we'd need the sample code to be as simple as possible as I see from other threads that is usually their first request when looking into a problem.
@caracoluk Thanks for your encouragement! Yes, that is exactly what I am trying to do.
Hi,
It looks like we are dealing with a similar issue in our custom app (https://forums.raspberrypi.com/viewtopic.php?t=359992 ), and we are still debugging. For us it seems that the response received is a repetition of an old message received 128 messages earlier. This corresponds to the depth of the vchiq_slot_info[] array.
@naroin Many thanks for pointing out that thread. My errors do indeed seem remarkably similar. Typically I get:
Aug 30 02:01:53 nestboxcam3 systemd[1]: gardencam.service: Watchdog timeout (limit 10s)!
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_vchi_rx_ack: received response 9759796, throw away...
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512
after which systemd is able to relaunch the app.
Fully up to date 64 bit Raspberry Pi OS on CM4 with 2GBytes RAM and 2GBytes swap. 2 separate picamera2 processes: One (ov5647) with 2 streams at 10 frames per sec: 320x240 with MJPEGEncoder and 1296x972 with H264Encoder. The other (imx708) with 2 streams at 10 frames per sec: 384x216 with MJPEGEncoder and 1536x864 with H264Encoder.
Most of the time top shows around 80% idle.
If my sums are right, that adds up to around 27 MPix/sec across all 4 encodes. So less than half of 1080p30, and about an eighth of the 220 MPix/sec that @6by9 told us was the rated maximum for the ISP.
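For what it's worth, the arithmetic checks out (a quick sanity check using the stream sizes quoted above):

# Sanity check of the ~27 MPix/s figure for the four 10 fps encodes listed above.
streams = [(320, 240), (1296, 972), (384, 216), (1536, 864)]
fps = 10
total = sum(w * h for w, h in streams) * fps
print(total / 1e6)  # ~27.5 MPix/s, vs ~62 for 1080p30 and the 220 MPix/s ISP rating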
Maybe the CM4 can't reliably encode 4 outputs at once, even at 10 frames per sec? Should I try a Raspberry Pi 5?
I haven't found simple sample code that replicates my issue, so I am inclined to feel that the ISP rate limit is the cause. In my case I think I can make a workaround by only issuing frames to the MJPEGEncoder when a client is actually watching the stream. As I am the only client and I hardly ever watch the realtime stream, the probability of failure will be a lot lower. This obviously won't be a solution for everybody.
Interestingly, my two Pi Zero 2Ws haven't had a lockup in the last 3 weeks, whereas they would previously run for 2-5 days before it happened. There have been no software updates on either of them and I've not changed the code in any way, just left them running in the same positions. The only thing I'm aware of that has changed is the temperature, as it has been quite a bit cooler recently. I remember reading somewhere that the Pi starts to throttle the CPU if the temperature rises above 80C, and I was seeing temperatures reach this level. Perhaps the CPU throttling makes it more likely for this issue to occur?
I don't think this issue is isolated to picamera2. I'm having the exact same issue with rpicam-vid that I've detailed here. https://forums.raspberrypi.com/viewtopic.php?t=376766
I'd like to try and reproduce this. Does anyone have reasonably simple code that I can run on the latest Raspberry Pi OS on a Pi 4 (and preferably avoiding mediamtx because I know nothing about it) that is likely to provoke the problem? Thanks.
I think no one here has been able to reliably provoke the problem. The issue is still present for me across a massive fleet of Pi 4s, with no way to predictably reproduce it. I have to believe that others are facing the same challenges based on the dialogue here. I have a thread running picam to just capture frames, but my program is also running several other tasks in parallel.
Hi @davidplowman! Thanks!! I have only ever seen the issue on a CM4 with 2 cameras running. It might be quicker for you to try on a CM4?
For me, it is triggered after a few hours with 2 of these running, with sizes adjusted for the cameras you have: capTest1-0.txt
Hope this helps!
All the best!
OK, thanks. Had to mess about with some of the file names, but hopefully not with any material effect. It's running now so I'll leave it and see if anything happens. Is there any obvious sign if things go wrong, or do I need to keep an eye on dmesg?
I've been keeping a close eye on my two Pi Zero 2Ws which were both exhibiting the problem every few days between June and August. I've updated the firmware on both, and I haven't had a problem on either since I did this. One was upgraded a month ago and is running 6.6.51-v8+. The other was upgraded mid-August and is running 6.6.44-v8+. Neither has failed at all since the upgrade and I have not made any code changes in that time. However, I did notice that my Pi Zeros would fail more regularly on the hot summer days when the CPU was in the high 70Cs. It has been cooler over the last couple of months with the CPUs running in the high 60Cs, so I do wonder if this is making a difference. Nothing else has changed at all in this time.
Hi, we've activated dynamic debug traces with
echo 'file vc_sm.c +p' > /sys/kernel/debug/dynamic_debug/control
in order to follow buffer messaging.
It seems that a buffer is lost sometimes between the videocore and the kernel.
Pinging @6by9 if he has any possible insights into this.
@naroin can you post some logs (with the vcsm debug enabled) when you hit the problem?
@lowflyerUK I'm running the latest 6.6.51 kernel, have you tried the script you sent me with this kernel? (Just wanting to check the bug is still present!)
I have tried my script with 6.6.56-v8+ kernel (installed by rpi-update), bug still there.
Ok, hands up! The last time I tried it was on 6.6.47+rpt-rpi-v8. I can't update it for a couple of weeks I am afraid. Sorry!
Yes, we can share one example but we have modified the kernel logs to try to investigate the issue...
7,920587,527773748,-;vc_sm_cma_free: dmabuf 3af4fc43, buffer c3ea9700
7,920588,527773939,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf a5075a10/fd -1
7,920589,527774099,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d9d00000, size 3133440, id 39.
7,920590,527774222,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102259)
7,920591,527774240,-;[vc_sm_add_resource]: added buffer 6dfe0219 (name sm-host-resource, size 3133440)
7,920592,527774248,-;[vc_sm_cma_import_dmabuf_internal]: buffer 6dfe0219 (name sm-host-resource, size 3133440), imported 4294967295 vc_handle c000002e
7,920593,527774258,-;vc_sm_cma_import_dmabuf: dma_buf a5075a10/fd -1 imported to ptr 9a7fd358
7,920594,527774312,-;vc_sm_dma_buf_release dmabuf 3af4fc43, buffer 4d404806
7,920595,527774322,-;vc_sm_vpu_free: mem_handle c000002d
7,920596,527774335,-;vc_sm_dma_buf_release vpu_free done
7,920597,527774341,-;vc_sm_clean_up_dmabuf: buffer 4d404806, imported 4294967295
7,920598,527774563,-;vc_sm_dma_buf_release clean_up dmabuf done
7,920599,527774570,-;[vc_sm_release_resource]: buffer 4d404806 (name sm-host-resource, size 3133440), imported 4294967295
7,920600,527774578,-;[vc_sm_release_resource]: Waiting for VPU unmap response for buffer 4d404806 vc_handle c000002d
7,920601,527774585,-;vc_sm_dma_buf_release done
7,920602,527774586,-;vc_sm_vpu_event: Released addr d7600000, size 3133440, id 00000026, mem_handle c000002d
7,920603,527774598,-;[vc_sm_release_resource]: buffer 4d404806 (name sm-host-resource, size 3133440), imported 4294967295
7,920604,527774607,-;vc_sm_release_resource: Release our allocation - done
7,920605,527778251,-;vc_sm_cma_free: dmabuf b0254adc, buffer c3f1de80
7,920606,527778580,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf 35385238/fd -1
7,920607,527778838,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d5f00000, size 2088960, id 38.
7,920608,527779293,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102261)
7,920609,527779393,-;[vc_sm_add_resource]: added buffer 2ab20279 (name sm-host-resource, size 2088960)
7,920610,527779405,-;[vc_sm_cma_import_dmabuf_internal]: buffer 2ab20279 (name sm-host-resource, size 2088960), imported 4294967295 vc_handle c000002d
7,920611,527779415,-;vc_sm_cma_import_dmabuf: dma_buf 35385238/fd -1 imported to ptr 6db0a007
7,920612,527779704,-;vc_sm_dma_buf_release dmabuf b0254adc, buffer 5df0e4a8
7,920613,527779716,-;vc_sm_vpu_free: mem_handle c0000023
7,920614,527779728,-;vc_sm_dma_buf_release vpu_free done
7,920615,527779740,-;vc_sm_clean_up_dmabuf: buffer 5df0e4a8, imported 4294967295
7,920616,527780187,-;vc_sm_dma_buf_release clean_up dmabuf done
7,920617,527780207,-;[vc_sm_release_resource]: buffer 5df0e4a8 (name sm-host-resource, size 2088960), imported 4294967295
7,920618,527780221,-;[vc_sm_release_resource]: Waiting for VPU unmap response for buffer 5df0e4a8 vc_handle c0000023
7,920619,527780235,-;vc_sm_dma_buf_release done
...
7,921721,528330015,-;vc_sm_release_resource: Release our allocation - done
7,921722,528330052,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf 7270b242/fd -1
7,921723,528330259,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d3600000, size 3133440, id 38.
7,921724,528330457,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102385)
7,921725,528330510,-;[vc_sm_add_resource]: added buffer 05d2ba0b (name sm-host-resource, size 3133440)
7,921726,528330519,-;[vc_sm_cma_import_dmabuf_internal]: buffer 05d2ba0b (name sm-host-resource, size 3133440), imported 4294967295 vc_handle c000002d
7,921727,528330528,-;vc_sm_cma_import_dmabuf: dma_buf 7270b242/fd -1 imported to ptr 0a6c9f91
7,921728,528330663,-;vc_sm_dma_buf_release dmabuf 9a7fd358, buffer 9f498656
7,921729,528330678,-;vc_sm_vpu_free: mem_handle c000002b
7,921730,528330693,-;vc_sm_dma_buf_release vpu_free done
7,921731,528330700,-;vc_sm_clean_up_dmabuf: buffer 9f498656, imported 4294967295
7,921732,528331129,-;vc_sm_dma_buf_release clean_up dmabuf done
7,921733,528331136,-;[vc_sm_release_resource]: buffer 9f498656 (name sm-host-resource, size 3133440), imported 4294967295
7,921734,528331145,-;[vc_sm_release_resource]: Waiting for VPU unmap response for buffer 9f498656 vc_handle c000002b
7,921735,528331155,-;vc_sm_dma_buf_release done
7,921736,528331175,-;vc_sm_vpu_event: Released addr d2a00000, size 3133440, id 00000025, mem_handle c000002b
7,921737,528331191,-;[vc_sm_release_resource]: buffer 9f498656 (name sm-host-resource, size 3133440), imported 4294967295
7,921738,528331203,-;vc_sm_release_resource: Release our allocation - done
7,921739,528343688,-;vc_sm_cma_free: dmabuf 3af4fc43, buffer c3ea9a80
7,921740,528343885,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf d756d9a7/fd -1
7,921741,528344062,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d6d00000, size 3133440, id 37.
7,921742,528345597,-;vc_sm_cma_free: dmabuf b0254adc, buffer c3f1df00
7,921743,528346155,-;vc_sm_cma_import_dmabuf_internal: importing dma_buf 56e2da0e/fd -1
7,921744,528346400,-;[vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d5b00000, size 2088960, id 40.
3,921745,528346680,-;vc_sm_cma_vchi_rx_ack: received response 102259, throw away...
7,921746,528346829,-;[vc_sm_cma_import_dmabuf_internal]: requesting import memory (trans_id: 102388)
7,921747,528353952,-;[vc_sm_add_resource]: added buffer 5df0e4a8 (name sm-host-resource, size 2088960)
7,921748,528353972,-;[vc_sm_cma_import_dmabuf_internal]: buffer 5df0e4a8 (name sm-host-resource, size 2088960), imported 4294967295 vc_handle c0000023
7,921749,528353987,-;vc_sm_cma_import_dmabuf: dma_buf 56e2da0e/fd -1 imported to ptr 6db0a007
7,921750,528354112,-;vc_sm_dma_buf_release dmabuf b0254adc, buffer e456977f
7,921751,528354126,-;vc_sm_vpu_free: mem_handle c000002e
7,921752,528354138,-;vc_sm_dma_buf_release vpu_free done
7,921753,528354146,-;vc_sm_clean_up_dmabuf: buffer e456977f, imported 4294967295
In this example the duplicated response is: vc_sm_cma_vchi_rx_ack: received response 102259, throw away... It seems to appear when two buffers are requested within a short period of time.
I think I've managed to reproduce this now, and I see the same vc_sm_cma_vchi_rx_ack: received response 127885, throw away...
I have no idea if this message is meant to be fatal, but things seem to die very soon after.
[ 1161.598289] vc_sm_cma_free: handle 00000000363ec871/dmabuf 00000000363ec871
[ 1161.598572] vc_sm_cma_import_dmabuf_internal: importing dma_buf 000000006f5ccf39/fd -1
[ 1161.598732] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000d0200000, size 3112960.
[ 1161.598791] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.598810] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.598832] vc_sm_cma_vchi_rx_ack: received response 127885, throw away...
[ 1161.598914] [vc_sm_add_resource]: added buffer 0000000009e567f0 (name sm-host-resource, size 32768)
[ 1161.598924] vc_sm_cma_import_dmabuf: imported to ptr 00000000b0f09765
[ 1161.599015] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001a, mem_handle c0000020
[ 1161.599029] [vc_sm_release_resource]: buffer 0000000072e4c532 (name sm-host-resource, size 32768), imported 1
[ 1161.599036] vc_sm_release_resource: Release our allocation - done
[ 1161.599047] vc_sm_cma_free: handle 00000000b0f09765/dmabuf 00000000b0f09765
[ 1161.599088] vc_sm_dma_buf_release dmabuf 00000000b0f09765, buffer 0000000009e567f0
[ 1161.599097] vc_sm_dma_buf_release vpu_free done
[ 1161.599107] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.599110] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.599115] [vc_sm_release_resource]: Waiting for VPU unmap response on 0000000009e567f0
[ 1161.599119] vc_sm_dma_buf_release done
[ 1161.631760] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.631828] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.631977] [vc_sm_add_resource]: added buffer 000000003a9099b5 (name sm-host-resource, size 32768)
[ 1161.631989] vc_sm_cma_import_dmabuf: imported to ptr 00000000c5b2d219
[ 1161.632083] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001d, mem_handle c0000010
[ 1161.632097] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.632105] vc_sm_release_resource: Release our allocation - done
[ 1161.632118] vc_sm_cma_free: handle 00000000c5b2d219/dmabuf 00000000c5b2d219
[ 1161.632165] vc_sm_dma_buf_release dmabuf 00000000c5b2d219, buffer 000000003a9099b5
[ 1161.632176] vc_sm_dma_buf_release vpu_free done
[ 1161.632190] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.632193] [vc_sm_release_resource]: buffer 000000003a9099b5 (name sm-host-resource, size 32768), imported 1
[ 1161.632197] [vc_sm_release_resource]: Waiting for VPU unmap response on 000000003a9099b5
[ 1161.632203] vc_sm_dma_buf_release done
[ 1161.663490] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.663534] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.663649] [vc_sm_add_resource]: added buffer 0000000072e4c532 (name sm-host-resource, size 32768)
[ 1161.663657] vc_sm_cma_import_dmabuf: imported to ptr 00000000b0f09765
[ 1161.663733] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001a, mem_handle c0000020
[ 1161.663744] [vc_sm_release_resource]: buffer 000000003a9099b5 (name sm-host-resource, size 32768), imported 1
[ 1161.663750] vc_sm_release_resource: Release our allocation - done
[ 1161.663761] vc_sm_cma_free: handle 00000000b0f09765/dmabuf 00000000b0f09765
[ 1161.663794] vc_sm_dma_buf_release dmabuf 00000000b0f09765, buffer 0000000072e4c532
[ 1161.663803] vc_sm_dma_buf_release vpu_free done
[ 1161.663812] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.663816] [vc_sm_release_resource]: buffer 0000000072e4c532 (name sm-host-resource, size 32768), imported 1
[ 1161.663821] [vc_sm_release_resource]: Waiting for VPU unmap response on 0000000072e4c532
[ 1161.663827] vc_sm_dma_buf_release done
[ 1161.696589] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.696636] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.696749] [vc_sm_add_resource]: added buffer 0000000009e567f0 (name sm-host-resource, size 32768)
[ 1161.696756] vc_sm_cma_import_dmabuf: imported to ptr 00000000c5b2d219
[ 1161.696832] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001d, mem_handle c0000010
[ 1161.696843] [vc_sm_release_resource]: buffer 0000000072e4c532 (name sm-host-resource, size 32768), imported 1
[ 1161.696850] vc_sm_release_resource: Release our allocation - done
[ 1161.696860] vc_sm_cma_free: handle 00000000c5b2d219/dmabuf 00000000c5b2d219
[ 1161.696900] vc_sm_dma_buf_release dmabuf 00000000c5b2d219, buffer 0000000009e567f0
[ 1161.696915] vc_sm_dma_buf_release vpu_free done
[ 1161.696934] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.696936] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.696942] [vc_sm_release_resource]: Waiting for VPU unmap response on 0000000009e567f0
[ 1161.696947] vc_sm_dma_buf_release done
[ 1161.729894] vc_sm_cma_import_dmabuf_internal: importing dma_buf 0000000080f07d10/fd -1
[ 1161.729936] [vc_sm_cma_import_dmabuf_internal]: attempt to import "sm-host-resource" data - type 1, addr 0x00000000ce0c8000, size 32768.
[ 1161.730103] [vc_sm_add_resource]: added buffer 000000003a9099b5 (name sm-host-resource, size 32768)
[ 1161.730115] vc_sm_cma_import_dmabuf: imported to ptr 00000000b0f09765
[ 1161.730618] vc_sm_vpu_event: Released addr ce0c8000, size 32768, id 0000001a, mem_handle c0000020
[ 1161.730631] [vc_sm_release_resource]: buffer 0000000009e567f0 (name sm-host-resource, size 32768), imported 1
[ 1161.730638] vc_sm_release_resource: Release our allocation - done
[ 1161.730654] vc_sm_cma_free: handle 00000000b0f09765/dmabuf 00000000b0f09765
[ 1161.730694] vc_sm_dma_buf_release dmabuf 00000000b0f09765, buffer 000000003a9099b5
[ 1161.730708] vc_sm_dma_buf_release vpu_free done
[ 1161.730735] vc_sm_dma_buf_release clean_up dmabuf done
[ 1161.730738] [vc_sm_release_resource]: buffer 000000003a9099b5 (name sm-host-resource, size 32768), imported 1
[ 1161.730744] [vc_sm_release_resource]: Waiting for VPU unmap response on 000000003a9099b5
[ 1161.730749] vc_sm_dma_buf_release done
I'm slightly surprised that you're mapping and unmapping as many buffers as that. Ideally you want the same number of buffers on the output of libcamera as being fed to the encoder, so the buffers get mapped once and then the mappings retained for the duration. Mapping and unmapping every time will have a performance impact.
The dmabuf is meant to be held on to until the VPU has responded saying that it has unmapped it. The log you're seeing means that we've got a response for a message that isn't expected, and I don't know how you've managed that.
Actually the size of the allocation is only 32kB. That's not an image buffer, so what is it? Lens shading table?
const uint32_t MaxLsGridSize = 0x8000;
The LS table is indeed 32k.
By mapping/unmapping do you mean calling mmap/munmap from userland? That only happens once per configure() cycle in the IPA. Perhaps the 32k LS table size is a coincidence and it's actually another buffer? If it was the LS table, I expect we can reproduce this without running the encoder.
However, this happens on every frame in the kernel driver ctrl handler:
case V4L2_CID_USER_BCM2835_ISP_LENS_SHADING:
{
    struct bcm2835_isp_lens_shading *v4l2_ls;
    struct mmal_parameter_lens_shading_v2 ls;
    struct dma_buf *dmabuf;
    void *vcsm_handle;

    v4l2_ls = (struct bcm2835_isp_lens_shading *)ctrl->p_new.p_u8;
    /*
     * struct bcm2835_isp_lens_shading and struct
     * mmal_parameter_lens_shading_v2 match so that we can do a
     * simple memcpy here.
     * Only the dmabuf to the actual table needs any manipulation.
     */
    memcpy(&ls, v4l2_ls, sizeof(ls));
    dmabuf = dma_buf_get(v4l2_ls->dmabuf);
    if (IS_ERR_OR_NULL(dmabuf))
        return -EINVAL;

    ret = vc_sm_cma_import_dmabuf(dmabuf, &vcsm_handle);
    if (ret) {
        dma_buf_put(dmabuf);
        return -EINVAL;
    }

    ls.mem_handle_table = vc_sm_cma_int_handle(vcsm_handle);
    if (ls.mem_handle_table)
        /* The VPU will take a reference on the vcsm handle,
         * which in turn will retain a reference on the dmabuf.
         * This code can therefore safely release all
         * references to the buffer.
         */
        ret = set_isp_param(node,
                            MMAL_PARAMETER_LENS_SHADING_OVERRIDE,
                            &ls,
                            sizeof(ls));
    else
        ret = -EINVAL;

    vc_sm_cma_free(vcsm_handle);
    dma_buf_put(dmabuf);
    break;
}
Perhaps this is not right?
An experiment might be to change "rpi.alsc" in the tuning file to "x.rpi.alsc" (i.e. comment it out), and see if the problem goes away? (BTW, the Python script I was running overnight seems to be OK still.)
Perhaps it's better to do something like this in the isp driver: https://github.com/naushir/linux/commit/de2c0b348998fee9b8ce76c09c574b0c87fb6c92?
Completely untested!
The log you're seeing means that we've got a response for a message that isn't expected, and I don't know how you've managed that.
We do not know either how we've done that, but that unexpected message has a trans_id index equal to the expected (lost) message's trans_id index minus 128, and 128 matches the width of the vchiq_slot_info[] circular buffer. Please note that we added a local variable to get the vc_sm_cma_vchi_import() trans_id, as in our case and in naushir's case two vc_sm_cma_import_dmabuf_internal() calls occur at the same time:
@@ -721,6 +731,7 @@ vc_sm_cma_import_dmabuf_internal(struct vc_sm_privdata_t *private,
         struct sg_table *sgt = NULL;
         dma_addr_t dma_addr;
         u32 cache_alias;
+        u32 trans_id;
         int ret = 0;
         int status;
@@ -783,21 +794,23 @@ vc_sm_cma_import_dmabuf_internal(struct vc_sm_privdata_t *private,
                  __func__, import.name, import.type, &dma_addr, import.size);
         /* Allocate the videocore buffer. */
-        status = vc_sm_cma_vchi_import(sm_state->sm_handle, &import, &result,
-                                       &sm_state->int_trans_id);
+        status = vc_sm_cma_vchi_import(sm_state->sm_handle, &import, &result, &trans_id);
         if (status == -EINTR) {
                 pr_debug("[%s]: requesting import memory action restart (trans_id: %u)\n",
-                         __func__, sm_state->int_trans_id);
+                         __func__, trans_id);
                 ret = -ERESTARTSYS;
                 private->restart_sys = -EINTR;
                 private->int_action = VC_SM_MSG_TYPE_IMPORT;
+                private->int_trans_id = trans_id;
                 goto error;
         } else if (status || !result.res_handle) {
                 pr_debug("[%s]: failed to import memory on videocore (status: %u, trans_id: %u)\n",
-                         __func__, status, sm_state->int_trans_id);
+                         __func__, status, trans_id);
                 ret = -ENOMEM;
                 goto error;
         }
+        pr_debug("[%s]: requesting import memory (trans_id: %u)\n",
+                 __func__, trans_id);
         mutex_init(&buffer->lock);
         INIT_LIST_HEAD(&buffer->attachments);
Perhaps it's better to do something like this in the isp driver: naushir/linux@de2c0b3?
Completely untested!
Is libcamera resubmitting the same dmabuf every frame? The VPU is mapping the dmabuf, but I suspect that the set call makes it reprocess the table.
Your patch is close, but we actually want to store/compare the struct dma_buf * from dma_buf_get, otherwise a dup on the fd will give us a new fd for the same dmabuf.
I did wonder if we were running out of buffering in VCHI. With the ISP and encoding going on it is shovelling a fair number of commands around the place.
Yes, libcamera uses a single dmabuf handle for the ls table. I'll rework the change on Monday and give it some testing.
OK, thanks. Had to mess about with some of the file names, but hopefully not with any material effect. It's running now so I'll leave it and see if anything happens. Is there any obvious sign if things go wrong, or do I need to keep an eye on dmesg?
@davidplowman Sorry I missed this. Mine throws this, then one of the threads never gets any more frames.
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_vchi_rx_ack: received response 9759796, throw away...
Aug 30 02:01:53 nestboxcam3 kernel: vc_sm_cma_import_dmabuf: imported vc_sm_cma_get_buffer failed -512
I think this script is even quicker at causing it. Yes, sorry about my filenames: capTest0.txt
I've got an updated fix for the LS dmabuf caching. I seem to have improved behavior on my system so far, but it would be nice to have it tested by other folks.
The change is here: https://github.com/raspberrypi/linux/pull/6429. Once the CI has run on the commit, you can pull the kernel with
sudo rpi-update pulls/6429
Running this currently on a unit that has frequently been rebooting -- will keep you posted after a few days. I have some stats on number of lockups prior to the kernel patch.
During the development of an application that streams continuously using picamera2, I cannot get rid of a bug that leads to a lockup: picamera2 stops receiving buffers after multiple hours of runtime.
I tried to reproduce this with a minimal example on a Pi 4 with a current Raspberry Pi OS, but without success yet. I therefore hesitated for a long time to file a bug report, but I have several clues and found multiple similar issues, so I decided to collect my findings here, hoping to receive some help or to help others seeing the same symptoms.
I know I cannot really ask for help until I have a working minimal example on a current RPi OS, but I am happy to collect more debugging information that could help track this down, or to find other impacted users who can share relevant findings.
To Reproduce
Minimal example
I created a minimal example which encodes the main and lores streams to H.264 and forwards them to ffmpeg outputs. ffmpeg sends them via RTSP to a mediamtx instance (https://github.com/bluenviron/mediamtx), which includes a webserver for opening the stream in a browser using WebRTC. After multiple hours, usually in the range of 10-20 h, the mediamtx logs show that the RTSP streams have stopped. While debugging, I could see that the picamera2 event loops only contained empty events from that point on.
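For context, a rough sketch of what this kind of minimal example can look like is below; the stream sizes, bitrates and RTSP paths are placeholders, the exact picamera2 calls for attaching two encoders may differ slightly between versions, and this is not the exact script used here.

# Rough sketch of the dual-stream setup described above (sizes, bitrates and
# RTSP paths are placeholders; assumes a mediamtx instance listening on :8554).
import time

from picamera2 import Picamera2
from picamera2.encoders import H264Encoder
from picamera2.outputs import FfmpegOutput

picam2 = Picamera2()
config = picam2.create_video_configuration(
    main={"size": (1920, 1080), "format": "YUV420"},
    lores={"size": (640, 360), "format": "YUV420"},
)
picam2.configure(config)

# One H.264 encoder per stream, each pushed to mediamtx by ffmpeg over RTSP.
picam2.start_encoder(H264Encoder(bitrate=4_000_000),
                     FfmpegOutput("-f rtsp rtsp://127.0.0.1:8554/main"),
                     name="main")
picam2.start_encoder(H264Encoder(bitrate=1_000_000),
                     FfmpegOutput("-f rtsp rtsp://127.0.0.1:8554/lores"),
                     name="lores")
picam2.start()

while True:
    time.sleep(1)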
Context
This minimal example is a reduced version of my more complex application, which is running on a custom yocto build on a cm4 (currently on the cm4io board) with pi camera v3.
Currently, I can not reproduce the bug on a pi4 with this minimal example. But I can reproduce the bug with
Although I cannot prove it, I think the bug is present in Raspberry Pi OS too, but subtle differences lead to a much longer time to failure than on my custom build. During the last months, I observed that multiple factors changed the time to failure:
On my custom build, I try to follow Raspberry Pi OS versions of related software as close as possible, currently:
Symptoms
As described, when the failure happens, the outputs stop outputting frames. picamera2 stops supplying raw frames to the encoders.
dmesg
In most cases, dmesg shows that some kind of V4L2 frame polling method seems to block forever. The same message appears twice at the same time for two different Python tasks, presumably one for each stream.
Sometimes, but more rarely, there is nothing in dmesg.
Sometimes I see a
load
Looking at top, I saw a load of 2.00, but no processes showing more than a few % of CPU. A quick search pointed to IO workload as a possible cause, perhaps the never-returning V4L2 poll.
Resolving the situation
After killing and restarting the application, it works again for a few hours. A reboot does not change anything.
Related issues
Lots of research led me to related issues, which perhaps have their origin in the same hard-to-track problem.
#1086 finally made me open this issue.