raspberrypi / picamera2

New libcamera based python library
BSD 2-Clause "Simplified" License

[BUG] No exit from function capture_request() #1086

Open tutoduino opened 3 months ago

tutoduino commented 3 months ago

Describe the bug
Creating a time-lapse, I am capturing a large number of frames with a delay of 10 seconds between each capture. Sometimes the capture_request() function takes an unexpectedly long time to return (>100 seconds), or never returns at all. When this happens I observe that the CPU is at 100% (see the attached htop screenshot).

To Reproduce
The problem happens randomly after several tens of captures, but it is easy to reproduce.

Expected behaviour
An approximately constant time (in seconds) for each capture_request() call.

Console Output, Screenshots
See the attached htop screenshot.

Hardware
Raspberry Pi Zero 2 W with a Raspberry Pi Camera Module 3. The problem happens with several up-to-date OS versions, for example Raspberry Pi OS Lite (32-bit).

Additional context
The code and steps are available in my tutorial here: https://tutoduino.fr/tutoriels/raspberry-timelapse/
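
For reference, this is roughly the shape of the capture loop described above (a minimal sketch, not the exact tutorial code; the frame count is an assumption):

import time
from picamera2 import Picamera2

picam2 = Picamera2()
picam2.configure(picam2.create_still_configuration())
picam2.start()

start = time.monotonic()
for i in range(1000):
    request = picam2.capture_request()            # the call that occasionally never returns
    request.save("main", f"image{i:05d}.jpg")     # save the main stream as a JPEG
    request.release()                             # always hand the request back to the camera
    print(f"Capture saved: image{i:05d}.jpg at time {time.monotonic() - start:.1f}")
    time.sleep(10)                                # 10 second delay between captures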

davidplowman commented 3 months ago

Hi, thanks for the report. I set this running overnight as described and indeed, it locked up with 2 Python processes at 100%. Very strange. The bad news is that it took 16 hours to fail! Anyway, I'll have to gradually start adding some instrumentation, perhaps cut the delay from 10 to 5 seconds, and hope the next run gets a bit "luckier"...

tutoduino commented 3 months ago

I added debug traces. We see normal behaviour first, with 10 seconds between pictures 136, 137 and 138, but then an interval of 110 seconds before the next picture, 139. The capture/encode time itself is normal, and there is no specific debug trace to help understand what happens:

Capture saved: image00136.jpg at time 1371.8
picamera2.picamera2 DEBUG: Execute job: <picamera2.job.Job object at 0x665239b0>
picamera2.request INFO: Saved <picamera2.request.Helpers object at 0x66518fb0> to file image00137.jpg.
picamera2.request INFO: Time taken for encode: 517.1655340000143 ms.
Capture saved: image00137.jpg at time 1381.8
picamera2.picamera2 DEBUG: Execute job: <picamera2.job.Job object at 0x66523990>
picamera2.request INFO: Saved <picamera2.request.Helpers object at 0x66518fb0> to file image00138.jpg.
picamera2.request INFO: Time taken for encode: 517.6941099998658 ms.
Capture saved: image00138.jpg at time 1391.8
picamera2.picamera2 DEBUG: Execute job: <picamera2.job.Job object at 0x66523990>
picamera2.request INFO: Saved <picamera2.request.Helpers object at 0x66518fb0> to file image00139.jpg.
picamera2.request INFO: Time taken for encode: 517.8634320000128 ms.
Capture saved: image00139.jpg at time 1500.4
Timing issue !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
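
The "Capture saved ... at time ..." and "Timing issue" lines come from timing instrumentation around the capture call; roughly something like this sketch (the exact trace code and the threshold are assumptions):

import logging
import time

# Enable the picamera2 DEBUG/INFO traces shown above
# (the logger names in the output are picamera2.picamera2 and picamera2.request).
logging.basicConfig(level=logging.DEBUG)

EXPECTED_INTERVAL = 10.0          # seconds between captures in the time-lapse
start = time.monotonic()
previous = 0.0

def log_capture(filename):
    # Call right after saving each frame; prints the "Capture saved" traces above
    # and flags any capture that arrives far later than expected.
    global previous
    now = time.monotonic() - start
    print(f"Capture saved: {filename} at time {now:.1f}")
    if now - previous > 2 * EXPECTED_INTERVAL:    # e.g. the 110 s gap before image00139.jpg
        print("Timing issue !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
    previous = now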

nzottmann commented 3 months ago

Could you look at the dmesg -T output after the lockup to see if there are any messages from around the time the script stopped working?

davidplowman commented 3 months ago

Hi @nzottmann, I did try that actually when mine locked up (seeing as it took 16 hours, I had a good snoop around before rebooting), and there was absolutely nothing there. Turning on debug in the camera CSI2 receiver showed that the camera was still running but not filling any buffers. Quite why anything would be running at 100% is a bit of a mystery. I found I couldn't even kill the processes, and had to pull the power cord out.

Interestingly, it just locked up again for me, for about 1 minute. I noticed that htop was reporting 100% for those two processes again, so the symptoms look identical to the "permanent" lock-up. Except that after a minute it just carried on as if nothing had happened.

tutoduino commented 3 months ago

Maybe not linked, but calling dmesg -T in an ssh session froze the other ssh session that was running the capture program (at capture number 669)... The frames were still being captured, though (I see pictures recorded correctly up to capture number 730). Something weird with ssh, since it is running at 100%... The capture program stopped capturing after capture 730.
There is no information in dmesg -T (see the second attached htop screenshot).

tutoduino commented 3 months ago

Not able to reproduce yet on Bullseye (Operating System: Raspbian GNU/Linux 11 (bullseye), Kernel: Linux 6.1.21-v7+); it looks much more stable under Bullseye when another ssh session is opened in parallel.

Version of libcamera on Bullseye:

pi@raspi2w:~ $ libcamera-hello --version
libcamera-apps build: 7e4d3d71867f 17-07-2023 (07:34:37)
libcamera build: v0.0.5+83-bde9b04f

Version of libcamera on Bookworm:

pi@raspi2w:~ $ libcamera-hello --version
rpicam-apps build: 49344f2a8d18 17-06-2024 (12:19:10)
libcamera build: v0.3.0+65-6ddd79b5

davidplowman commented 2 months ago

When I've been running this and watching it with htop, I frequently find that htop appears to lock up and go to 100% CPU, while the camera carries on. So it's starting to feel like there's something more general going wrong. I'll keep investigating.

davidplowman commented 2 months ago

As far as I can tell, this is related to the Linux kernel version. Everything with a 6.1 kernel is fine, and everything with a 6.6 kernel is not. I think it's something to do with how the kernel swaps between processes when resources are getting tight. I don't think it's camera related at all, except insofar as the camera eats system resources in such a way as to provoke problems. This is all several light years outside any areas of expertise that I have, so I'm going to have to ask around.

tutoduino commented 2 months ago

> As far as I can tell, this is related to the Linux kernel version. Everything with a 6.1 kernel is fine, and everything with a 6.6 kernel is not. I think it's something to do with how the kernel swaps between processes when resources are getting tight. I don't think it's camera related at all, except insofar as the camera eats system resources in such a way as to provoke problems. This is all several light years outside any areas of expertise that I have, so I'm going to have to ask around.

I tried with the latest firmware (using rpi-update), and kernel 6.6.45 initially seemed to be more stable, but it also led to the same issue... :(