raspberrypi / rpicam-apps

BSD 2-Clause "Simplified" License
378 stars 212 forks source link

libcamera-vid variability in frames captured with no reported frame drops #157

Closed dustinkerstein closed 2 years ago

dustinkerstein commented 2 years ago

When running the following command: libcamera-vid --framerate 10 --width 4096 --height 3040 -t 24000 --awbgains 1,1 --shutter 100000 --denoise cdn_fast --codec yuv420 --save-pts timestamps.txt -o /dev/shm/camera.yuv The resulting output file sometimes contains 227 frames, and sometimes 231. Strangely I've never seen any other outcomes across ~10 attempts. I am calculating the number of frames by taking the byte size of the file and dividing by 18677760 (4096x3040x3/2):

4314562560 total bytes / 18677760 bytes per frame = 231 frames 4239851520 total bytes / 18677760 bytes per frame = 227 frames

Based on the 10 second timeout, I would expect to see 240 frames (10fps * 24 seconds), but that's potentially a different issue, or just not guaranteed with a time-based timeout. On a slight tangent, it would be nice to be able to provide a num-buffers parameter to capture an exact number of frames (important in my use-case). See here for a similar feature request for libcamerasrc.

Back to the variable frame output. You can find both timestamp files here: 227.txt, 231.txt

In the 227 timestamp file, there is a 4 frame gap between 1100 and 1600. However, in both cases libcamera-vid never logs a frame drop (ie. "Dropping unmatched input"). See the debug log for a 231 frame capture here - 231_Frame_Log.txt and for the 227 capture here - 227_Frame_Log.txt

Are you able to replicate this behavior? Let me know if there's anything else I can get.

I am testing with an Arducam Mini IMX477 module + Compute Module 4 + Official IO Board + Bullseye Lite 64bit + 5.10.63-v8+ Kernel running headless over ssh. Full system info here - https://pastebin.com/654vBepL

$ libcamera-hello --version
libcamera-apps build: 2a38ae93f143 21-10-2021 (14:53:29)
libcamera build: v0.0.0+3156-f4070274
$ vcgencmd version
Oct 29 2021 10:47:33
Copyright (c) 2012 Broadcom
version b8a114e5a9877e91ca8f26d1a5ce904b2ad3cf13 (clean) (release) (start)
naushir commented 2 years ago

Based on the 10 second timeout, I would expect to see 240 frames (10fps * 24 seconds), but that's potentially a different issue, or just not guaranteed with a time-based timeout.

I think this is just down to the inaccuracies of using an simple elapsed time based test to signal a timeout. I doubt this can be improved upon much.

On a slight tangent, it would be nice to be able to provide a num-buffers parameter to capture an exact number of frames (important in my use-case). See here for a similar feature request for libcamerasrc.

This is a good idea and probably something that should not be too difficult to implement.

dustinkerstein commented 2 years ago

I think this is just down to the inaccuracies of using an simple elapsed time based test to signal a timeout. I doubt this can be improved upon much.

But what do you think about that gap in the timestamp file? To me that would seem to indicate that it's not just a time counting difference.

This is a good idea and probably something that should not be too difficult to implement.

Ok, great. Having both libcamera-vid and libcamerasrc support num-buffers would be awesome. Thanks!

naushir commented 2 years ago

But what do you think about that gap in the timestamp file? To me that would seem to indicate that it's not just a time counting difference.

There has been a bug we fixed very recently to do with dropped frames. If you haven't updated your firmware in a while, you may be hitting this. To confirm, can you temporarily put isp_freq=500 in your /boot/config.txt file and see if this stops the issue?

This is a good idea and probably something that should not be too difficult to implement.

Ok, great. Having both libcamera-vid and libcamerasrc support num-buffers would be awesome. Thanks!

We will add support in libcamera-vid, but for libcamerasrc, you would have to send the request through either the libcamera mailing list or through bugzilla (which I see you have already raised). The latter is not something that is maintained by us.

dustinkerstein commented 2 years ago

There has been a bug we fixed very recently to do with dropped frames. If you haven't updated your firmware in a while, you may be hitting this. To confirm, can you temporarily put isp_freq=500 in your /boot/config.txt file and see if this stops the issue?

I will try updating my firmware later today, but wanted to quickly let you know before then, that my previous testing was with gpu_freq=800 (which also sets isp_freq=800 - verified by watching the clocks during camera capture). I'll let you know what it looks like after an update.

dustinkerstein commented 2 years ago

And a quick update from testing with no gpu_overclock (no overclocks at all), and I did a few tests of the above command. Of the 5 tests, I actually saw 2 report a dropped frame, so I discarded those results (potentially related to the bug you were describing). Of the other 3 tests, 2 captured 231 frames, and 1 actually captured only 221 (4127784960 bytes) frames. The latter had no reported frame drops in the console. From the timetamps.txt file, you can see a 10 frame gap between 1099.984 and 2199.971 - 221.txt

naushir commented 2 years ago

And a quick update from testing with no gpu_overclock (no overclocks at all), and I did a few tests of the above command. Of the 5 tests, I actually saw 2 report a dropped frame, so I discarded those results (potentially related to the bug you were describing). Of the other 3 tests, 2 captured 231 frames, and 1 actually captured only 221 (4127784960 bytes) frames. The latter had no reported frame drops in the console. From the timetamps.txt file, you can see a 10 frame gap between 1099.984 and 2199.971 - 221.txt

Is there anything else running in the system when you run this application? If you set your output file to /dev/null, does it eliminate these drops? I want to rule out the possibility of file IO causing this drop.

dustinkerstein commented 2 years ago

Outputting to /dev/null does seem to result in consistent 231 frames every time across 10 tests (with an overclock). So this raises a few thoughts for me:

  1. Is it really possible we're hitting memory bandwidth issues when writing to /dev/shm? If so, is there any way to tweak this tempfs filesystem setup to be better optimized for high bandwidth YUV capture? My worry is that if we're hitting issues at 10fps, there's no way it's going to handle full resolution at 30fps :(
  2. Beyond catching the gap in the timestamps file, I don't get any frame drop debug messages in the log. Do you think it's possible to find out exactly where those drops are occurring, and why they're not caught by the log (maybe it's something that could be tweaked and not a pure hardware/memory bandwidth issue...)
  3. If it really does seem like a /dev/shm memory bandwidth limitation (and no tweaks to the tempfs setup helps), is it at all possible to get libcamera/libcamera-apps to buffer using high memory rather than CMA? Then, for my rather special usecase (limited to 300 frame bursts), I could maybe just patch the sources to have a much larger number of buffers. Do you think that would that be feasible?

Thanks again.

Update - Forgot to answer one of your questions. I'm not running anything else. It's essentially a vanilla Bullseye Lite 64bit install running on an 8GB CM4 module, disabled swap memory, and currently an 800mhz GPU/ISP overclock.

naushir commented 2 years ago

Outputting to /dev/null does seem to result in consistent 231 frames every time across 10 tests (with an overclock). So this raises a few thoughts for me:

Ok, that at least puts us at a baseline.

  1. Is it really possible we're hitting memory bandwidth issues when writing to /dev/shm? If so, is there any way to tweak this tempfs filesystem setup to be better optimized for high bandwidth YUV capture? My worry is that if we're hitting issues at 10fps, there's no way it's going to handle full resolution at 30fps :(

It's possible, but I've not done any tests to say conclusively. However, I am curious about the /dev/shm device. On my pi, df -h /dev/shm reports the size of the disk as 4.0Gb. This is a bit suspicious, as that is the total memory available on my board. Could it be that /dev/shm actually swaps out to disk (even though it is meant to be a ramdisk) if it fills to a certain limit? What does the results show if you create your own ramdisk to a particular size using fstab?

  1. Beyond catching the gap in the timestamps file, I don't get any frame drop debug messages in the log. Do you think it's possible to find out exactly where those drops are occurring, and why they're not caught by the log (maybe it's something that could be tweaked and not a pure hardware/memory bandwidth issue...)

There is a possible framedrop in the kernel device driver that libcamera has no possibility of knowing. If you were to enable verbose kernel logging, this should show up.

  1. If it really does seem like a /dev/shm memory bandwidth limitation (and no tweaks to the tempfs setup helps), is it at all possible to get libcamera/libcamera-apps to buffer using high memory rather than CMA? Then, for my rather special usecase (limited to 300 frame bursts), I could maybe just patch the sources to have a much larger number of buffers. Do you think that would that be feasible?

Unfortunately not, the hardware requires buffers to be in the CMA region or they will not be addressable.

I may have asked this before, but do you need YUV420 output? Would MJPEG with a suitable high quality encode be acceptable for your usecase? I seem to be able to save 10fps MJPEG stream without framedrops on a standard sdcard:

libcamera-vid --frames 300 -o test.mjpeg -v --codec mjpeg --save-pts pts.txt --width 4096 --height 3040
dustinkerstein commented 2 years ago

It's possible, but I've not done any tests to say conclusively. However, I am curious about the /dev/shm device. On my pi, df -h reports the size of the disk as 4.0Gb. This is a bit suspicious, as that is the total memory available on my board. Could it be that /dev/shm actually swaps out to disk (even though it is meant to be a ramdisk) if it fills to a certain limit? What does the results show if you create your own ramdisk to a particular size using fstab?

Yeah, I'm a little unsure how it handles situations where /dev/shm is full but the OS needs some of that ram... I will try testing that out later. Throughout all of my tests I have actually been using a 6GB ramdisk specified via fstab: none /dev/shm tmpfs defaults,size=6G 0 0

There is a possible framedrop in the kernel device driver that libcamera has no possibility of knowing. If you were to enable verbose kernel logging, this should show up.

Interesting. I will try to enable verbose logging and will see if I can catch those drops.

Unfortunately not, the hardware requires buffers to be in the CMA region or they will not be addressable.

Bummer. And I've tried several ways to get more than ~576MB CMA and failed. I'm not sure it's possible on the Pi, but would love to be told otherwise. Do you think there's maybe a way to copy the CMA buffer -> high memory buffer inside either libcamera/libcamera-apps in a way that would exceed the performance of /dev/shm (which probably has some overhead)?

I may have asked this before, but do you need YUV420 output? Would MJPEG with a suitable high quality encode be acceptable for your usecase? I seem to be able to save 10fps MJPEG stream without framedrops on a standard sdcard:

Unfortunately I need to hit at least 20-24fps (ideally 30fps). Once I get the IMX477 4 lane CSI driver working, I will run some tests to see just how high the MJPEG encoder will go.

Thanks for all your help.

dustinkerstein commented 2 years ago

Quick update. Looks like /dev/shm isn't nearly as fast as I would expect:

pi@raspberrypi:~ $ sudo dd if=/dev/zero of=/dev/shm/zero bs=4k count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 0.825578 s, 496 MB/s
pi@raspberrypi:~ $ sudo dd if=/dev/zero of=/dev/shm/zero bs=4k count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 0.777537 s, 527 MB/s
pi@raspberrypi:~ $ sudo dd if=/dev/zero of=/dev/shm/zero bs=4k count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 0.768167 s, 533 MB/s
dustinkerstein commented 2 years ago

Interesting. Raising the block size helps:

pi@raspberrypi:~ $ sudo dd if=/dev/zero of=/dev/shm/zero bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB, 625 MiB) copied, 0.807093 s, 812 MB/s

Do you happen to know how libcamera writes the data to disk?

naushir commented 2 years ago

Do you think there's maybe a way to copy the CMA buffer -> high memory buffer inside either libcamera/libcamera-apps in a way that would exceed the performance of /dev/shm (which probably has some overhead)?

This is not easy in libcamera-apps, it was just not structured to do this. However, it would be entirely possible to create a new libcamera-app that just memcpys the CMA buffer into another buffer or memory space and return the CMA buffer back for the hardware to reuse.

You might also want to try creating a ramdisk not called /dev/shm (maybe /ramdisk) with a fixed size to see if that improves things. /dev/shm seems to have a specific usage in Linux.

Unfortunately I need to hit at least 20-24fps (ideally 30fps). Once I get the IMX477 4 lane CSI driver working, I will run some tests to see just how high the MJPEG encoder will go.

Does the size of the YUV image matter much? If you get the ISP to scale that down, your IO and memory bandwidth will be reduced linearly, and that might help.

Do you happen to know how libcamera writes the data to disk?

This happens using standard file IO operations, see https://github.com/raspberrypi/libcamera-apps/blob/main/output/file_output.cpp. Perhaps there is scope to improve performance there as well.

dustinkerstein commented 2 years ago

This is not easy in libcamera-apps, it was just not structured to do this. However, it would be entirely possible to create a new libcamera-app that just memcpys the CMA buffer into another buffer or memory space and return the CMA buffer back for the hardware to reuse.

Interesting. That's probably a bit beyond my skills, but I'll keep it in mind.

You might also want to try creating a ramdisk not called /dev/shm (maybe /ramdisk) with a fixed size to see if that improves things. /dev/shm seems to have a specific usage in Linux.

I did a little research on the internet and it seems, performance-wise, tempfs is about the same compared to ramfs. I may give it a try just to see later this week.

Does the size of the YUV image matter much? If you get the ISP to scale that down, your IO and memory bandwidth will be reduced linearly, and that might help.

Yeah, for our use-case, we need as much resolution as possible (180x140ish degree fisheye capture and reprojecting to equirectangular).

This happens using standard file IO operations, see https://github.com/raspberrypi/libcamera-apps/blob/main/output/file_output.cpp. Perhaps there is scope to improve performance there as well.

Interesting. Every last bit (pun half intended) of bandwidth will help when trying to capture 560MB/s (YUV420 4056x3040 @ 30fps).

For now though, I first want to figure out where those missing frames are. I'll try to do some kernel logging and let you know what I find. Thanks for the help.

dustinkerstein commented 2 years ago

Hmm, I just tried to update the kernel log level via echo "1" | sudo tee /proc/sys/kernel/printk but that doesn't seem to show any logs when capturing and monitoring via journalctl -f - Do you think I need to build a kernel with the v4l2 dev log option enabled?

naushir commented 2 years ago

To enable the Unicam kernel debug logging, you can append bcm2835_unicam.debug=5 to /boot/cmdline.txt.

However, I've just had a quick peek, and unfortunately, there is no clear single line of logging that indicates a dropped frame. Rather, it has to be inferred from multiple lines of trace logging which is not very convenient. I'll look to add a single line of logging and tag you in the PR.

dustinkerstein commented 2 years ago

Sounds good. I'll be ready to test. Thanks!

kbingham commented 2 years ago

Do we have sequential sequence numbers out of the unicam which would indicate frame loss/drops that we can catch/report in libcamera?

naushir commented 2 years ago

Do we have sequential sequence numbers out of the unicam which would indicate frame loss/drops that we can catch/report in libcamera?

Yes we do, and this can be used to indicate frame drops as well!

kbingham commented 2 years ago

Maybe this is worthy of something like:

int dropped = currentSequence - previousSequence - 1; if (dropped) LOG(V4L2, Debug) << "Dropped " << dropped << " frame(s)"

in the V4L2VideoDevice then?

dustinkerstein commented 2 years ago

LOG(V4L2, Debug) is only output when v4l2 debug messages are set in the kernel flags, right? If so, I think it would be nice to show all frame drop related messages to the user/console regardless of v4l2 debug settings.

dustinkerstein commented 2 years ago

With the latest kernel (and updated libcamera and libcamera-apps) I am able to catch those frame drops. This timed capture (24 seconds @ 10fps) recorded 224 frames. libcamera-vid did catch one frame drop:

[0:08:02.959242278] [3065]  WARN RPI raspberrypi.cpp:1862 Dropping unmatched input frame in stream Unicam Embedded
[0:08:02.959277519] [3065] DEBUG RPI raspberrypi.cpp:1876 Could not find matching embedded buffer

And the new kernel logging caught a few more (see full log below for all drops):

Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 1
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4001, sequence 16, lines done 760
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4003, sequence 16, lines done 1520
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4001, sequence 16, lines done 2280
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x6, STA: 0xC002, sequence 16, lines done 3039
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x1, STA: 0x4001, sequence 17, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [0] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [1] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 1
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x5003, sequence 17, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4001, sequence 17, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4003, sequence 17, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x6, STA: 0xC000, sequence 17, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x1, STA: 0x4003, sequence 18, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [0] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [1] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 1
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x5001, sequence 18, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4003, sequence 18, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4001, sequence 18, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x6, STA: 0xC002, sequence 18, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x1, STA: 0x4001, sequence 19, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [0] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [1] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 1
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x5003, sequence 19, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4001, sequence 19, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4003, sequence 19, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x6, STA: 0xC000, sequence 19, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x1, STA: 0x4003, sequence 20, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [0] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [1] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 1
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x5001, sequence 20, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4003, sequence 20, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4001, sequence 20, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x6, STA: 0xC002, sequence 20, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x1, STA: 0x4001, sequence 21, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [0] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [1] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 1
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x5003, sequence 21, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4001, sequence 21, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4003, sequence 21, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x6, STA: 0xC000, sequence 21, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x1, STA: 0x4003, sequence 22, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [0] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [1] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 1
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x5001, sequence 22, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4003, sequence 22, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4001, sequence 22, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x6, STA: 0xC002, sequence 22, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x1, STA: 0x4001, sequence 23, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: [0] Dropping frame, buffer not available at FS
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 1
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4003, sequence 23, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4001, sequence 23, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4003, sequence 23, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x6, STA: 0xC000, sequence 23, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x1, STA: 0x4003, sequence 24, lines done 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 0
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: Scheduling dummy buffer for node 1
Nov 30 23:41:38 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4001, sequence 24, lines done 760
Nov 30 23:41:39 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4003, sequence 24, lines done 1520
Nov 30 23:41:39 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x4, STA: 0x4001, sequence 24, lines done 2280
Nov 30 23:41:39 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x6, STA: 0xC002, sequence 24, lines done 3039
Nov 30 23:41:39 raspberrypi kernel: unicam fe801000.csi: ISR: ISTA: 0x1, STA: 0x4001, sequence 25, lines done 0

One other interesting note is that the kernel log shows that 240 frames were requested. And even when it captures 231 (the max I've ever seen with 24 seconds @ 10fps), and you add the 7 intentionally dropped frames: [0:07:59.853805174] [3069] DEBUG IPARPI raspberrypi.cpp:273 Drop 7 frames on startup, the sum doesn't add up to 240. And during that 231 frame case, the kernel log doesn't report any frame drops.

All of the logs can be found below: 231timestamps.txt 231framesKernel.txt 224timestamps.txt 224framesKernel.txt 224framesApp.txt

Let me know if there's anything specific you'd like me to test. I'm going to also try the new --frames parameter in libcamera-vid to see if the discrepancies are any different.

dustinkerstein commented 2 years ago

And here are some logs when using the new --frames parameter: 240timestampsFramesParameter.txt 240framesKernelFramesParameter.txt 240framesAppFramesParameter.txt

During this capture, 240 frames were captured (4482662400/18677760), 0 drops were caught by libcamera-vid, and 7 drops were caught by the kernel logging.

kbingham commented 2 years ago

Maybe this is worthy of something like:

int dropped = currentSequence - previousSequence - 1; if (dropped) LOG(V4L2, Debug) << "Dropped " << dropped << " frame(s)"

in the V4L2VideoDevice then?

Interestingly, we've just hit exactly this on the IPU3 as well - so I've started adding frame drop detection logs in to V4L2VideoDevice.

Will probably want some rate-lmiited logging infrastructure in libcamera LOG_RATELIMITED(n/sec,...)

On the IPU3 our sequence numbers for the completed buffers come from the ISP which are always sequential even if the CSI receiver dropped a frame. I thought the ISP should preserve the sequence number of any buffer input into it, as the output - but it doesn't seem to.

What happens on RPi - does the ISP take any incoming sequence number and keep it as the outgoing sequence number? or does it generate a monotonic sequence number on the ISP outputs ?

naushir commented 2 years ago

What happens on RPi - does the ISP take any incoming sequence number and keep it as the outgoing sequence number? or does it generate a monotonic sequence number on the ISP outputs ?

Yes, it does independent monotonic sequence counting in the ISP driver and Unicam drivers...

naushir commented 2 years ago

Let me know if there's anything specific you'd like me to test. I'm going to also try the new --frames parameter in libcamera-vid to see if the discrepancies are any different.

Could you run libcamera-vid again, but with:

For me, both runs produce no frame drops with the following command:

libcamera-vid --framerate 10 --width 4056 --height 3040 -t 24000 --codec yuv420 --save-pts pts.txt

I have no overclocking enabled either.

naushir commented 2 years ago

I've had a bit more of a play with this and here are some further observations:

Another thing I noticed, your commands seem to use --width 4096 which is almost certainly not what you want. The sensor output is 4056 pixels wide, so you will be getting a small resample on the output to resize to 4096 pixels wide, and this will impact image quality. If you are doing this because of 64-byte wide alignment constraints, you probably want to use --width 4032 and add --roi=0,0,0.99409,1 to ensure the ISP crops an output window from the frame without resizing.

dustinkerstein commented 2 years ago

Good stuff. Seems like we've isolated things a bit. Confirming your results, when outputting to /dev/null I see no frame drops,

LIBCAMERA_LOG_LEVELS=0 ./libcamera-vid -t 24000 --framerate 10 --width 4056 --height 3040 --awbgains 1,1 --shutter 10000 --denoise cdn_fast --codec yuv420 --save-pts ~/timestamps.txt -o /dev/null

And also no drops when using no -o parameter:

LIBCAMERA_LOG_LEVELS=0 ./libcamera-vid -t 24000 --framerate 10 --width 4056 --height 3040 --awbgains 1,1 --shutter 10000 --denoise cdn_fast --codec yuv420 --save-pts ~/timestamps.txt

If I write a file to /dev/shm/, I do get multiple frame drops with the above command. It seems like a bulk of the drops always happen at the 1s mark. This is similar to your behaviour as well, which is suspicious.

Yeah, I consistently see the drops when writing to /dev/shm/ around that 1 second mark

If we delete the output file in /dev/shm/ manually before running libcamera-vid, I see no frame drops at all! This is with both 10 buffers, and the default 6 buffers. It seems the file truncation operation is surprisingly slow when the IO is flushed to the disk. @dustinkerstein could you try this and report back.

Yep! I see the same thing. If that file exists already, I will see drops. If /dev/shm is empty I do not see drops.

Another thing I noticed, your commands seem to use --width 4096 which is almost certainly not what you want. The sensor output is 4056 pixels wide, so you will be getting a small resample on the output to resize to 4096 pixels wide, and this will impact image quality. If you are doing this because of 64-byte wide alignment constraints, you probably want to use --width 4032 and add --roi=0,0,0.99409,1 to ensure the ISP crops an output window from the frame without resizing.

Yeah, I was trying to accommodate the stride requirement (incorrectly). All of my new testing has been with --width 4056.

So it seems like the net results from testing so far are:

  1. Writing to tempfs /dev/shm can cause frame drops if the output file already exists (or maybe just having other files in /dev/shm - need to test this as tempfs does some fancy memory management when it detects system ram is running low - though I do have swap disabled and I believe that is what it does when it needs to free space)
  2. The new kernel logging does catch these frame drops more accurately than libcamera-vid. Could libcamera-apps (and libcamera generally) be tweaked to be able to catch those drops directly?
  3. There is a suspicious time range around the 1 second mark that appears to be more susceptible to drops.

I'll post a few more testing results from my ISP bandwidth testing / hacking (capture 30fps 1080p but trick the ISP to process 4056x3040 frames.

naushir commented 2 years ago
  1. The new kernel logging does catch these frame drops more accurately than libcamera-vid. Could libcamera-apps (and libcamera generally) be tweaked to be able to catch those drops directly?

This would have to go into libcamera - either in the pipeline handler or the V4L2 core. @kbingham did you say you may be already doing this in the V4L2 class?

  1. There is a suspicious time range around the 1 second mark that appears to be more susceptible to drops.

I believe this is down to the the same file truncation overhead. The system is buffering up IO for approximately 1s then doing a large block write (involving truncating the file first causing the drops).

dustinkerstein commented 2 years ago

Note - This post is specific to ISP bandwidth testing / hacking (to see if the ISP can handle 4056x3040 12bit @ 30fps) and should not be looked at in context of this #157 dropped frames issue.

With the following changes to raspberrypi.cpp:

 // ISP HACK
format.size = Size(4056, 3040);

LOG(RPI, Debug) << "Setting " << stream->name() << " to "
        << format.toString();

ret = stream->dev()->setFormat(&format);
if (ret)
    return -EINVAL;

// ISP HACK
// if (format.size != cfg.size || format.fourcc != fourcc) {
//  LOG(RPI, Error)
//      << "Failed to set requested format on " << stream->name()
//      << ", returned " << format.toString();
//  return -EINVAL;
// }

I'm not entirely sure this is a valid test, as the ISP is receiving junk buffers so I'm not sure if that makes it easier (and thus invalid) to process frames compared to a true 4056x3040 frame. Log output seems to confirm CSI interface is receiving 2028x1080 frames, and the ISP is set to 4056x3040:

[11:53:58.898419666] [162540]  INFO RPI raspberrypi.cpp:677 Sensor: /base/soc/i2c0mux/i2c@1/imx477@1a - Selected sensor format: 2028x1080-SBGGR12_1X12 - Selected unicam format: 2028x1080-pBCC
[11:53:58.898618367] [162540] DEBUG RPI raspberrypi.cpp:710 Setting ISP Output0 to 4056x3040-YU12
[11:53:58.898746013] [162540] DEBUG RPI raspberrypi.cpp:778 Setting ISP Output1 (internal) to 1200x898-YU12

When running both:

LIBCAMERA_LOG_LEVELS=0 ./libcamera-vid --framerate 30 --width 1920 --height 1080 -t 10000 --awbgains 1,1 --shutter 10000 --denoise cdn_fast --codec yuv420 --save-pts ~/timestamps.txt -o /dev/null

LIBCAMERA_LOG_LEVELS=0 ./libcamera-vid --framerate 30 --width 1920 --height 1080 -t 10000 --awbgains 1,1 --shutter 10000 --denoise cdn_fast --codec yuv420 --save-pts ~/timestamps.txt

I get about 30 drops.

But when I drop the framerate to 24 (still outputting to /dev/null) I get no frame drops. That's good to see, but feels somewhat contradictory to what https://bugs.libcamera.org/show_bug.cgi?id=77 describes. But I'll ignore that for now.

When I try to write either 24 or 30 fps to /dev/shm/ I get many more drops, and the effective non-drop framerate seems to be around 17fps for me.

To me, it seems like this ISP hacked bandwidth test appears to be at least somewhat valid given the change in behavior. If we consider it valid, then it seems the ISP can handle 4056x3040 12bit @ 24fps. The issue then becomes the fact that writing to /dev/shm (even when it's empty) is still causing drops. Since this particular behavior is somewhat related to #157, I'm curious if anyone here thinks there's a way to improve writing to /dev/shm.

dustinkerstein commented 2 years ago

I believe this is down to the the same file truncation overhead. The system is buffering up IO for approximately 1s then doing a large block write (involving truncating the file first causing the drops).

Kk, so one workaround is just to delete that file before writing to it. But do you think there's any way to handle it better in the case that it does? And related to my ISP testing / hacking above, it seems there are potentially general bandwidth IO issues when trying to write 4056x3040 beyond ~17fps to /dev/shm. Do you think there's anything that could be improved in that area? Thanks again!

naushir commented 2 years ago

But when I drop the framerate to 24 (still outputting to /dev/null) I get no frame drops. That's good to see, but feels somewhat contradictory to what https://bugs.libcamera.org/show_bug.cgi?id=77 describes. But I'll ignore that for now.

If you increase core_freq= and isp_freq= to something like 550, this might be enough of an overclock to get your required 30fps. You may also have to use --denoise cdn_off, not too sure.

Kk, so one workaround is just to delete that file before writing to it. But do you think there's any way to handle it better in the case that it does? And related to my ISP testing / hacking above, it seems there are potentially general bandwidth IO issues when trying to write 4056x3040 beyond ~17fps to /dev/shm. Do you think there's anything that could be improved in that area? Thanks again!

Writing to /dev/shm seems to have more of an overhead than a standard memcpy (not surprising). We have a mode of operation where a large circular buffer is used to store the images until exit, where it is then flushed to disk. You invoke this by the --circular command line option. Currently, this circular buffer is 4Mb in size (see here). You could perhaps investigate increasing that size to something like 6Gb to store all 300 12Mpix YUV frames, and avoid any IO operations until the end of the program. Of course, to allocate and address 6Gb of memory, you have to be running in a 64-bit environment.

dustinkerstein commented 2 years ago

If you increase core_freq= and isp_freq= to something like 550, this might be enough of an overclock to get your required 30fps. You may also have to use --denoise cdn_off, not too sure.

I'm actually already testing with gpu_freq=800 (which sets ISP = 800) so I'm probably close to maxing things out.

Writing to /dev/shm seems to have more of an overhead than a standard memcpy (not surprising). We have a mode of operation where a large circular buffer is used to store the images until exit, where it is then flushed to disk. You invoke this by the --circular command line option. Currently, this circular buffer is 4Mb in size (see here). You could perhaps investigate increasing that size to something like 6Gb to store all 300 12Mpix YUV frames, and avoid any IO operations until the end of the program. Of course, to allocate and address 6Gb of memory, you have to be running in a 64-bit environment.

Now that sounds intriguing. I will test that tonight, and maybe also trying to integrate FFmpeg so I can start encoding in parallel (using FFmpeg to do the necessary tiling / cropping in software to not further tax the ISP). Will let you know how that goes. Thanks!

naushir commented 2 years ago

I'm actually already testing with gpu_freq=800 (which sets ISP = 800) so I'm probably close to maxing things out.

In that case try --denoise cdn_off and see if that helps. 800Mhz really ought to be enough for 12MP 30fps in the hardware.

davidplowman commented 2 years ago

I notice that the size_, rptr_ and wptr_ members of the circular buffer output class are unsigned ints (here). I think they can just be replaced by uint64_t and it will all work just the same. Of course you could start with < 4GB initially, just to see if it seems to work.

kbingham commented 2 years ago
  1. The new kernel logging does catch these frame drops more accurately than libcamera-vid. Could libcamera-apps (and libcamera generally) be tweaked to be able to catch those drops directly?

This would have to go into libcamera - either in the pipeline handler or the V4L2 core. @kbingham did you say you may be already doing this in the V4L2 class?

Yes, I've started adding this already.

dustinkerstein commented 2 years ago

I notice that the size_, rptr_ and wptr_ members of the circular buffer output class are unsigned ints (here). I think they can just be replaced by uint64_t and it will all work just the same. Of course you could start with < 4GB initially, just to see if it seems to work.

Just gave 6GB a shot with the following modifications:

image

And it does avoid frame drops, but seems to have trouble after caching ~2GB. When capturing 4056x3040 @ 10fps, I can only get it to write 114 frames (even though it captured > 114 frames). The following command:

LIBCAMERA_LOG_LEVELS=0 ./libcamera-vid --circular --framerate 10 --width 4056 --height 3040 --frames 114 --awbgains 1,1 --shutter 10000 --denoise cdn_fast --codec yuv420 --save-pts ~/timestamps.txt -o ~/camera.yuv

Captures 114 frames. But when setting frames>114, it still only writes 114 frames. Here's the tail end of a log when I tried capturing for 24 seconds @ 10fps:

[21:49:37.932779112] [323746] DEBUG DelayedControls delayed_controls.cpp:179 Queuing Analogue Gain to 853 at index 241
[21:49:37.932869555] [323746] DEBUG DelayedControls delayed_controls.cpp:179 Queuing Exposure to 350 at index 241
[21:49:37.932950128] [323746] DEBUG DelayedControls delayed_controls.cpp:179 Queuing Vertical Blanking to 460 at index 241
[21:49:37.933052997] [323746] DEBUG RPISTREAM rpi_stream.cpp:237 Queuing buffer 0 for ISP Stats
[21:49:37.933108959] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1593 /dev/video16[18:cap]: Queueing buffer 0
[21:49:37.933322752] [323746] DEBUG Request request.cpp:287 Request(232:C:0/1:0)
[21:49:37.933412491] [323748] DEBUG Camera camera.cpp:1126 Stopping capture
[21:49:37.944448107] [323746] DEBUG Request request.cpp:287 Request(233:X:0/1:0)
[21:49:37.944503477] [323746] DEBUG Request request.cpp:287 Request(234:X:0/1:0)
[21:49:37.944536884] [323746] DEBUG Request request.cpp:287 Request(235:X:0/1:0)
[21:49:37.944600679] [323746] DEBUG Request request.cpp:287 Request(236:X:0/1:0)
[21:49:37.944690270] [323746] DEBUG Request request.cpp:287 Request(237:X:0/1:0)
[21:49:37.944959803] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1459 /dev/video0[13:cap]: Releasing buffers
[21:49:37.945026802] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1177 /dev/video0[13:cap]: 0 buffers requested.
[21:49:37.955564704] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1459 /dev/video1[14:cap]: Releasing buffers
[21:49:37.955687739] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1177 /dev/video1[14:cap]: 0 buffers requested.
[21:49:37.955936365] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1459 /dev/video13[15:out]: Releasing buffers
[21:49:37.955974697] [323746] DEBUG V4L2 v4l2_videodevice.cpp:201 Cache misses: 240
[21:49:37.959748173] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1177 /dev/video13[15:out]: 0 buffers requested.
[21:49:37.959811116] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1459 /dev/video14[16:cap]: Releasing buffers
[21:49:37.959986428] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1177 /dev/video14[16:cap]: 0 buffers requested.
[21:49:37.963854310] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1459 /dev/video15[17:cap]: Releasing buffers
[21:49:37.963924401] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1177 /dev/video15[17:cap]: 0 buffers requested.
[21:49:37.964562465] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1459 /dev/video16[18:cap]: Releasing buffers
[21:49:37.964619760] [323746] DEBUG V4L2 v4l2_videodevice.cpp:1177 /dev/video16[18:cap]: 0 buffers requested.
NullEncoder closed
Wrote 2129264640 bytes (114 frames)

Any idea on why that might be happening?

davidplowman commented 2 years ago

You've spotted a couple more uint64_ts which is great, but I think we might need to change the return type of the Available() function to uint64_t too. I have a feeling the while loop in the outputBuffer() function will keep dragging the front of the buffer forward once that function returns a small number (modulo 2^32).

Oh, and in the destructor, the total variable will want to be uint64_t too. I don't think anything will go wrong there, but it would misreport the number of bytes written.

Obviously these are all good changes and we'll commit them to the code once it's working. We'll parametrise the buffer size too, via a command line option or something like that.

dustinkerstein commented 2 years ago

That all sounds good. I'm starting to appreciate how expensive time-wise memory allocation is (6GB seems to take ~10 seconds). Do you think there's any feasibility of allocating/copying single frame buffers on-the-fly, or does it just take too much time?

naushir commented 2 years ago

That all sounds good. I'm starting to appreciate how expensive time-wise memory allocation is (6GB seems to take ~10 seconds). Do you think there's any feasibility of allocating/copying single frame buffers on-the-fly, or does it just take too much time?

I doubt that 300x single frame allocations are going to be much quicker. But it should be easy enough for you to test this in a simple program. If you do go down this route, you also cannot use the --circular functionality already provided and will have to implement something new that provides similar functionality.

naushir commented 2 years ago

I'm not entirely sure this is a valid test, as the ISP is receiving junk buffers so I'm not sure if that makes it easier (and thus invalid) to process frames compared to a true 4056x3040 frame. Log output seems to confirm CSI interface is receiving 2028x1080 frames, and the ISP is set to 4056x3040:

One issue I see with this is that you are not modelling the Unicam -> memory write correctly. Instead of writing 4056x3040@30fps to SDRAM, we are doing 2028x1080@30fps, so there will be addition memory bandwidth that needs to be accounted for.

Did you manage to run with --denoise cdn_off to see if the ISP would hit 30fps with your overclock?

dustinkerstein commented 2 years ago

I doubt that 300x single frame allocations are going to be much quicker. But it should be easy enough for you to test this in a simple program. If you do go down this route, you also cannot use the --circular functionality already provided and will have to implement something new that provides similar functionality.

Yeah, I might give it a shot later this week. First test would be to just see how much time it takes to copy the buffer, and if it's reasonable I'll try to do a simple poc.

One issue I see with this is that you are not modelling the Unicam -> memory write correctly. Instead of writing 4056x3040@30fps to SDRAM, we are doing 2028x1080@30fps, so there will be addition memory bandwidth that needs to be accounted for.

True, but do you think that affects the ISP's resources directly, or more overall system memory bandwidth?

Did you manage to run with --denoise cdn_off to see if the ISP would hit 30fps with your overclock?

I just did and it seems to perform the same as with cdn_fast with an effective rate of ~26fps when writing to /dev/null

naushir commented 2 years ago

True, but do you think that affects the ISP's resources directly, or more overall system memory bandwidth?

It would impact system bandwidth, no the ISP pixel processing rate.

Did you manage to run with --denoise cdn_off to see if the ISP would hit 30fps with your overclock?

I just did and it seems to perform the same as with cdn_fast with an effective rate of ~26fps when writing to /dev/null

This seems to imply that memory bandwidth has become a limiting factor. Theoretically, the ISP should be able to keep up with an 800Mhz clock.

dustinkerstein commented 2 years ago

This seems to imply that memory bandwidth has become a limiting factor. Theoretically, the ISP should be able to keep up with an 800Mhz clock.

Ok, interesting. It's very close... I'll keep hacking around to see if I can figure out a suitable buffering solution.

Feel free to close this issue unless anyone feels there's anything left actionable. Thanks everyone for the help.

PS - Arducam wasn't too keen on providing a quote for a 4 lane CSI IMX477 patch given their limited resources, and the limited functionality it would have (ie. only really being suitable for small burst captures of YUV/Bayer to ramdisk). I'll try to convince them to at least give me a quote, but if anyone knows of another way to get a 4 lane CSI patch (or is interested in helping via a bounty), feel free to reach out to me directly at dustin @ panomoments.com

dustinkerstein commented 2 years ago

This seems to imply that memory bandwidth has become a limiting factor. Theoretically, the ISP should be able to keep up with an 800Mhz clock.

Ahha! It looks like the ISP clock slows down from an overclocked 850mhz -> 300mhz after around 5 seconds, and that's where the frame drops start happening. Note that I'm testing using the hack to force ISP Output0 to 4056x3040 with the sensor running at 2028x1080 (though I see this same clock slowdown without the hack - just no frame drops). If I use force_turbo=1, the ISP clock stays at 850mhz, and I don't see any drops. Ie. 4056x3040 30fps ISP output seems to work when writing to /dev/null. Has anyone seen the ISP clock slowing down like that?

ISPClockSlowdown.txt ISPClockForceTurbo.txt

Update - Just wanted to note that cpu temp was ~50'C throughout testing.

naushir commented 2 years ago

Ahha! It looks like the ISP clock slows down from an overclocked 850mhz -> 300mhz after around 5 seconds, and that's where the frame drops start happening. Note that I'm testing using the hack to force ISP Output0 to 4056x3040 with the sensor running at 2028x1080 (though I see this same clock slowdown without the hack - just no frame drops). If I use force_turbo=1, the ISP clock stays at 850mhz, and I don't see any drops. Ie. 4056x3040 30fps ISP output seems to work when writing to /dev/null. Has anyone seen the ISP clock slowing down like that?

Try using gpu_freq_min instead of gpu_freq and remove force_turbo. The gpu_freq still allows clock scaling, but gpu_freq_min sets a lower bound. Good to hear this keeps up without frame drops. Are you using --denoise cdn_off as well? I suspect you can wind down the 850Mhz to something like 550-600Mhz and still keep up.

dustinkerstein commented 2 years ago

Try using gpu_freq_min instead of gpu_freq and remove force_turbo. The gpu_freq still allows clock scaling, but gpu_freq_min sets a lower bound. Good to hear this keeps up without frame drops. Are you using --denoise cdn_off as well? I suspect you can wind down the 850Mhz to something like 550-600Mhz and still keep up.

Ok cool. I'll give that a try with cdn_off and cdn_fast (usually I'm testing with the latter as I will want to use this NR setting).

The bigger news is that I think I found the bottleneck which has been causing lots of head scratching (at least on my part). Writing to tempfs /dev/shm isn't the issue. It appears as though libcamera-vid's output itself is the bottleneck. Check out these tests and let me know what you think:

Pipe Bandwidth:

dd if=/dev/zero bs=6M count=1000 | dd of=/dev/null bs=6M
6291456000 bytes (6.3 GB, 5.9 GiB) copied, 8.75606 s, 719 MB/s

Tempfs /dev/shm Bandwidth:

dd bs=6M count=1000 if=/dev/zero of=/dev/shm/test
6291456000 bytes (6.3 GB, 5.9 GiB) copied, 8.9798 s, 701 MB/s

libcamera-vid Bandwidth (using hacked code from above to output 4096x3040 YUV frames from the ISP):

libcamera-vid --width 1920 --height 1080 --framerate 30 -t 10000 --gain 1 --awbgains 1,1 \
--shutter 1000 --denoise off --codec yuv420 -o - | dd of=/dev/null bs=6M
3698196480 bytes (3.7 GB, 3.4 GiB) copied, 11.2344 s, 329 MB/s
3698196480 bytes / 18677760 bytes per frame = 198 frames 
101 drops - 1 frame is always missing with -t parameter

libcamera-vid --width 1920 --height 1080 --framerate 18 -t 10000 --gain 1 --awbgains 1,1 \
--shutter 1000 --denoise off --codec yuv420 -o - | dd of=/dev/null bs=6M
3343319040 bytes (3.3 GB, 3.1 GiB) copied, 10.884 s, 307 MB/s
3343319040 bytes / 18677760 bytes per frame = 179 frames 
0 drops - 1 frame is always missing with -t parameter

At around 19fps, libcamera-vid starts to drop frames. It appears the max throughput of libcamera-vid seems to be around 330MB/s-350MB/s, and that's why I am dropping frames when writing to /dev/shm (or piping to FFmpeg/Gstreamer). Do you think that's something that can be investigated? Let me know if you'd like me to test anything else. Thanks!

Quick Update: --flush doesn't seem to have an effect on the above behavior based on a few quick tests.

dustinkerstein commented 2 years ago

Reading / copying those buffers seems to be the culprit. For 4056x3040 frames it appears to take ~55ms for fwrite(mem, size, 1, fp_), and I get the same time if I manually copy the buffer with memcopy. 1 / 0.055 = 18.2, which is the FPS threshold where I start seeing drops. 55ms for copying 18.7MB is only 340MB/s, which is around half of the throughput of the dd / pipe tests above. I'll keep poking around. Have a great weekend.

Update: While these are unrealistic / best-case scenario numbers, I just wanted to post a mbw memory benchmark of the Compute Module 4 (overclocked) for reference:

pi@raspberrypi:~ $ mbw 32 | grep AVG
AVG Method: MEMCPY  Elapsed: 0.01163    MiB: 32.00000   Copy: 2752.144 MiB/s
AVG Method: DUMB    Elapsed: 0.01161    MiB: 32.00000   Copy: 2756.625 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00700    MiB: 32.00000   Copy: 4568.296 MiB/s
dustinkerstein commented 2 years ago

Another quick update. I tested --circular again to see how it compared, even though for my use-case it isn't suitable due to the initial ~10 second memory allocation time. The Write method takes ~38ms for 4056x3040 frames, and is able to successfully capture at 26fps. Beyond that, it starts dropping frames. Here is the command I was testing with (along with the hacked code from above to output 4096x3040 YUV frames from the ISP):

libcamera-vid --circular 6000 --framerate 26 --width 1920 --height 1080 -t 8000 --codec yuv420 -o /dev/null

So, pre-allocating the buffers does save a little time (55ms -> 38ms), but it still is bottlenecked by the memcopy / fwrite. From some quick research, it does seem like memory copies from CMA -> User Memory is known to be slow. Do you think that's what we're hitting?

dustinkerstein commented 2 years ago

Just did a few more tests. Hopefully they provide some insight. I tried the following command: libcamera-vid --circular 6000 --framerate 30 --width 1920 --height 1080 -t 8000 --codec yuv420 -o /dev/null with varying buffers defined in libcamera_app.cpp, up to the max my Pi allowed.

15 buffers - Wrote 3922329600 bytes (210 frames), Wrote 3959685120 bytes (212 frames)
14 buffers - Wrote 3941007360 bytes (211 frames), Wrote 3922329600 bytes (210 frames)
13 buffers - Wrote 3922329600 bytes (210 frames), Wrote 3903651840 bytes (209 frames)
12 buffers - Wrote 3903651840 bytes (209 frames), Wrote 3903651840 bytes (209 frames)
10 buffers - Wrote 3847618560 bytes (206 frames), Wrote 3847618560 bytes (206 frames)
8 buffers - Wrote 3847618560 bytes (206 frames), Wrote 3847618560 bytes (206 frames)
6 buffers - Wrote 3810263040 bytes (204 frames), Wrote 3772907520 bytes (202 frames)
2 buffers - Wrote 3698196480 bytes (198 frames), Wrote 3754229760 bytes (201 frames)
1 buffer - Wrote 2932408320 bytes (157 frames), Wrote 2932408320 bytes (157 frames)

Note that as with the previous tests, this is with the ISP forced to output 4056x3040 frames. It seems as though increasing the amount of buffers matters, but has diminishing returns.

naushir commented 2 years ago

The CMA buffers have to be uncached, so that is probably there is a performance penalty compared to user memory buffers.

Regarding the 10s allocation time with --circular, it turns out that the std::vector constructor will allocate as well as zero the memory allocated if a size parameter is provided. This is why it takes ~10s. You can avoid the zeroing of the memory by not passing the size into the constructor, and calling vector.reserve(SIZE) instead. A quick test shows that this takes the same amount of time as a standard malloc(SIZE) call (60us for a 2GB buffer). However, when you come to access the memory, pages do get allocated on-the-fly. This will mean that as you fill the circular buffer, there may be slowdowns along the way. If you are lucky, the additional frame buffers may be able to take the slack and keep up without dropping the frame.

So this may come down to a choice of taking 10s up-front and ensure all the pages are allocated, or allocating the pages on-the-fly, and having enough buffering in the camera pipeline to avoid any frame drops.