robotastic / trunk-recorder

Records calls from a Trunked Radio System (P25 & SmartNet)
GNU General Public License v3.0

shmget (2): No space left on device workaround solution #880

Closed · aegershman closed this issue 7 months ago

aegershman commented 7 months ago

Hi, I recently ran into an error shmget (2): No space left on device and wanted to share my solution for searchability.

I run 3 RTL-SDRs and 3 Airspy R2s with trunk-recorder on Docker via docker-compose. When I start the container, I get this error due to running out of shared memory handles:

...
trunk-recorder  | gr::vmcircbuf :error: shmget (2): No space left on device
trunk-recorder  | buffer_double_mapped :error: gr::buffer::allocate_buffer: failed to allocate buffer of size 64 KB
trunk-recorder  | terminate reached from thread id: 7f3c1f7a8dc0Got std::bad_alloc
...

I found a solution in this Stack Overflow thread; the gist is that we need to bump the maximum number of shared memory handles:

sudo sysctl kernel.shmmni=32000
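
(If you're running trunk-recorder directly on the host instead of in Docker, I believe the usual way to make that survive a reboot is a sysctl drop-in, something along these lines:)

# Persist the higher limit across reboots (host installs only; the container workaround is below)
echo "kernel.shmmni=32000" | sudo tee /etc/sysctl.d/99-shmmni.conf
sudo sysctl --system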

But since I'm running this on Docker, I need a way to bump this inside the container. Here's my workaround for docker-compose: we override the image's CMD by specifying command in docker-compose. It's the same start command, just prefixed with sudo sysctl kernel.shmmni=32000:

services:
  trunkrecorder:
    image: robotastic/trunk-recorder:4.6.0
    privileged: true
    command: bash -c "sudo sysctl kernel.shmmni=32000 && trunk-recorder --config=/app/config.json"
...
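
(Untested alternative I haven't tried myself: kernel.shmmni is one of the IPC-namespaced sysctls Docker is supposed to allow, so I think the sysctls key in compose could set it without overriding the command, roughly like the snippet below, plus whatever else your setup needs for the SDRs.)

services:
  trunkrecorder:
    image: robotastic/trunk-recorder:4.6.0
    sysctls:
      kernel.shmmni: 32000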

(Note: I tried setting shm_size: 2gb as a param in my docker-compose, but it didn't help. I don't believe this is a Docker daemon issue so much as the Ubuntu base image having its kernel.shmmni default of 4096. You can check by ssh'ing into the container and executing cat /proc/sys/kernel/shmmni. The workaround works because we're setting the kernel parameter within the container itself.)
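
(A quick way to check from the host, assuming the service is named trunkrecorder as in the compose file above:)

# Should print 32000 once the override is in place (4096 otherwise)
docker compose exec trunkrecorder cat /proc/sys/kernel/shmmni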

In any case, bumping the kernel.shmmni is working fine for me for the time being.

I hope this helps someone scrolling by. Feel free to close this issue, or I will close it in the next day or so. Just wanted to put this out there to be indexed in case it helps anyone else.

taclane commented 7 months ago

I believe another viable workaround is to set the tempDir config option to the same as captureDir.

A little while back, t-r started recording temporary files to the /dev/shm ram drive, but it seems like there are certain system configurations where it either isn't available or isn't large enough to buffer the recordings before they're moved off to a physical disk or uploaded.
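
Roughly, that means something like this in config.json (the paths here are just examples, match them to your own setup):

  "captureDir": "/app/media",
  "tempDir": "/app/media",

That should keep the temporary files on the same filesystem as the finished recordings instead of in /dev/shm.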

aegershman commented 7 months ago

TL;DR, this is due to the maximum number of shared memory handles allowed, shmmni. From what I've read, this number is often tuned on servers running databases; this IBM guide for tuning IBM Db2 instances gives a minimum enforced value for shmmni of 256 * <size of RAM in GB>. Point being, changing shmmni doesn't seem particularly uncommon. Manually setting this value in my Docker start command is working fine, but it wouldn't be crazy to include it in the Dockerfile if others run into this. Totally fine doing it manually for now. It doesn't seem like many others are reporting this as a problem, so I think we're good ;)
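
(To put a number on that rule of thumb, my own arithmetic rather than anything from the guide: a 32 GB box works out to 256 * 32 = 8192, already double the 4096 default. Something like this should estimate it from /proc/meminfo if you want to sanity-check your own machine:)

# Rough floor per the 256 * RAM_GB guideline; MemTotal is reported in kB
awk '/MemTotal/ { printf "suggested kernel.shmmni >= %d\n", 256 * int($2 / 1048576 + 0.5) }' /proc/meminfo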

@taclane so check this out. I tried setting tempDir to the same as captureDir, and it continued to crash. I don't think this is related to storage capacity; it's about the maximum number of system-wide shared memory handles.

Here's my testing. I ssh'd onto the container, ran df -h to check drive sizes:

root@43c8769c1a17:/app# df -h
Filesystem                         Size  Used Avail Use% Mounted on
overlay                            1.8T  1.2T  576G  68% /
tmpfs                               64M     0   64M   0% /dev
shm                                 64M  416K   64M   1% /dev/shm
/dev/mapper/ubuntu--vg-ubuntu--lv  1.8T  1.2T  576G  68% /app/media
tmpfs                              9.5G  4.2M  9.5G   1% /run/dbus
root@43c8769c1a17:/app#

Note that shm has a capacity of 64M and is only using 416K, but it's still crashing.

So let's check the number of shared memory handles currently in use. First, I got the container running by setting command: bash -c "sudo sysctl kernel.shmmni=32000 && trunk-recorder --config=/app/config.json". Then we ssh onto the container and run ipcs -m to see the in-use shared memory segments:

>ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 98305      root       700        65536      2          dest
0x00000000 98306      root       400        4096       2          dest
0x00000000 98308      root       700        65536      2          dest
0x00000000 98309      root       400        4096       2          dest
0x00000000 98311      root       700        65536      2          dest
...
...

I truncated the rows, but it's a lot. We can count the rows of shared memory segments using this command:

# The `tail -n +4` trims off the column header lines
>ipcs -m | tail -n +4 | wc -l
9841

We see that on my system, running full-bore with 3 RTL-SDRs and 3 Airspy R2s, I have 9841 open shared memory segments. If I change my container start command to command: bash -c "sudo sysctl kernel.shmmni=9840 && trunk-recorder --config=/app/config.json", it crashes. But command: bash -c "sudo sysctl kernel.shmmni=9841 && trunk-recorder --config=/app/config.json"? It works just fine.

Another piece of confirmation: using the default start command (as in, not adding any sysctl kernel.shmmni=xyz), the default value of kernel.shmmni is 4096:

>cat /proc/sys/kernel/shmmni
4096

Using the default start command, if I start up the container, ssh onto it quickly before it crashes, and execute the following command...

>watch -n 0.1 'ipcs -m | tail -n +4 | wc -l'
Every 0.1s: ipcs -m | tail -n +4 | wc -l                                                          92203588b2c3: Fri Dec  1 18:51:04 2023

4095
# exit due to container crashing

We're watching the list of active shared memory handles with ipcs -m every 0.1 seconds (as low as watch will go), and right before the container crashes, the number of handles rockets up to ~4096 and the process exits. (I caught it at 4095, likely because the 0.1s watch interval didn't catch the very last handle before the crash.) The default limit is 4096, so this checks out.

Hoping this makes sense, let me know if something seems amiss?