motioneye-project / motioneye

A web frontend for the motion daemon.
GNU General Public License v3.0
3.99k stars 655 forks source link

Motion eye eating a lot of CPU. Running strace shows lots of close() calls with bogus file descriptors. #3008

Closed marcopaganini closed 5 months ago

marcopaganini commented 5 months ago

Hello,

I've been running motioneye for a very long time inside a docker container (v0.42). Today I decided to upgrade the packages on by Debian system (pinned to the "testing" repo). I purposely hold the important packages like the kernel, my ZFS DKMS module and all the critical stuff.

For some reason, motioneye (again, running inside docker, without changes) is now extremely slow. A restart operation takes around 5-10 minutes and the machine load (an old FX8350 octa-core) goes to 30+. Before this problem, it rarely went over 3.0 or so. After that, I see intermittent 'mlX:CAMERA_NAME' processes also taking a lot of CPU.

The interesting thing is that this happens with the meyectl process and all the mlX:CAMERA_NAME processes. They all started to consume a lot of CPU. Looking at strace in one of them, I see an almost endless stream of:

close(345102222)                        = -1 EBADF (Bad file descriptor)
close(345102221)                        = -1 EBADF (Bad file descriptor)
close(345102220)                        = -1 EBADF (Bad file descriptor)
close(345102219)                        = -1 EBADF (Bad file descriptor)
close(345102218)                        = -1 EBADF (Bad file descriptor)
close(345102217)                        = -1 EBADF (Bad file descriptor)
close(345102216)                        = -1 EBADF (Bad file descriptor)
close(345102215)                        = -1 EBADF (Bad file descriptor)

This goes on for minutes. No other messages on strace.

Also, Notice how the file descriptor is being decremented.

It's as if, somehow, some component is trying to close all descriptors by brute force and selecting a default that is way too high and this is eating all the time and CPU.

Any ideas welcome.

marcopaganini commented 5 months ago

Looking around I found that motioneye inside my docker container uses python 2.7. I checked the code and the only "suspicious" item is a call to subprocess.Popen with close_fds set to True. I also found this which could be related to the problem: https://github.com/python/cpython/issues/79938

Basically it appears that python will try to close ALL file descriptors (as I initially suspected). In my system, I get the following when attempting to fetch os.sysconf("SC_OPEN_MAX"):

$ docker run --rm python python3 -c \
            $'import os\nprint(os.sysconf("SC_OPEN_MAX"))'
1073741816

So, it's a huge number. I wonder if this is what's biting me (a mix of python 2.7 close_fds bug and a large SC_OPEN_MAX).

MichaIng commented 5 months ago

Before wasting time with an ancient motionEye version, try the current one, which uses Python 3: https://github.com/motioneye-project/motioneye?tab=readme-ov-file#installation

For Docker: https://github.com/motioneye-project/motioneye/pkgs/container/motioneye

I purposely hold the important packages like the kernel, my ZFS DKMS module and all the critical stuff.

This might actually be the issue: There is no point to not upgrade the kernel, when you do upgrade everything else. Especially on Debian testing, packages go through breaking changes, which can include new kernel features/configs. Especially Docker makes vast use of various kernel features. And also, new kernel versions fix security vulnerabilities. Always upgrade your kernel along with other packages, do only set packages on hold, when you know that they cause issues, and regularly check back whether the issues have been fixed, to unhold them again. If you do not want to regularly upgrade you system, use Debian stable instead.

Of course, have a backup at hand, in case a kernel upgrade really causes serious breakage. However, I am running two x86_64 Debian production server systems on edge testing, and kernel upgrades literally never caused any issues. Docker upgrades on the other hand did a few times.

marcopaganini commented 5 months ago

Thanks @MichaIng .

Yesterday, I cloned the repo and built a docker image and now everything works perfectly!

What confused me is that my version seems to match the latest released version on the github page, so I thought it would be pointless to rebuild from head since head was the latest already. I see now that I'm wrong.

Which version is head tracking? Or more specifically, in which version was python2 deprecated in favor of python3?

This might actually be the issue: There is no point to not upgrade the kernel, when you do upgrade everything else. Especially on Debian testing, packages go through breaking changes, which can include new kernel features/configs. Especially Docker makes vast use of various kernel features. And also, new kernel versions fix security vulnerabilities. Always upgrade your kernel along with other packages, do only set packages on hold, when you know that they cause issues, and regularly check back whether the issues have been fixed, to unhold them again. If you do not want to regularly upgrade you system, use Debian stable instead.

My guess is that this was caused by a libc upgrade, and reporting way more files. Python 2 subprocess.open apparently deals with it in a dumb way. Fortunately they fixed it in Python 3.

Thanks for the docker link! This is exactly what I was looking for!

MichaIng commented 5 months ago

0.42.1 is the (last) Python 2 version, 0.43.0 is the first Python 3 version. Not sure how you are tracking it, but we did not do a "stable" release yet, just a pre-release, maybe this is why you missed it. And regarding the Docker container, similarly, we did not push containers to Docker Hub yet.

There are still many known bugs, but since the world (foreign APIs etc) changes, the old "stable" motionEye is far more buggy than the new "beta" one. Maybe we should not try to make it perfect, but release it with a list of known issues attached to release, being an enhancement in any case. So everyone recognises it on all channels 🤔.

My guess is that this was caused by a libc upgrade, and reporting way more files. Python 2 subprocess.open apparently deals with it in a dumb way. Fortunately they fixed it in Python 3.

A container however has its own libc, so not sure how a host libc upgrade can have an effect. However, something seems to have directly or indirectly an effect. Great that the new motionEye version solved it 👍.