Closed DanielRaapDev closed 4 months ago
I set the environment variable DEBUG to get more output. We will see tomorrow...
Tonight the update of the mirror was successful.
I wonder if a parallel read request locks the file currently being written by CveCommand and breaks it 🤔 We need to catch the exception when it occurs...
The problem has now occurred again. I see in the docker logs that the mirror script ran twice! It started at 00:00:01 and again at 00:01:06. That's why I also previously saw this warning:
2024-07-31 00:01:04,598 WARN No file matches via include "/etc/supervisor/conf.d/*.conf"
Unlinking stale socket /dev/shm/supervisor.sock
The second run hit the problem "Unable to read cached data: /usr/local/apache2/htdocs/nvdcve-2024.json.gz" at 00:01:22. The file was last modified at 00:01. So the parallel run might be trying to read the file while it is still being written.
Now, after reviewing the log files, I see that this also happened on the 28th, when the cache last broke, and on some days before that. On those days there is also this line at the end:
WARN exited: init_nvd_cache (exit status 1; not expected)
I wonder why the process runs twice only on some days 🤔 Is it a race condition between supervisord and cron? It looks like the additional run is an init_nvd_cache. I guess it is intended to be used on the first run, when the mirror is empty. I don't know how supervisord works or why it starts at exactly that time. Maybe because it detects that the command it manages itself (mirror.sh) is being run by cron?
The real problem is that vulnz accesses the files without a lock and writes them in place, instead of replacing them with a fully written file.
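To illustrate the locking idea, here is a minimal sketch of an advisory lock that makes a second mirror run bail out instead of overlapping with the first. This is not how vulnz or mirror.sh work today; `single_run_lock`, `AlreadyRunning` and the lock path are hypothetical names, and `fcntl.flock` is Unix-only. It only prevents overlapping runs and would not help if the writer is killed halfway through.

```python
import fcntl
import os
from contextlib import contextmanager


class AlreadyRunning(RuntimeError):
    """Raised when another run already holds the lock (hypothetical helper)."""


@contextmanager
def single_run_lock(lock_path):
    # Open (or create) the lock file; the lock is tied to this descriptor.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise AlreadyRunning(lock_path) from exc
        yield
    finally:
        # Closing the descriptor also releases the lock.
        os.close(fd)
```

A wrapper around the update could then do `with single_run_lock("/tmp/mirror.lock"): run_update()` (both names assumed) and treat `AlreadyRunning` as "skip this run".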
I dug into the Kubernetes logs. The pod/container was restarted at that time. The description is:
State: Running
  Started: Wed, 31 Jul 2024 02:01:04 +0200
Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137
  Started: Tue, 30 Jul 2024 19:48:18 +0200
  Finished: Wed, 31 Jul 2024 02:01:03 +0200
The container does not have enough memory. We limit it to 2 4 GB. Because the process is killed for running out of memory, the files end up corrupted.
So this is a memory issue. We've already done a lot to improve this - not sure how much more we can do.
But maybe you can improve the Dockerfile by removing the fixed JAVA_OPT=-Xmx2g. Java 17 has container support, so dynamic memory allocation for the container can be used.
The VM now provides automatic container detection support, which allows the VM to determine the amount of memory and number of processors that are available to a Java process running in docker containers.
Edit:
Without a special JAVA_OPT I got an OutOfMemoryError. So I set the same options we use for all our Java containers:
JAVA_OPT=-XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=80.0
As far as I know, nist-data-mirror downloaded data to a temp directory and only then copied it to the Apache directory, probably to avoid such issues. Shouldn't the vulnz Docker image have the same logic? This is not only about OOM: the vulnz process may crash for other reasons too, which could also leave the NVD cache files broken.
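A minimal sketch of that "write elsewhere, then swap in" pattern, assuming the temp file lives in the same directory as the target so the final rename stays on one filesystem. `write_atomically` is a hypothetical helper, not something vulnz or nist-data-mirror provide. The point is that readers see either the old file or the complete new one, even if the writer is killed partway.

```python
import os
import tempfile


def write_atomically(target_path, data):
    """Write bytes to a temp file, then rename it over target_path."""
    target_dir = os.path.dirname(os.path.abspath(target_path))
    # Same directory means same filesystem, so os.replace is a plain rename.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        # mkstemp creates the file as 0600; a web server usually needs 0644.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target_path)
    except BaseException:
        # Drop the partial temp file; the previous target stays untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process dies before `os.replace`, the worst case is a leftover `.tmp` file next to an intact cache file.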
This is a follow-up of https://github.com/jeremylong/DependencyCheck/issues/5798
We have been seeing this exception recently, approximately since the beginning of July. We use a local mirror in a Kubernetes cluster running the open-vulnerability-data-mirror image (currently v6.1.7).
Our findings so far:
I tried running the update manually via /mirror.sh while in an inconsistent state. It failed directly in the container reading its local files. So I guess somehow the update running in the container destroys some files so they are no longer usable. These corrupt files are then propagated to the users of the mirror.
Observations from this weekend: One night mirror.sh was successful in updating the files. The second night it somehow got interrupted and only three files were updated. When running the script directly, it shows that the last file, nvdcve-2023.json.gz, is corrupt. That is also the content of last night's /var/log/cron_mirror.log.
Full stacktrace: mirror-exception-zlibstream.txt
So I don't know how the file gets corrupted. But it would be great if vulnz would just delete the file and download the current version in the cve command.
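As a rough illustration of the "delete the corrupt file" idea, the check could look something like this. It is only a sketch of the behaviour I am asking for, not what vulnz does; `is_readable_gzip` and `drop_if_corrupt` are hypothetical names, and the re-download step is left out.

```python
import gzip
import os
import zlib


def is_readable_gzip(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end so truncation and CRC errors surface.
            while stream.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # BadGzipFile is an OSError; truncated files raise EOFError.
        return False


def drop_if_corrupt(path):
    """Delete path if it exists but is not a readable gzip file."""
    if os.path.exists(path) and not is_readable_gzip(path):
        os.remove(path)
        return True
    return False
```

Running `drop_if_corrupt` over each `nvdcve-*.json.gz` before an update would turn a broken cache file into a missing one, which the next download can then fill.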