prometheus-community / node-exporter-textfile-collector-scripts

Scripts for node-exporter's textfile collector
Apache License 2.0
505 stars 188 forks source link

do not run apt update or simulate apt dist-upgrade #181

Closed anarcat closed 1 year ago

anarcat commented 1 year ago

This is causing all sorts of problems. The first of which is that we're hitting our poor mirrors every time the script is ran, which, in the Debian package configuration, is every 15 minutes (!!).

The second is that this locks the cache and makes this script needlessly stumble upon a possible regression in APT from Debian bookworm and Ubuntu 22.06:

https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851

That still has to be confirmed: it's possible that apt update can hang for a long time, but that shouldn't concern us if we delegate this work out of band.

I also do not believe actually performing the dist-upgrade calculations is necessary to compute the pending upgrades at all. I've done work with python-apt for other projects and haven't found that to be required: the cache has the necessary information about pending upgrades.

Closes: #179

anarcat commented 1 year ago

additional data point, output is similar before/after here:

root@rdsys-test-01:~# /usr/share/prometheus-node-exporter-collectors/apt_info.py
# HELP apt_upgrades_pending Apt packages pending updates by origin.
# TYPE apt_upgrades_pending gauge
apt_upgrades_pending{origin="",arch=""} 0
# HELP apt_upgrades_held Apt packages pending updates but held back.
# TYPE apt_upgrades_held gauge
apt_upgrades_held{origin="",arch=""} 0
# HELP apt_autoremove_pending Apt packages pending autoremoval.
# TYPE apt_autoremove_pending gauge
apt_autoremove_pending 2
# HELP node_reboot_required Node reboot is required for software updates.
# TYPE node_reboot_required gauge
node_reboot_required 0
root@rdsys-test-01:~# time python3 ./apt_info.py.orig
# HELP apt_upgrades_pending Apt packages pending updates by origin.
# TYPE apt_upgrades_pending gauge
apt_upgrades_pending{origin="",arch=""} 0
# HELP apt_upgrades_held Apt packages pending updates but held back.
# TYPE apt_upgrades_held gauge
apt_upgrades_held{origin="",arch=""} 0
# HELP apt_autoremove_pending Apt packages pending autoremoval.
# TYPE apt_autoremove_pending gauge
apt_autoremove_pending 2
# HELP node_reboot_required Node reboot is required for software updates.
# TYPE node_reboot_required gauge
node_reboot_required 0

real    0m1.841s
user    0m1.066s
sys 0m0.179s
root@rdsys-test-01:~# time /usr/share/prometheus-node-exporter-collectors/apt_info.py
# HELP apt_upgrades_pending Apt packages pending updates by origin.
# TYPE apt_upgrades_pending gauge
apt_upgrades_pending{origin="",arch=""} 0
# HELP apt_upgrades_held Apt packages pending updates but held back.
# TYPE apt_upgrades_held gauge
apt_upgrades_held{origin="",arch=""} 0
# HELP apt_autoremove_pending Apt packages pending autoremoval.
# TYPE apt_autoremove_pending gauge
apt_autoremove_pending 2
# HELP node_reboot_required Node reboot is required for software updates.
# TYPE node_reboot_required gauge
node_reboot_required 0

real    0m0.458s
user    0m0.424s
sys 0m0.033s

the script also runs almost 4 times faster, naturally...

of course, the above is a rather dumb example, with no pending upgrades... i'll check fleet wide if there are changes shortly.

update: well. we don't have any pending upgrades right now, so that's zero fleet-wide, but i'm pretty confident this will Just Work, especially given jak's comment in the Debian bug tracker: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028212#62

dswarbrick commented 1 year ago

I was never a fan of this script updating the cache, and suggested on at least one occasion that we'd be better off running the script as an unprivileged user by default, just so that it wouldn't try to update the cache.

OTOH, there will be people who run this script without asynchronous updating of the cache (via e.g. apticron or unattended-upgrades). I'd sorta prefer to put this cache-updating code behind a command-line switch (disabled by default), for such cases. And to address the issue of it hitting the repo servers too frequently, how about something like taking modulo 86400 of the current Unix timestamp, to ensure it only hits the repos once per day, regardless of how often it is run?

anarcat commented 1 year ago

On 2023-10-13 09:56:29, Daniel Swarbrick wrote:

OTOH, there will be people who run this script without asynchronous updating of the cache (via e.g. apticron or unattended-upgrades). I'd sorta prefer to put this cache-updating code behind a command-line switch (disabled by default), for such cases.

The problem with this is that it's a policy decision that's largely disconnected from when you want to report on the actual state of the system.

In the Debian package, for example, this script runs every fifteen minutes. If you run apt-update every fifteen minutes on all your machines, you are going to hurt the mirrors, and you're being a bad citizen. You shouldn't do that.

We were doing this, but we didn't even know we were doing this. We were just running this small metric here, surely that causes no harm?

I really don't think this should run apt-update. It's slow, and it's hurting the mirrors, it's just a plain bad idea.

If people need a job to update the cache, they should make a job to update the cache, i don't get why it would be the exporter's job to do so.

The previous shell implementation of this, by the way, did not run apt update either, so I consider this a serious regression of the apt update rewrite.

I'm not completely closed to rewriting this to have a commandline flag to enable apt update, but I think it's overkill, and I am not particularly interested in working on this myself.

Finally, keep in mind this patch is aimed for inclusion in a Debian stable release. The more complex we make it, the more trouble it will be to ship there (and elsewhere).

dswarbrick commented 1 year ago

I don't see that it would be too complex to simply ensure that the cache update is not performed unless explicitly opted into (for those who need it). By making the flag disabled by default, no extra changes to the .service would be needed. It would merely be a behaviour change (albeit one that could be reverted, unlike stripping out the functionality entirely).

I suggested such a flag several months ago: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028212#10

This repo has no tags, so is effectively a rolling release. Therefore what Debian stable or any other distro which packages these scripts does is up to that distro.

anarcat commented 1 year ago

On 2023-10-13 11:31:43, Daniel Swarbrick wrote:

[...]

This repo has no tags, so is effectively a rolling release. Therefore what Debian stable or any other distro which packages these scripts does is up to that distro.

I am a Debian developer. I am providing this very patch as a hotfix for this problem in Debian specifically. If you want to make it more elaborate, I guess it's your call, but I would suggest taking the patch as is and adding the extra feature on top after, that way we can get moving here... :)

Are we in agreement that, by default, we shouldn't run apt-get update here? This is what this patch does...

anarcat commented 1 year ago

Also, I did a bit of research and the normal way of ensuring the package list is up to date on APT systems is to enable the APT::Periodic::Update-Package-Lists setting in apt.conf. That's as simple as:

echo 'APT::Periodic::Update-Package-Lists "1";' > /etc/apt/apt.conf.d/99_auto_apt_update.conf

I hardly see why this specific script should do that job instead...

anarcat commented 1 year ago

I think one key concept that I might not have mentioned here is that this script reports on two different tasks: updating the package list and actually upgrading pending packages. The latter can be done pretty much any time an admin passes by a server or unattended-upgrades runs, and it's worth running often to check for that. The former, however, is needless to run repeatedly because mirrors themselves (heck, ftp-master.debian.org even) do not upgrade that often (dinstall runs every 15 minutes if my memory is correct).

So we really shouldn't do any action here: just like this script doesn't (and shouldn't!) perform actual upgrades, I don't think it should update the package list either. I really like the idea of having the script deliberately readonly, and I think it's going to make everything simpler if we keep it like that.

But again, if you want to take this further, feel free... I just don't want to get bogged down in those implementation details right now.

SuperQ commented 1 year ago

Yes, I would prefer users update the apt-cache on their own schedule and this script does not. It's much better to have the package cache run async from the script and avoid side effects.

Even behind a flag, I don't think it's good to include it. There are lots of better ways to deal with updating the apt package cache.

anarcat commented 1 year ago

Would you mind adding the apt config notes as a documentation comment to the top of the script?

done.

anarcat commented 1 year ago

thanks for the review and (tacit?) approval, @dswarbrick :) i integrated your changes in the commit with credit in the commitlog.

anarcat commented 1 year ago

i think this is now ready to merge, @SuperQ can you push the magic button? :)