Deprecation of systemd::service_limits unexpectedly causes service restarts in some scenarios

tedgarb commented 1 month ago

Affected Puppet, Ruby, OS and module versions/distributions

Puppet: 7.27.0
Ruby: bundled
Distribution: RHEL8
Module version: 7.0.0

How to reproduce (e.g Puppet code you use)

The simplest example I have is this, from a profile I use to manage oracle database services for our DBA team:

# Create oracle service files
  systemd::unit_file { 'oracle.service':
    content => epp('puppet_dba/oracle.service.epp', {
        disable_host_conditional => $disable_service_host_conditional
    }),
    enable  => true,
  }
  systemd::service_limits { 'oracle.service':
    limits          => {
      'LimitNOFILE'  => '4096:65536',
      'LimitNPROC'   => '2047:16348',
      'LimitMEMLOCK' => '64G',
      'LimitSTACK'   => '10240K',
    },
    restart_service => false,
  }

What are you seeing

When I set up an environment that bumps this module from 6.6.0 to 7.0.0, the file /etc/systemd/system/oracle.service.d/90-limits.conf is reordered, seemingly due to the switch from an erb template being processed and then shoved into a systemd::dropin_file to being sent to a systemd::manage_dropin. This on its own is annoying, but not a critical issue. However, because manifests/service_limits.pp sets notify_service => true inside the systemd::manage_dropin with no mechanism to override it, the service would be restarted if I allowed this to apply.

What behaviour did you expect instead

I would expect that the service would not be restarted due to a module version upgrade.

Output log

(there are several other affected units that are the same structure as the above snippet):

[root@dba-db-d06 ~]# puppet agent --test --environment=control_egarbade_systemd --noop
Info: Using environment 'control_egarbade_systemd'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Applying configuration version '1716315919'
Notice: /Stage[main]/Puppet_dba::Profile::Oracle::Base_server/Systemd::Service_limits[oracle.service]/Systemd::Manage_dropin[oracle.service-90-limits.conf]/Systemd::Dropin_file[oracle.service-90-limits.conf]/File[/etc/systemd/system/oracle.service.d/90-limits.conf]/content:
--- /etc/systemd/system/oracle.service.d/90-limits.conf 2023-10-24 09:53:02.574824879 -0400
+++ /tmp/puppet-file20240521-3598797-oo4lf0 2024-05-21 14:25:38.884260476 -0400
@@ -1,6 +1,8 @@
-# This file managed by Puppet - DO NOT EDIT
+# Deployed with puppet
+#
+
 [Service]
-LimitMEMLOCK=64G
 LimitNOFILE=4096:65536
 LimitNPROC=2047:16348
+LimitMEMLOCK=64G
 LimitSTACK=10240K

Notice: /Stage[main]/Puppet_dba::Profile::Oracle::Base_server/Systemd::Service_limits[oracle.service]/Systemd::Manage_dropin[oracle.service-90-limits.conf]/Systemd::Dropin_file[oracle.service-90-limits.conf]/File[/etc/systemd/system/oracle.service.d/90-limits.conf]/content: current_value '{sha256}21be4811335038c161f4118b2da22e7c6ab67fc1ef079a7d5dcb49a0c5c1651b', should be '{sha256}0e9fdf5fd6e73f4a43f00096c7c6728537314954a8586247659b96dbce516ee3' (noop)
Info: /Stage[main]/Puppet_dba::Profile::Oracle::Base_server/Systemd::Service_limits[oracle.service]/Systemd::Manage_dropin[oracle.service-90-limits.conf]/Systemd::Dropin_file[oracle.service-90-limits.conf]/File[/etc/systemd/system/oracle.service.d/90-limits.conf]: Scheduling refresh of Service[oracle.service]
Info: /Stage[main]/Puppet_dba::Profile::Oracle::Base_server/Systemd::Service_limits[oracle.service]/Systemd::Manage_dropin[oracle.service-90-limits.conf]/Systemd::Dropin_file[oracle.service-90-limits.conf]/File[/etc/systemd/system/oracle.service.d/90-limits.conf]: Scheduling refresh of Systemd::Daemon_reload[oracle.service]
Notice: Systemd::Daemon_reload[oracle.service]: Would have triggered 'refresh' from 1 event
Info: Systemd::Daemon_reload[oracle.service]: Scheduling refresh of Exec[systemd-oracle.service-systemctl-daemon-reload]
Notice: /Stage[main]/Puppet_dba::Profile::Oracle::Base_server/Systemd::Unit_file[oracle.service]/Systemd::Daemon_reload[oracle.service]/Exec[systemd-oracle.service-systemctl-daemon-reload]: Would have triggered 'refresh' from 1 event
Notice: Systemd::Daemon_reload[oracle.service]: Would have triggered 'refresh' from 1 event
Info: Systemd::Daemon_reload[oracle.service]: Scheduling refresh of Service[oracle.service]
Info: Systemd::Daemon_reload[oracle.service]: Scheduling refresh of Service[oracle.service]
Notice: /Stage[main]/Puppet_dba::Profile::Oracle::Base_server/Systemd::Unit_file[oracle.service]/Service[oracle.service]: Would have triggered 'refresh' from 3 events
Notice: Systemd::Unit_file[oracle.service]: Would have triggered 'refresh' from 2 events
Notice: Systemd::Dropin_file[oracle.service-90-limits.conf]: Would have triggered 'refresh' from 1 event
Notice: Systemd::Manage_dropin[oracle.service-90-limits.conf]: Would have triggered 'refresh' from 1 event
Notice: Systemd::Service_limits[oracle.service]: Would have triggered 'refresh' from 1 event
Notice: Class[Puppet_dba::Profile::Oracle::Base_server]: Would have triggered 'refresh' from 2 events
[...]
Notice: Stage[main]: Would have triggered 'refresh' from 2 events
Notice: Applied catalog in 16.90 seconds
[root@dba-db-d06 ~]#

Any additional information you'd like to impart

In an attempt to front-run the deprecation warning, I attempted simply proactively replacing the systemd::service_limits declaration with a systemd::manage_dropin one, as the deprecation warning suggests, and manually changing notify_service => false, but #406 means this does not work without also setting daemon_reload => false, which is not an optimal solution as it leads to the need to manually run daemon_reload on these hosts if the file changes in the future.

Specifically, this causes a restart:

systemd::manage_dropin { 'oracle.service-90-limits.conf':
    ensure         => present,
    unit           => 'oracle.service',
    filename       => '90-limits.conf',
    service_entry  => {
      'LimitNOFILE'  => '4096:65536',
      'LimitNPROC'   => '2047:16348',
      'LimitMEMLOCK' => '64G',
      'LimitSTACK'   => '10240K',
    },
    notify_service => false,
  }

And this does not:

systemd::manage_dropin { 'oracle.service-90-limits.conf':
    ensure         => present,
    unit           => 'oracle.service',
    filename       => '90-limits.conf',
    service_entry  => {
      'LimitNOFILE'  => '4096:65536',
      'LimitNPROC'   => '2047:16348',
      'LimitMEMLOCK' => '64G',
      'LimitSTACK'   => '10240K',
    },
    notify_service => false,
    daemon_reload  => false,
  }

traylenator commented 1 month ago

What I never understood was why in v6.6.0

https://github.com/voxpupuli/puppet-systemd/blob/v6.6.0/manifests/service_limits.pp

this had a hardcoded notify_service => true,

v7 copied the behaviour of v6.6.0.

ekohl commented 1 month ago

Looks like you can blame me for https://github.com/voxpupuli/puppet-systemd/commit/8a9e03a4d6a58a5a5083d7ead60dcbc87f727d2f. It's been present since version 4.0.0.

tedgarb commented 1 month ago

I'm not sure if it is inherently bad behavior, especially given the bit in #406 that would have meant making it toggle-able would not have had the expected effect. Perhaps something to parameterize in 7.x. I'm happy to work on a PR to make it adjustable, assuming #406 gets processed in some form.

Changing it to default to false would nominally fix the problem in my original issue, but would be a new change in behavior that seems ill-advised outside a major version.

ekohl commented 1 month ago

It feels to me that this is a bit of an edge case. The service limits are deprecated so I'm not sure if we shouldn't just encourage users to migrate away from it, like you did.

tedgarb commented 1 month ago

While I agree the restart parameter discussion is somewhat moot given the deprecation, I do not think this should derail the higher order problem, which is that 7.0.0, in addition to introducing a deprecation warning, causes in situations like the above services to restart unexpectedly, with no way to preemptively detect this short of auditing and validating your entire site's configuration tree and proactively modifying the declarations. Since this issue seems to be caused most directly by the switch inside 988d2c7 from systemd::dropin_file to systemd::manage_dropin when providing limits via the limits parameter, I would request that narrow part of the broader commit be reverted, in the absence of a more natural way to prevent unexpected restarts. I don't see an obvious way to change the behavior of systemd::manage_dropin to prevent restarts for people using that already, so the obvious path to me is to switch back to the 6.6.0 method for generating the dropin file within systemd::service_limits. I understand that this would involve either reinstating the erb template or writing a new one in epp, but it would be a lot easier for me and I am sure other users if there was a path from 6.6.0 to 7.x that allowed us to simply catch and post-fix the deprecation alerts, rather than having to preemptively analyze our entire sites.

tedgarb commented 1 month ago

Thinking about this a bit further, I do not actually think telling users to switch from service_limits to dropin_file is really an equivalent experience, because the parameterized hash that service_limits provides plays nicely with hiera and the hierarchical nature of parameters, and dropin_file really just wants string content. I think what I will be doing personally is waiting for #406 to get released, and then forcing my puppet userbase to migrate their service_limits to manage_dropin before I switch the module we offer to 7.x, giving them the option to control if and when their services restart.

The only thing I could see to ease the transition and provide obvious before/after guidance would be to make service_limits::restart_service propagate, but I see some obvious potential risks in making the false value change behavior within 7.x for some use cases, so I'm not going to actually say that should be considered.

I would like to humbly request a more explicit warning be added to the 7.x release notes warning that users of service_limits may see units restart as a result of upgrading the systemd module, and possibly some better notes based on this ticket about what can be done to prevent it, especially once preventing it does not require disabling daemon_reload as it does in 7.0.0

tedgarb commented 1 month ago

With 406 merged and the readme providing more explicit instructions on how to prep for this, the about-to-be-released 7.1.0 version would have been enough for me to avoid having to open this issue, so I'm closing it out. Thanks for listening to me ramble :)

voxpupuli / puppet-systemd