sensu / sensu-go-chef

Chef Library Cookbook for Sensu Go
https://sensu.io
MIT License
11 stars 22 forks source link

Restart agent on config change to load new changes. #87

Closed kovukono closed 4 years ago

kovukono commented 4 years ago

Pull Request Checklist

No, this does not address an open issue.

General

New Features

Purpose

Sensu-Agent does not restart after a configuration change, meaning those changes don't get applied. This will restart Sensu-Agent after a change to the configuration file.

Known Compatibility Issues

derekgroh commented 4 years ago

Can you define the use case a bit more?

The logic doesn't seem to follow the purpose

The action :install will start the agent after creating the agent.yml file and then the notifies will have the service restart at the end of the run.

kovukono commented 4 years ago

That was one thing I was hesitant about. It actually isn't waiting until the end of the run, just the end of the resource (after the enable and start). I didn't want to restart the service every run, just if the file had changed, but I did want to leave the original logic that the service was always ensured to be started. I couldn't find a cleaner way of it other than this. Any ideas if it could be made smoother?

derekgroh commented 4 years ago

If your use case is updating or modifying the agent.yml a new action such as :configure may be the best way to approach this.

From there you can use a file or template resource to handle the config file and have logic to restart the service if a change is detected.

kovukono commented 4 years ago

So you're suggesting to move the template rendering out of the install action, and use a joint series of actions such as [:install, :configure] for managing agents? Install would probably still handle the enabling and start of the service, correct?

derekgroh commented 4 years ago

The :install action will require a agent.yml file on disk to complete successfully, but it could just as easily call the :configure action to complete that if that is how you would like to approach this feature request.

kovukono commented 4 years ago

I really like this approach. I'll take a look at it tomorrow.

kovukono commented 4 years ago

I tried to determine the action to take based on whether or not the configuration was updated, but both the resource's updated? and updated_by_last_action? aren't accurately reflecting whether or not the configuration file was updated, even if Chef marks it as such. Without being able to determine that, I'm not sure how to check whether or not the service needs to be restarted aside from the notifies.

majormoses commented 4 years ago

What about using a feature flag to control if we notify or not?

kovukono commented 4 years ago

With the PR as it currently is, it's only restarting if the agent.yml changes, which is what we want from the notify. Or is there something different you meant?

kovukono commented 4 years ago

Sorry for the break. I'm still unclear as to what a feature flag would accomplish, as we're trying to restart the service just if the agent.yml changes.

derekgroh commented 4 years ago

Feature flags would employ using tags to control the behavior of the chef run to skip or execute code.

agent_restart = tagged?('restart_agent')

file ‘/tmp/foo.txt’ do
  content ‘foo’
  notifies :restart, ‘service[bar]’
  action :create if agent_restart
end

service ‘bar’ do
  action :nothing
end

Alternatively you can use this example in NTP cookbook to restart a service after a file change: https://github.com/chef-cookbooks/ntp/blob/master/recipes/default.rb#L104#L149

kovukono commented 4 years ago

I'm a bit confused. Why would we add this extra logic to say whether or not we would restart the agent's service? At the moment, it never restarts it, which leads to agents never picking up modified configurations. Is there a use case that I'm missing that we would want an agent not to have the latest configuration we pushed?

majormoses commented 4 years ago

I spoke with @derekgroh on this and this what I recommend:

  1. implement it as simple as possible right now with a notifies :delayed action[1]
  2. if you think you need immediate for some reason please outline the use case why this makes sense and then add a var to control this

I originally suggested a feature flag because it seemed like there were some scenarios where that might not be desired. I think its important to look at the context context. As a resource cookbook we have to generally provide a lot of flexibility which includes enabling use cases where you want to provide guardrails. Given that this is specific to the agent file I think it would not be too bad to update that to an immediate if there was a reason to (order of operations, checking valid config, etc). If this was for other resource types such as checks, filters, handlers, etc I would be more guarded because it can lead to "thundering restarts".

The simplest solution that I am :+1: on is:

file ::File.join(new_resource.config_home, 'agent.yml') do
  content(JSON.parse(new_resource.config.to_json).to_yaml.to_s)
  notifies :restart, 'service[sensu-agent]', :delayed
end

If we need more complexity and guard rails we really need to outline the use cases so we can provide meaningful guardrails.

kovukono commented 4 years ago

I added a commit yesterday to choose when to restart it, and just now pushed another one to control whether or not it restarts at all.

majormoses commented 4 years ago

haven't had a chance to dig deeper but I smell a possible issue with an unless ... || condition It's better to use positive assertions or negations when dealing with multiple boolean conditions to avoid subtle logic flaws. See http://www.railstips.org/blog/archives/2008/12/01/unless-the-abused-ruby-conditional/ for more info on that.

kovukono commented 4 years ago

I've tested it, and like your link says, it does indeed apply the negative to all conditions. Do you want me to rework it to be only if statements?

kovukono commented 4 years ago

Do I need to rework the unless statements, or is there anything else I need to work on for the PR?

majormoses commented 4 years ago

Hey, I will probably take a look at this over the weekend, I am generally not a fan of using unless with multiple conditions. In this case I would actually break out the logic into a helper function that could be called. Also I see in the non immediate one we dont specify :delayed I assume thats the default then but I think its better to be explicit.

kovukono commented 4 years ago

@majormoses I was wondering if you had had a chance to look at this PR yet?

derekgroh commented 4 years ago

@kovukono Sorry for the delay, we discussed the PR in Slack (https://sensucommunity.slack.com/archives/C67D53SC8/p1589221631017000) and wondered if you could further detail

Otherwise, the common Chef pattern is to restart the service at the end of the chef run with the default behavior :delayed.

kovukono commented 4 years ago

@derekgroh That specific suggestion was made by @majormoses. The original PR was just to set it to run as delayed in 2adee42, but was changed to include the immediate in 8564608 after you and majormoses asked for additional control flags. If we want to go back to the first commit, where it only restarts delayed, I'm perfectly fine with that. I'd just really like to finally come to a decision on this.

majormoses commented 4 years ago

@kovukono I apologize for the confusion and frustration here. Reading back I can see how I was not as clear I could have been and being unresponsive/busy did not help. The consensus from the conversation in slack is that we need a very simple way to restart on change in a delayed fashion. I am all for whatever gets us there without breaking people.

I took another peak at the code and there are some slight challenges here that I had not accounted for when discussing this in slack. As I was not looking at anything in the PR other than the comments and suggested changes without further inspecting the code around it.

Current Behavior

This might be a bit oversimplified (especially on the windows side, I see some powershell magic I didn't look into)

  1. configures repositories for the relevant platforms
  2. install package
  3. drop agent config on disk
  4. define service in chefs resource collection, drop config on disk
  5. setup service

On the next converge we will not get a restart on the agent, agreed this is not desirable.

Problem Solving

Unfortunately it is (probably) not safe after inspecting the code to just restart as the agent config as this happens before a service is defined in chefs resource collection. Thus a change to the order of operations may help simplify things for us. I don't currently have much time to do much testing and debugging this further but I think this might help get us back on the right track, in case I rebase/amend that commit to my branch later here is a diff:

$ git diff origin/master 
diff --git a/resources/agent.rb b/resources/agent.rb
index 972057d..74dbf82 100644
--- a/resources/agent.rb
+++ b/resources/agent.rb
@@ -54,9 +54,15 @@ action :install do
       end
     end

+    # create a service resource so it can be safely called to restart on the first convergene
+    service 'sensu-agent' do
+      action :nothing
+    end
+
     # render template at /etc/sensu/agent.yml for linux
     file ::File.join(new_resource.config_home, 'agent.yml') do
       content(JSON.parse(new_resource.config.to_json).to_yaml.to_s)
+      notifies :restart, 'service[sensu-agent]', :delayed
     end

     # Enable and start the sensu-agent service
@@ -81,9 +87,15 @@ action :install do
     # Adds install directory to path
     windows_path node['sensu-go']['sensu_bindir']

+    # create a service resource so it can be safely called to restart on the first convergene
+    service 'SensuAgent' do
+      action :nothing
+    end
+
     # render template at c:\Programdata\Sensu\config\agent.yml for windows
     file ::File.join('c:/ProgramData/Sensu/config/', 'agent.yml') do
       content(JSON.parse(new_resource.config.to_json).to_yaml.to_s)
+      notifies :restart, 'service[SensuAgent]', :delayed
     end

     # Installs SensuAgent Service

Any chance someone can test this out and report back if there are issues with this approach?

derekgroh commented 4 years ago

Working on setting up a test now.

Going to test the direct code then run it through my own wrapper changing the values of config to ensure the agent is restarting after a change to agent.yml file.

kovukono commented 4 years ago

I was able to check the Slack discussion this morning. checked with your updates, and we get the following convergence with extra service declaration:

           * file[/etc/sensu/agent.yml] action create
             - create new file /etc/sensu/agent.yml
             - update content in file /etc/sensu/agent.yml from none to e3463d
             --- /etc/sensu/agent.yml   2020-05-12 09:49:55.593740491 -0400
             +++ /etc/sensu/.chef-agent20200512-8767-12u5by4.yml    2020-05-12 09:49:55.593740491 -0400
             @@ -1 +1,23 @@
             {snip}
             - restore selinux security context
           * service[sensu-agent] action enable
             - enable service service[sensu-agent]
           * service[sensu-agent] action start
             - start service service[sensu-agent]
           * service[sensu-agent] action restart
             - restart service service[sensu-agent]

Without the second service, we get this:

           * file[/etc/sensu/agent.yml] action create
             - create new file /etc/sensu/agent.yml
             - update content in file /etc/sensu/agent.yml from none to e3463d
             --- /etc/sensu/agent.yml   2020-05-12 10:19:25.794695981 -0400
             +++ /etc/sensu/.chef-agent20200512-8770-1hi19ey.yml    2020-05-12 10:19:25.793695970 -0400
             @@ -1 +1,23 @@
             {snip}
             - restore selinux security context
           * service[sensu-agent] action enable
             - enable service service[sensu-agent]
           * service[sensu-agent] action start
             - start service service[sensu-agent]
           * service[sensu-agent] action restart
             - restart service service[sensu-agent]

With or without the empty service declaration, we're getting the same functionality. I haven't been able to test for Windows, however. I'm not sure we should be double-declaring the service. The Chef documentation is also declaring the resource afterward as well (see "Notify in a specific order"). Are we sure we want to add an additional declaration?

derekgroh commented 4 years ago

We should be using a new action, we're still trying to do this under install. Correct me if I'm mistaken, but this behavior would be to reload the config after the install during the lifecycle of the agent.

majormoses commented 4 years ago

If we don't need to use the empty service declaration that's fine with me, I am not in the weeds of chef these days like I used to. I know that something like this used to fail without doing that but its very likely they have made it easier / more forgiving. I will take a look later but if the only reason to have a configure attribute is just to restart config then I'd say we add it to install, if there are other things we realistically want to do then I think breaking that into its own action might make more sense. If this was the backend I would 100% agree as you may want to control the restart logic outside of chef (to facilitate HA restarts/reloads in a cluster).

kovukono commented 4 years ago

I took a look at moving it to a configure resource:

action :configure do
  if node['platform'] != 'windows'
    # render template at /etc/sensu/agent.yml for linux
    file ::File.join(new_resource.config_home, 'agent.yml') do
      content(JSON.parse(new_resource.config.to_json).to_yaml.to_s)
      notifies :restart, 'service[sensu-agent]', :delayed
    end
  else
    # render template at c:\Programdata\Sensu\config\agent.yml for windows
    file ::File.join('c:/ProgramData/Sensu/config/', 'agent.yml') do
      content(JSON.parse(new_resource.config.to_json).to_yaml.to_s)
      notifies :restart, 'service[SensuAgent]', :delayed
    end
  end
end

However, because the service is not declared within this resource, the run fails. Adding an empty service is also failing, due to systemctl not being able to find the sensu-agent service--even though I'm calling the configure action after the package installation within the install resource, it appears to be trying to run the configure before it actually installs the yum package, or anything else within the install action.

        * sensu_agent[default] action install
           * service[sensu-agent] action nothing (skipped due to action :nothing)
           * file[/etc/sensu/agent.yml] action create
             - create new file /etc/sensu/agent.yml
             - update content in file /etc/sensu/agent.yml from none to e3463d
             --- /etc/sensu/agent.yml   2020-05-12 11:56:51.966813538 -0400
             +++ /etc/sensu/.chef-agent20200512-8774-rao1h7.yml 2020-05-12 11:56:51.966813538 -0400
             @@ -1 +1,23 @@
             {snip}
           * service[sensu-agent] action restart

             ================================================================================
             Error executing action `restart` on resource 'service[sensu-agent]'
             ================================================================================

             Mixlib::ShellOut::ShellCommandFailed
             ------------------------------------
             Expected process to exit with [0], but received '5'
             ---- Begin output of /bin/systemctl --system restart sensu-agent ----
             STDOUT: 
             STDERR: Failed to restart sensu-agent.service: Unit not found.

We can't delay the configure because the service needs to enable and start, but it needs a valid agent.yml in order to be able to start successfully. It looks as if the only way a configure action would work would be to move all service management to that action. Even then, we'd end up with the same enable -> start -> restart path, given the issue mentioned here with checking with updated? and updated_by_last_action?

derekgroh commented 4 years ago

I don't think resources within actions can notify other resources in a different action. This is what I'm currently testing, but your point is valid, should we add a guard type action to ensure the service is installed?

action :configure do
  if node['platform'] != 'windows'
    # render template at /etc/sensu/agent.yml for linux
    file ::File.join(new_resource.config_home, 'agent.yml') do
      content(JSON.parse(new_resource.config.to_json).to_yaml.to_s)
    end

    # Restart the sensu-agent service to reload config file
    service 'sensu-agent' do
      action :restart
    end
  end
...
end
kovukono commented 4 years ago

Actions definitely cannot notify actions within other resources, which was why I had to include the empty service. For yours, we wouldn't want to set the sensu-agent service to restart automatically. We instead need the agent.yml to notify the service to restart, which would have action :nothing, or else this would restart the service every Chef run.

We would definitely need a guard if we included the service there, but I really think that we'd be fine with this still within the install action, with a notification for the restart on the agent.yml. Creating the configure action means:

We'd be doing all that extra engineering for the same result of adding an extra line to the install action.

derekgroh commented 4 years ago

That's true, we're still looking for behavior on only restarting if the content is changed or anytime the agent.yml is touched regardless of content change?

kovukono commented 4 years ago

We'd just need it restarted if content changed. I don't think we'd need it restarted if, for example, we changed owner or readability, but I don't think that's an edge case we can address with a notifies. But yes, if the content is changed at all, we would want it restarted.

derekgroh commented 4 years ago

Ok, going back to https://github.com/sensu/sensu-go-chef/pull/87#issuecomment-627391494 (no :configure action), if you remove the :start from the last service 'sensu-agent' does that give you the behavior you seek?

kovukono commented 4 years ago

It should, but it would also not ensure that the agent was running if, for some reason, it was stopped and there was no change made to the agent.yml. This wouldn't be a problem on the initial run, but it could potentially cause problems afterward if manual actions were taken outside of Chef.