saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0
14.19k stars 5.48k forks source link

service does not restart after watched config #14183

Closed kevich closed 5 years ago

kevich commented 10 years ago

os: opensuse service_provider: rh_service salt-call --versions-report Salt: 2014.1.5 Python: 2.7.2 (default, Aug 19 2011, 20:41:43) [GCC] Jinja2: 2.6 M2Crypto: 0.21.1 msgpack-python: 0.2.4 msgpack-pure: Not Installed pycrypto: 2.3 PyYAML: 3.10 PyZMQ: 13.0.0 ZMQ: 3.2.2

scenario:

if we edit pg_hba so file.replace runs again then service restarts ok, but not for the first run. I think it's somehow related to postgres_first_run starting service, but can't find anything relevant in the logs

example states from debug output:

postgresql:
  pkg:
    - installed
    - name: postgresql-server
  service:
    - running
    - enable: True
    - watch:  
      - file: /var/lib/pgsql/data/pg_hba.conf 

/var/lib/pgsql/data/pg_hba.conf: 
  file.replace:
    - pattern: (^host.*127\.0\.0\.1/32\s*).*$ 
    - repl: \1md5 
    - watch: 
      - pkg: postgresql 

postgres_first_run: 
  cmd.wait:
    - name: /etc/init.d/postgresql start >/dev/null 2>&1; sleep 5; echo -e '\nchanged=yes\n' 
    - stateful: True 
    - watch: 
      - pkg: postgresql 
    - watch_in: 
      - file: /var/lib/pgsql/data/pg_hba.conf 
cmclaughlin commented 5 years ago

Commenting on my phone and can test more later... but I would like to clarify the problem is the service is not reloaded after new files are added - where there is a watch between the new file and the already running service.

I don’t think the last Apache test mentioned above demonstrates the problem I’m seeing because Apache will always serve the contents of a html file. A service restart is not needed for that.

waynew commented 5 years ago

@cmclaughlin Yes, but out of a random directory?

where there is a watch between the new file and the already running service.

That's the crux of the matter. It the service isn't already running in your test. The program gets installed (1) the file gets added (2) and the service gets started (3).

There's nothing for salt to restart because the file already existed, in the correct state, before the service was started. Restarting the service won't actually do anything, because there's nothing to do.

If you want to see what I mean - comment this block out:

/tmp/testwatch:
  file.managed:
    - contents: hello
    - watch_in:
      - service: apache-service

Then run state.apply/highstate. Then add it back in and run your state (ensuring, of course, that /tmp/testwatch doesn't already exist). You should see your service restart - that's what I've seen in every test I've tried.

cmclaughlin commented 5 years ago

I'll take another look today.. sorry, I didn't take a close look at the centos example over the weekend / reading on my phone. I see your point about serving out of a random directory... that makes sense now.

cmclaughlin commented 5 years ago

Installing Vagrant now... but I think this is where our expectations are different:

That's the crux of the matter. It the service isn't already running in your test. The program gets installed (1) the file gets added (2) and the service gets started (3).

Based on my example above, here's what I expect:

1) Package gets install 2) Service gets started 3) File with watch gets added 4) Service restarts because of watch on service

cmclaughlin commented 5 years ago

Thinking it through a bit more, it makes sense that I don't see the problem on Debian based systems since apt packages typically start the service automatically.

I suppose I'm proposing Salt's requisite system should handle this special case.

waynew commented 5 years ago

@cmclaughlin and that's what I've failed to communicate well :trollface:

Salt doesn't need to handle the special case because nothing wrong is happening. It's just weird, because it's not what we're used to seeing.

We're used to seeing your expectations - package installed, service gets started, file gets added, and service gets restarted. Why is the service restarted? Because it needs to load the changed file. And for Debian systems, apparently that's the norm - installing the package starts the service.

What we're seeing here is actually OK - because the package is installed, the files get updated, and then the service gets started. It would be silly to start the service just to restart it after adding files, or to add a file, start a service, and then restart the service just because we've already added the file πŸ˜›

I think we find it so weird an unexpected because when you or I manually install a server we're going to fire it up first to make sure it's working, then we'll futz with the config files, and then restart the server. When we do our setup in Salt... we've pre-futzed our files :trollface: πŸ˜†

cmclaughlin commented 5 years ago

Glad we're on the same page now...

I'd like to come up with a way for Salt to handle this situation... at the very least we should have some documentation that this is a common problem which can lead to very undesirable consequences.

I could easily follow the onchanges example/work-around posted above... but I'm thinking service.running should start the service ASAP (by injecting another entry into the low state?) in this situation. Perhaps service.running should have an additional option like eary_start?

I started poking around at the state and service sources... but I don't have a very good grasp on the code flow here. If anyone has ideas how to improve this I'd love to know.

waynew commented 5 years ago

I'm curious - what undesirable consequences are you imagining would happen here?

cmclaughlin commented 5 years ago

Thanks for your patience... I'm all over the place on this one. Maybe my internal code doesn't match up with the tests we've come up with here. I'll take a closer look at my internal Salt code and see if I can track down the problem.

cmclaughlin commented 5 years ago

Thanks again for your patience. I took a look at my internal salt repo and I think I can finally provide a decent example.

In at least one place we have the problem we started the service manually before service.running was called, but didn't enable the service.

I've concluded:

Example:

apache-pkg:
  pkg.installed:
    - name: httpd24

# imagine module installation here
restart-apache-after-module-installation:
  cmd.run:
    - name: service httpd restart

apache-service:
  service.running:
    - name: httpd
    - enable: True

/etc/httpd/foo.crt:
  file.managed:
    - contents: hi
    - watch_in:
      - service: apache-service

State output:

----------
          ID: apache-pkg
    Function: pkg.installed
        Name: httpd24
      Result: True
     Comment: The following packages were installed/updated: httpd24
     Started: 06:23:03.421665
    Duration: 8289.552 ms
     Changes:
              ----------
              httpd24:
                  ----------
                  new:
                      2.4.37-1.83.amzn1
                  old:
----------
          ID: restart-apache-after-module-installation
    Function: cmd.run
        Name: service httpd restart
      Result: True
     Comment: Command "service httpd restart" run
     Started: 06:23:11.714577
    Duration: 134.233 ms
     Changes:
              ----------
              pid:
                  29860
              retcode:
                  0
              stderr:
              stdout:
                  Stopping httpd: [FAILED]
                  Starting httpd: [  OK  ]
----------
          ID: /etc/httpd/wildcard.crt
    Function: file.managed
      Result: True
     Comment: File /etc/httpd/wildcard.crt updated
     Started: 06:23:11.864664
    Duration: 13.596 ms
     Changes:
              ----------
              diff:
                  New file
----------
          ID: apache-service
    Function: service.running
        Name: httpd
      Result: True
     Comment: Service httpd has been enabled, and is in the desired state
     Started: 06:23:11.878634
    Duration: 259.397 ms
     Changes:
              ----------
              httpd:
                  True

Log output:

2019-01-16 06:25:57,323 [salt.state                                                       ][INFO    ][30096] Made the following changes:
'httpd24' changed from 'absent' to '2.4.37-1.83.amzn1'

2019-01-16 06:25:57,350 [salt.state                                                       ][INFO    ][30096] Loading fresh modules for state activity
2019-01-16 06:25:57,402 [salt.state                                                       ][INFO    ][30096] Completed state [httpd24] at time 06:25:57.402692 duration_in_ms=8236.931
2019-01-16 06:25:57,406 [salt.state                                                       ][INFO    ][30096] Running state [service httpd restart] at time 06:25:57.406222
2019-01-16 06:25:57,406 [salt.state                                                       ][INFO    ][30096] Executing state cmd.run for [service httpd restart]
2019-01-16 06:25:57,410 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command 'service httpd restart' in directory '/root'
2019-01-16 06:25:57,530 [salt.state                                                       ][INFO    ][30096] {'pid': 30143, 'retcode': 0, 'stderr': '', 'stdout': 'Stopping httpd: [FAILED]\r\nStarting httpd: [  OK  ]'}
2019-01-16 06:25:57,531 [salt.state                                                       ][INFO    ][30096] Completed state [service httpd restart] at time 06:25:57.531133 duration_in_ms=124.91
2019-01-16 06:25:57,553 [salt.state                                                       ][INFO    ][30096] Running state [/etc/httpd/foo.crt] at time 06:25:57.553259
2019-01-16 06:25:57,553 [salt.state                                                       ][INFO    ][30096] Executing state file.managed for [/etc/httpd/foo.crt]
2019-01-16 06:25:57,569 [salt.state                                                       ][INFO    ][30096] File changed:
New file
2019-01-16 06:25:57,569 [salt.state                                                       ][INFO    ][30096] Completed state [/etc/httpd/foo.crt] at time 06:25:57.569583 duration_in_ms=16.325
2019-01-16 06:25:57,570 [salt.state                                                       ][INFO    ][30096] Running state [httpd] at time 06:25:57.569983
2019-01-16 06:25:57,570 [salt.state                                                       ][INFO    ][30096] Executing state service.running for [httpd]
2019-01-16 06:25:57,571 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command '/sbin/service httpd status' in directory '/root'
2019-01-16 06:25:57,675 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command '/sbin/chkconfig --list httpd' in directory '/root'
2019-01-16 06:25:57,683 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command '/sbin/runlevel' in directory '/root'
2019-01-16 06:25:57,704 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command '/sbin/runlevel' in directory '/root'
2019-01-16 06:25:57,721 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command '/sbin/chkconfig --list httpd' in directory '/root'
2019-01-16 06:25:57,742 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command '/sbin/runlevel' in directory '/root'
2019-01-16 06:25:57,756 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command '/sbin/runlevel' in directory '/root'
2019-01-16 06:25:57,778 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command '/sbin/chkconfig --list httpd' in directory '/root'
2019-01-16 06:25:57,786 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command '/sbin/chkconfig httpd on' in directory '/root'
2019-01-16 06:25:57,809 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command '/sbin/chkconfig --list httpd' in directory '/root'
2019-01-16 06:25:57,815 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30096] Executing command '/sbin/runlevel' in directory '/root'
2019-01-16 06:25:57,821 [salt.state                                                       ][INFO    ][30096] {'httpd': True}
2019-01-16 06:25:57,821 [salt.state                                                       ][INFO    ][30096] Completed state [httpd] at time 06:25:57.821821 duration_in_ms=251.837

After changing cmd.run to service httpd restart && /sbin/chkconfig httpd on... the service is running and enabled when the watch runs... and the watch/restart works:

State output:

          ID: apache-pkg
    Function: pkg.installed
        Name: httpd24
      Result: True
     Comment: The following packages were installed/updated: httpd24
     Started: 06:27:36.354670
    Duration: 8040.531 ms
     Changes:
              ----------
              httpd24:
                  ----------
                  new:
                      2.4.37-1.83.amzn1
                  old:
----------
          ID: restart-apache-after-module-installation
    Function: cmd.run
        Name: service httpd restart && /sbin/chkconfig httpd on
      Result: True
     Comment: Command "service httpd restart && /sbin/chkconfig httpd on" run
     Started: 06:27:44.406192
    Duration: 151.919 ms
     Changes:
              ----------
              pid:
                  30390
              retcode:
                  0
              stderr:
              stdout:
                  Stopping httpd: [FAILED]
                  Starting httpd: [  OK  ]
----------
          ID: /etc/httpd/foo.crt
    Function: file.managed
      Result: True
     Comment: File /etc/httpd/foo.crt updated
     Started: 06:27:44.569892
    Duration: 16.433 ms
     Changes:
              ----------
              diff:
                  New file
----------
          ID: apache-service
    Function: service.running
        Name: httpd
      Result: True
     Comment: Service restarted
     Started: 06:27:44.726029
    Duration: 290.689 ms
     Changes:
              ----------
              httpd:
                  True

Summary for dev-charlestest-1b053.shopstyleops.com
------------
Succeeded: 4 (changed=4)
Failed:    0
------------
Total states run:     4
Total run time:   8.500 s

Logs

2019-01-16 06:27:44,308 [salt.state                                                       ][INFO    ][30342] Made the following changes:
'httpd24' changed from 'absent' to '2.4.37-1.83.amzn1'

2019-01-16 06:27:44,334 [salt.state                                                       ][INFO    ][30342] Loading fresh modules for state activity
2019-01-16 06:27:44,395 [salt.state                                                       ][INFO    ][30342] Completed state [httpd24] at time 06:27:44.395201 duration_in_ms=8040.531
2019-01-16 06:27:44,406 [salt.state                                                       ][INFO    ][30342] Running state [service httpd restart && /sbin/chkconfig httpd on] at time 06:27:44.406191
2019-01-16 06:27:44,406 [salt.state                                                       ][INFO    ][30342] Executing state cmd.run for [service httpd restart && /sbin/chkconfig httpd on]
2019-01-16 06:27:44,410 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30342] Executing command 'service httpd restart && /sbin/chkconfig httpd on' in directory '/root'
2019-01-16 06:27:44,557 [salt.state                                                       ][INFO    ][30342] {'pid': 30390, 'retcode': 0, 'stderr': '', 'stdout': 'Stopping httpd: [FAILED]\r\nStarting httpd: [  OK  ]'}
2019-01-16 06:27:44,558 [salt.state                                                       ][INFO    ][30342] Completed state [service httpd restart && /sbin/chkconfig httpd on] at time 06:27:44.558111 duration_in_ms=151.919
2019-01-16 06:27:44,569 [salt.state                                                       ][INFO    ][30342] Running state [/etc/httpd/foo.crt] at time 06:27:44.569892
2019-01-16 06:27:44,571 [salt.state                                                       ][INFO    ][30342] Executing state file.managed for [/etc/httpd/foo.crt]
2019-01-16 06:27:44,585 [salt.state                                                       ][INFO    ][30342] File changed:
New file
2019-01-16 06:27:44,586 [salt.state                                                       ][INFO    ][30342] Completed state [/etc/httpd/foo.crt] at time 06:27:44.586325 duration_in_ms=16.433
2019-01-16 06:27:44,586 [salt.state                                                       ][INFO    ][30342] Running state [httpd] at time 06:27:44.586681
2019-01-16 06:27:44,587 [salt.state                                                       ][INFO    ][30342] Executing state service.running for [httpd]
2019-01-16 06:27:44,587 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30342] Executing command '/sbin/service httpd status' in directory '/root'
2019-01-16 06:27:44,705 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30342] Executing command '/sbin/chkconfig --list httpd' in directory '/root'
2019-01-16 06:27:44,713 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30342] Executing command '/sbin/runlevel' in directory '/root'
2019-01-16 06:27:44,725 [salt.state                                                       ][INFO    ][30342] The service httpd is already running
2019-01-16 06:27:44,725 [salt.state                                                       ][INFO    ][30342] Completed state [httpd] at time 06:27:44.725779 duration_in_ms=139.097
2019-01-16 06:27:44,726 [salt.state                                                       ][INFO    ][30342] Running state [httpd] at time 06:27:44.726029
2019-01-16 06:27:44,726 [salt.state                                                       ][INFO    ][30342] Executing state service.mod_watch for [httpd]
2019-01-16 06:27:44,727 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30342] Executing command '/sbin/service httpd status' in directory '/root'
2019-01-16 06:27:44,816 [salt.loaded.int.module.cmdmod                                    ][INFO    ][30342] Executing command '/sbin/service httpd restart' in directory '/root'
2019-01-16 06:27:45,016 [salt.state                                                       ][INFO    ][30342] {'httpd': True}
2019-01-16 06:27:45,016 [salt.state                                                       ][INFO    ][30342] Completed state [httpd] at time 06:27:45.016718 duration_in_ms=290.689
2019-01-16 06:27:45,018 [salt.minion
odormond commented 5 years ago

@waynew I finally had time to come up with a streamlined setup of what we are doing here when the problem occurs. I'm attaching a tarball that makes it 100% reproducible with the help of terraform and aws (only free tier resources are used). Running the exhibit.sh script will do the trick and it should be fairly easy to follow.

In short, it spins up an ubuntu server, install the latest salt in master-less mode and copy the states files and pillar and apply the highstate. Then it will change a pillar value and re-apply the highstate at which time the bug will be triggered and the service not-restarted as expected.

In this case, it boils down to a bad interaction between onchanges and watch. Here is the sls for reference:

/tmp/test/index.html:
  file.managed:
    - makedirs: True
    - contents:
      - Hello World
{# this will break the watch on `test_running` #}
    - onchanges_in:
      - test_running

test_systemd_unit:
  file.managed:
    - name: /etc/systemd/system/test.service
    - source: salt://test.service.jinja
    - template: jinja
    - mode: 600
  module.run:
    - name: service.systemctl_reload
    - onchanges:
      - file: test_systemd_unit

test_running:
  service.running:
    - name: test
    - enable: True
    - watch:
      - module: test_systemd_unit

After changing the pillar, the salt-call output is:

local:
----------
          ID: /tmp/test/index.html
    Function: file.managed
      Result: True
     Comment: File /tmp/test/index.html is in the correct state
     Started: 15:48:28.159757
    Duration: 26.06 ms
     Changes:
----------
          ID: test_systemd_unit
    Function: file.managed
        Name: /etc/systemd/system/test.service
      Result: True
     Comment: File /etc/systemd/system/test.service updated
     Started: 15:48:28.186076
    Duration: 24.858 ms
     Changes:
              ----------
              diff:
                  ---
                  +++
                  @@ -9,7 +9,7 @@
                   Type=simple
                   Restart=on-failure
                   WorkingDirectory=/tmp/test
                  -ExecStart=/usr/bin/python3 -m http.server 8080
                  +ExecStart=/usr/bin/python3 -m http.server 8081

                   [Install]
                   WantedBy=multi-user.target
----------
          ID: test_systemd_unit
    Function: module.run
        Name: service.systemctl_reload
      Result: True
     Comment: Module function service.systemctl_reload executed
     Started: 15:48:28.212099
    Duration: 474.369 ms
     Changes:
              ----------
              ret:
                  True
----------
          ID: test_running
    Function: service.running
        Name: test
      Result: True
     Comment: State was not run because none of the onchanges reqs changed
     Started: 15:48:28.687950
    Duration: 0.015 ms
     Changes:

Summary for local
------------
Succeeded: 4 (changed=2)
Failed:    0
------------
Total states run:     4
Total run time: 525.302 ms

As you can see, the test_systemd_unit is triggered and produces changes which is picked-up by the onchanges on module.run that does reload the systemd unit file but then the test_running state is not triggered, pretending that none of the onchanges reqs changed. This is technically true as its onchanges points to the file.managed of /tmp/test/index.html which didn't change but it happily ignores the watch which does trigger it correctly if you remove the onchanges_in of the index.html.

Cheers,

Olivier

waynew commented 5 years ago

@cmclaughlin Okay, so I discovered that you are definitely finding some unexpected behavior! It took a while for me to nail down exactly what's happening - I'm not sure if it's just a bug in our documentation, or if it's an actual bug.

What's happening here is kind of crazy.

If you remove the enable: True, or split it into

enable-apache:
  service.enabled:
    - name: httpd24

Everything works πŸŽ‰

The question you might have then is, "Buy why??"

Which is a darn good question 🀣

Turns out that the answer is pretty simple though - the only thing that's going wrong here is that by starting (but not enabling) a service, when Salt looks at the service it says, "Ah, yeah, running is good. Oh, but let me enable this service. Cool, I've made some changes to it. What, you have a watch? No, I don't need to do anything, I've already changed this service!"

Oops :trollface:

I definitely agree that it seems strange, or perhaps even unintuitive, but it is at least consistent, and now I understand exactly why πŸ™‚


Just a had a quick conversation, and this is definitely intended behavior - I'm going to see about adding some information to our docs that calls it out, and perhaps gives some better explanation.

cmclaughlin commented 5 years ago

@waynew can you give an overview why it's intended behavior here?

waynew commented 5 years ago

@cmclaughlin Sure!

Watch looks for changes in one state, and if the watching state hasn't changed, calls triggers it's watching behavior.

If A is the watched state, and B is the watcher:

Does A have changes? Does B need changes? Tell B to change. Does A have changes? Has B already changed? Cool cool, nothing to do here.

That's it. That's what watching does.


In the confusing case we have the following:

Does the conf file have changes? Yeah, cool, does apache have changes? Oh, we enabled the service? Yeah, cool, nothing more to do here.

It's following the pattern, because enabling a service is a change. And this change happened after the config file was changed, so as far as watch is concerned, everything is A-OK and in the correct state.

It's confusing because we all expected service.running to be what was watched, but it's actually the state apache-running, and since the state had changes... It's a bit like a magician using misdirection, but instead of surprise and delight, we were all just annoyed 😝

The takeaway here is that the watcher is only required to change if it has not already.

apache-running had changes, and from the watch perspective, there's no way to distinguish if that change was service started, or service enabled. I don't even think there should be. I was thinking about the idea of creating an RFC to allow states to produce an "ignore this change for watches", but that would create terribly inconsistent behavior. Consider these examples:

apache enabled:
  service.enabled:
    - name: httpd24

apache running:
  service.running:
    - name: httpd24  

And

apache running and enabled:
  service.running:
    - name: httpd24
    - enable: True

If we want this to match our misunderstanding then service.running should be triggered if we just enable the service, right? Okay, say we do that.

What if someone wants to add watch_in to their apache enabled state? Are we going to say that because it's standalone it gets executed, but when it's part of another state it doesn't? Or are we just blacklisting enable service from watches/requisites?

There's really no consistent way we could change the behavior.

I do think that it's worth adding this kind of example to our docs, to explicitly call out the fact that watching is only triggered when a state has no changes, and that those changes may not match expectations.

Does that all make sense, or is there still some confusion?

cmclaughlin commented 5 years ago

Thanks, but it still doesn't really make sense to me.

I can see how there's a difference in behavior between watches on service.running and service.enabled... but if enabled is going to be a supported option on service.running I would expect it to behave exactly like a separate call to service.enabled.

The part about "enabling a service is a change" so the restart doesn't happen is totally counterintuitive to me. I would expect when file.managed is run, that's when the service status would be checked... and Salt would see the service is running, therefore a restart is required. From my end-user perspective, I don't see any reason why Salt can't enable the service and restart it. Why can only one change happen?

waynew commented 5 years ago

Thanks, but it still doesn't really make sense to me.

I think you actually just explained where our disconnect is πŸ™‚ Let's see if it helps:

The part about "enabling a service is a change" so the restart doesn't happen is totally counterintuitive to me. I would expect when file.managed is run, that's when the service status would be checked...

That's a totally natural assumption... but how would you architect that? Would you couple file.managed to service.running? Where would you stop that coupling?

and Salt would see the service is running, therefore a restart is required. From my end-user perspective, I don't see any reason why Salt can't enable the service and restart it. Why can only one change happen?

Sure! I'd love to make that possible... but how would I implement that? I started to think about how to make that work but then I ran into this problem. How would you ensure that:

apache enabled:
  service.enabled:
    - name: httpd24

apache running:
  service.running:
    - name: httpd24  

and

apache running and enabled:
  service.running:
    - name: httpd24
    - enable: True

Behave identically when they're watching another state?

I couldn't come up with any good plans - either I ended out with tight coupling, or adding a bunch of extra logic for each individual special case. Or they started making Salt behave inconsistently, which I think is probably the worst thing of all.

Do you have any ideas how we could make that happen?