saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0
14.19k stars 5.48k forks source link

Reasons for asynchronous Pillar refresh and behavior of modules/saltutil.sync_all #50105

Open afischer-opentext-com opened 6 years ago

afischer-opentext-com commented 6 years ago

Hi,

I am currently attempting to understand issues I see in regards to not-up-to date pillar in combination with the use of saltenv/pillarenv. Currently I am not in the position to file a valid defect due to lack of understanding of how and why the code behaves (by purpose).

Our setup is that we have all runners, states, modules, pillar and grain data in a git repository and highly rely on the saltenv/pillarenv feature to segregate different versions of the code base. As we highly rely on the saltenv feature (salt master configuration pillarenv_from_saltenv: True) we perform in front of any highstate a call saltutil.sync_all(refresh=True, saltenv=some_value) to ensure the highstate is using the expected custom modules, grains, pillar etc, available in the mentioned saltenv.

For some reason we observe from time to time that the pillar or grain data is not accurate and as expected. When reviewing the code of the saltutil.sync_all method I got some questions. The method includes this code

    if refresh:
        refresh_modules()
        refresh_pillar()

This means the refresh_modules call happens asynchronous and the method immediately returns before starting the refresh_pillar. Is this good? Wouldn't it make sense to be sure the modules and grains are refreshed before refreshing the pillar? I am not sure what the methods are exactly doing but I can imagine a situation in which Jinja code is rendering pillar data using custom module code or grains. If this module code is not confirmed to be current, pillar data might be not accurate.

Is the refresh_modules() call really needed here? Isn't refresh_modules a part of refresh_pillar already?

My other question is: Why is there no way to refresh pillar in a synchronous way? Currently we workaround this by adding a time out of some seconds but consider this only a hack which - even worse - still does not reliably work for some reasons we do not yet understand. This might be also caused by Jinja magic performing delaying calls but we are not yet sure about it.

Any input which may increase the understanding is highly appreciated.

The currently used Salt version is 2017.7.5

Ch3LL commented 6 years ago

I am not sure what the methods are exactly doing but I can imagine a situation in which Jinja code is rendering pillar data using custom module code or grains. If this module code is not confirmed to be current, pillar data might be not accurate.

refresh_modules is calling self._load_modules(force_refresh, notify=notify) which just returns the modules and returners which can be loaded. It is not actually rendering jinja at this step.

Is the refresh_modules() call really needed here? Isn't refresh_modules a part of refresh_pillar already?

looks like you are indeed correct. the pillar_refresh does eventually call self.module_refresh in minion.py so looks like it is redundant in this case.

Why is there no way to refresh pillar in a synchronous way?

Looks like this is currently possible for refresh_modules right now and that was added here: https://github.com/saltstack/salt/pull/20694 so we might be able to add that to refresh_pillar as well.

I do want to help you dive into your actual issue: "For some reason we observe from time to time that the pillar or grain data is not accurate and as expected. " Can you elaborate on this a bit more? Can you include a use case or logs when this occurs?

afischer-opentext-com commented 6 years ago

Hi, thanks very much for the details! I am still working to isolate the issue further. Imho it would indeed help to have the refresh_pillar synchronously available. It's a bit challenging due to the amount of available saltenv, high amount of calls and number of minions we have here to find the relevant logs. One piece I still don't get is e.g. which component is finally really processing/rendering the pillar.

Is it the master and the minion (and both ideally compute the (eventually) same data)? Is it the minion which than sends the pillar to the master for further reuse? Is it the master using data from the minion?

I am asking this to get a better catch on what to look for.

We are only at the beginning of our analysis at the moment. Basically the issue is that from time to time we see that minions either lack pillar information. At the same time we also see that the minion synchronizes environments it should not know about at all. That may be a different issue or not. Most of the time everything works as expected, sometimes not. I know I am very unspecific at the moment, sorry for that.

afischer-opentext-com commented 6 years ago

I believe we found the causes for our issues. It is my understanding now that the salt-master only compiles the pillar using grain and configuration information of the minion. Basically our issue was a combination of pillar merge strategies not fitting our use case.

Still, I think it would be helpful to have the refresh_pillar in some way synchronous. It would already help if a event would be fired on the master bus so a runner could catch it. This event could then be used by a runner (using saltmod.wait_for_event()).

The event would need to be fired either in minion.py\Minion.pillar_refresh() , right?

What would be the chances a PR in that direction would be approved?

Ch3LL commented 6 years ago

without diving into the code I believe that would be the area. ping @saltstack/team-core might be able to show you exactly where. I will go ahead and approve this as a feature request :)

gtmanfred commented 6 years ago

you do not need the runner.

What you should do is exactly what the refresh_modules does when it asyncronous is False.

https://github.com/saltstack/salt/blob/2018.3/salt/modules/saltutil.py#L979

            eventer = salt.utils.event.get_event('minion', opts=__opts__, listen=True)
            ret = __salt__['event.fire']({'notify': True}, 'module_refresh')
            # Wait for the finish event to fire
            log.trace('refresh_modules waiting for module refresh to complete')
            # Blocks until we hear this event or until the timeout expires
            eventer.get_event(tag='/salt/minion/minion_mod_complete', wait=30)

You need to make sure to listen to the minion event bus, then you fire an event to refresh pillar, exactly like it does now, and then you listen for the event for when the pillars are completed refresh.

Make sure to initialize the event listener first, because if it returns to quickly you might miss the event if you try to initialize the listener after the refresh event has been fired.

gtmanfred commented 6 years ago

You will also need to add fireing the event, probably from the _gather_pillar() function in salt.minion, the same way that the minion_mods loader fires an event when notify is set to True.

https://github.com/saltstack/salt/blob/2018.3/salt/loader.py#L263

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

stale[bot] commented 4 years ago

Thank you for updating this issue. It is no longer marked as stale.