Closed. @ahammond closed this issue 6 years ago.
@markahopper This one is a show-stopper for us.
@white-hat
@ahammond would you mind sharing the pillar that is causing this issue? Also, is this consistent behavior with the same minions? Are your minions all on the same version?
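For reference, one quick way to check versions across a fleet (a generic sketch, not the reporter's actual commands): the standard test module on the minions and the versions report on the master.

salt \* test.version
salt --versions-report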
Whoops, sorry; I've updated the description. Also, this pillar generally works (picking a minion at random):
ahammond@salt:~$ sudo salt hss4 pillar.get salt_map
hss4:
----------
lookup:
----------
junk:
junk
ahammond@salt:~$ sudo salt hss4 pillar.get salt_map:lookup:junk
hss4:
junk
Okay, I've tried to set up a test case and I am not able to replicate this behavior. Maybe you can point me in the right direction as to whether my test case is wrong. Here is my test case:
5 minions:
[root@ip-10-63-90-123 ~]# salt-key
Accepted Keys:
ch3ll-master-salt
ch3ll-web1
ch3ll-web2
ch3ll-web3
ch3ll-web4
I'm wondering if 5 might not be enough. Do you see this behavior with a smaller group of minions? I can also try to add more if needed.
Using the pillar you posted in the example above and running pillar.get:
[root@ip-10-63-90-123 ~]# sudo salt --out=text -b 50 \* pillar.get salt_map:lookup:junk NOPE | grep -v junk
Executing run on ['ch3ll-web2', 'ch3ll-web3', 'ch3ll-web1', 'ch3ll-master-salt', 'ch3ll-web4']
retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0
As shown above, I never see anything similar to hss386: {}. In fact, when I completely remove the pillar from the minions and only target one of them in the pillar top file, I still do not see {}, as shown below:
[root@ip-10-63-90-123 ~]# cat /srv/pillar/top.sls
base:
'ch3ll-web3':
- salt_map
[root@ip-10-63-90-123 ~]# sudo salt --out=text -b 50 \* pillar.get salt_map:lookup:junk NOPE | grep -v junk
Executing run on ['ch3ll-web2', 'ch3ll-web3', 'ch3ll-web1', 'ch3ll-master-salt', 'ch3ll-web4']
ch3ll-master-salt: NOPE
retcode: 0
retcode: 0
ch3ll-web2: NOPE
ch3ll-web1: NOPE
retcode: 0
retcode: 0
ch3ll-web4: NOPE
retcode: 0
Or with a smaller batch size:
[root@ip-10-63-90-123 ~]# sudo salt --out=text -b 2 \* pillar.get salt_map:lookup:junk NOPE | grep -v junk
Executing run on ['ch3ll-web2', 'ch3ll-web3']
retcode: 0
retcode: 0
ch3ll-web2: NOPE
Executing run on ['ch3ll-web1', 'ch3ll-master-salt']
ch3ll-master-salt: NOPE
retcode: 0
ch3ll-web1: NOPE
retcode: 0
Executing run on ['ch3ll-web4']
ch3ll-web4: NOPE
retcode: 0
Any information to help me replicate this?
@Ch3LL So... the NOPE should never appear since every minion should be able to find salt_map:lookup:junk. Seems to me you've replicated at least that failure. Getting the empty dictionaries... now I don't know how to get that to replicate.
Apologies, I don't think I was very clear on my test case. Where NOPE shows up is in the test case where I only add the pillar to ch3ll-web3 and it is not applied to any of the other minions. I was trying to create an instance where I would see the empty {}, so this would not be a replication of the issue from what I can tell. From what I understand of your use case, the pillar should be applied to all minions.
When all the minions have that pillar applied, I see the following:
[root@ip-10-63-90-123 ~]# sudo salt --out=text -b 50 \* pillar.get salt_map:lookup:junk NOPE | grep -v junk
Executing run on ['ch3ll-web2', 'ch3ll-web3', 'ch3ll-web1', 'ch3ll-master-salt', 'ch3ll-web4']
retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0
Do you have anything unusual in your pillar setup? Maybe gitfs or pillar caching, similar to this?
Also, is there anything relevant in the debug logs on the master or minion side?
This pillar comes from a git ext_pillar. We have other ext pillars. The code for the custom ext_pillars hasn't changed in over 4 months.
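For context, a git ext_pillar is typically wired into the master config along these lines (an illustrative sketch with a placeholder branch and URL, not the actual config from this deployment):

ext_pillar:
  - git: master https://example.com/pillar.git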
Ok, following our switch back to 0mq, this doesn't appear to be replicating anymore.
So it seems this is an issue with the TCP transport then? I will label it as a TCP transport issue. We will keep this open, since it will still need to be fixed for users who are using the TCP transport.
Nope, we're seeing it again using 0mq. I think this has more to do with batch mode's return handling than anything else; the TCP transport just made it more obvious. Here's some partial data from a recent batched run. I'll get a better example later today.
retcode: 0
hss392: Stopping Clam AntiVirus Daemon: [ OK ]
hss374: Stopping Clam AntiVirus Daemon: [ OK ]
retcode: 0
hss389: Stopping Clam AntiVirus Daemon: [ OK ]
retcode: 0
retcode: 0
hss140: Stopping Clam AntiVirus Daemon: [FAILED]
Executing run on ['hss445', 'hss409', 'hss151', 'hss402', 'hss401', 'hss152']
retcode: 0
hss89: Stopping Clam AntiVirus Daemon: [ OK ]
Executing run on ['hss407']
hss1094: {}
hss1097: {}
hss1091: {}
hss1090: {}
hss1093: {}
hss1092: {}
hss171: {}
hss1099: {}
hss1098: {}
hss319: {}
hss314: {}
hss315: {}
hss312: {}
hss494: {}
hss413: {}
hss395: {}
hss396: {}
hss397: {}
hss417: {}
hss414: {}
hss419: {}
hss398: {}
hss250: {}
hss169: {}
hss1083: {}
hss74: {}
hss1086: {}
hss1087: {}
hss1084: {}
hss1085: {}
hss1088: {}
hss1089: {}
hss79: {}
hss309: {}
hss308: {}
hss6: {}
hss303: {}
hss302: {}
hss307: {}
hss306: {}
hss305: {}
hss304: {}
hss383: {}
hss269: {}
hss381: {}
hss380: {}
hss387: {}
hss386: {}
hss176: {}
hss384: {}
hss178: {}
hss261: {}
hss262: {}
hss264: {}
hss265: {}
hss266: {}
hss466: {}
hss170: {}
hss465: {}
hss462: {}
hss463: {}
hss382: {}
hss172: {}
hss468: {}
hss469: {}
hss173: {}
hss175: {}
hss277: {}
hss276: {}
hss275: {}
hss274: {}
hss272: {}
hss271: {}
hss270: {}
hss260: {}
hss378: {}
hss379: {}
hss376: {}
hss377: {}
hss373: {}
hss1101: {}
hss1100: {}
hss849: {}
hss841: {}
hss412: {}
hss1118: {}
hss179: {}
hss393: {}
hss109: {}
hss1126: {}
hss448: {}
Executing run on ['hss406', 'hss405', 'hss148', 'hss481', 'hss480', 'hss487', 'hss520', 'hss521', 'hss523', 'hss769', 'hss803', 'hss802', 'hss437', 'hss13', 'hss12', 'hss11', 'hss439', 'hss1122', 'hss452', 'hss403', 'hss129', 'hss122', 'hss150', 'hss331', 'hss333', 'hss332']
hss817: {}
hss1117: {}
hss445: {}
hss152: {}
hss401: {}
hss402: {}
hss151: {}
hss409: {}
hss407: {}
hss332: {}
hss122: {}
hss331: {}
hss333: {}
hss129: {}
hss148: {}
hss439: {}
hss11: {}
hss12: {}
hss13: {}
hss452: {}
hss437: {}
hss1122: {}
hss802: {}
hss803: {}
hss523: {}
hss521: {}
hss520: {}
hss487: {}
hss480: {}
hss481: {}
hss405: {}
hss406: {}
hss150: {}
hss403: {}
hss769: {}
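A rough way to quantify the failure rate, assembled from the commands already quoted above (a sketch, not output from this run): run the same pillar.get with and without batching and count the empty-dictionary returns.

sudo salt --out=text \* pillar.get salt_map:lookup:junk NOPE | grep -c '{}'
sudo salt --out=text -b 50 \* pillar.get salt_map:lookup:junk NOPE | grep -c '{}'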
While there are a few places where this could occur, the most common place where we'll see the client yield an empty dictionary like that is when it can't find the job in the master job cache. Do you ever see a warning message from the client that says something like "jid does not exist" or "Returner unavailable"?
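For anyone checking this, the master job cache can be inspected with the standard jobs runner, and those warnings can be grepped out of the default master log (the jid below is a placeholder):

salt-run jobs.list_jobs
salt-run jobs.lookup_jid <jid>
grep -E 'jid does not exist|Returner unavailable' /var/log/salt/master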
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.
Description of Issue/Question
While applying a state across a large fleet of minions, Salt silently fails to find/apply pillars on roughly 2% of targets, which leads to misconfigured hosts.
Setup
Relevant snippet only. Note that this pillar has been in place since at least July 7th, so it should be visible on the minions by now.
pillar/top.sls
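The snippet itself is not preserved in this copy of the issue; a minimal top file consistent with the behavior described (the salt_map pillar applied to every minion) would look like:

base:
  '*':
    - salt_map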
pillar/salt_map/init.sls
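Also reconstructed rather than preserved; a salt_map pillar matching the pillar.get output quoted earlier (salt_map:lookup:junk returning junk) would be:

salt_map:
  lookup:
    junk: junk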
Steps to Reproduce Issue
First without batch
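The non-batch command and its output are not preserved here; based on the batch invocations quoted throughout the thread, it was presumably along the lines of:

sudo salt --out=text \* pillar.get salt_map:lookup:junk NOPE | grep -v junk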
So, this is ugly, but having a handful of unconnected minions is no big deal. Now with batch mode (the salt master was restarted about an hour or two ago, so it should be about as stable as it can be). I've grep'd out the "junk" lines, as they are successes.
retcode: 0 is probably also a success.
Versions Report
Master (We have applied patch https://github.com/saltstack/salt/pull/36024 to the master, but not any minions)
Minions