saltstack / salt


Disappearing pillars in batch mode (aka pillar.get() Considered Harmful) #36449

Closed ahammond closed 6 years ago

ahammond commented 8 years ago

Description of Issue/Question

While massively applying a state, salt silently fails to find / apply pillars on ~2% of targets, which leads to misconfigured hosts.

Setup

Relevant snippet only. Note that this pillar has been in place since at least July 7th so it should be visible on the minions by now.

pillar/top.sls

base:
  '*':
    - salt_map

pillar/salt_map/init.sls

salt_map:
  lookup:
    junk: junk
{%- if 'VirtualBox' == grains.get('virtual', None) or 'oracle' == grains.get('virtual', None) %}
    master: 192.168.33.2
    worker_threads: 3
{%- endif %}

Steps to Reproduce Issue

First without batch

ahammond@salt:~$ sudo salt --out=text \* pillar.get salt_map:lookup:junk NOPE | grep -v junk | grep -vF "No response"
hss404: Minion did not return. [Not connected]
intel-tvpn-central-2: Minion did not return. [Not connected]
hss727: Minion did not return. [Not connected]
hss1096: Minion did not return. [Not connected]
buildbotm: Minion did not return. [Not connected]
hss1088: Minion did not return. [Not connected]
hss7: Minion did not return. [Not connected]
devops-testcenter-kaspersky-ua-07: Minion did not return. [Not connected]
intel-proxy-central-2: Minion did not return. [Not connected]
hss461: Minion did not return. [Not connected]
do-kl-sg-singapore-3: Minion did not return. [Not connected]
hss1102: Minion did not return. [Not connected]
hss770: Minion did not return. [Not connected]
hss1064: Minion did not return. [Not connected]
hss1065: Minion did not return. [Not connected]
do-kl-uk-london-1: Minion did not return. [Not connected]
hss980: Minion did not return. [Not connected]
hss984: Minion did not return. [Not connected]
hss1114: Minion did not return. [Not connected]
hss351: Minion did not return. [Not connected]
hss1046: Minion did not return. [Not connected]
hss452: Minion did not return. [Not connected]
hss1038: Minion did not return. [Not connected]

So, this is ugly, but having a handful of un-connected minions is no big deal. Now with batch mode (the salt master was restarted about an hour or two ago, so it should be about as stable as it can be). I've grep'd out the "junk" as they are successes. retcode: 0 is probably also a success.

ahammond@salt:~$ sudo salt --out=text -b 50 \* pillar.get salt_map:lookup:junk NOPE | grep -v junk

Executing run on ['hss188', 'hss189', 'devops-testcenter-kaspersky-nl-05', 'hss187', 'devops-testcenter-kaspersky-nl-07', 'devops-testcenter-kaspersky-nl-06', 'hss182', 'hss183', 'hss180', 'hss181', 'devops-testcenter-kaspersky-gb-02', 'devops-testcenter-kaspersky-gb-03', 'hss402', 'hss1039', 'hss1037', 'hss502', 'hss503', 'hss500', 'hss501', 'hss506', 'hss507', 'hss504', 'hss505', 'devops-testcenter-kaspersky-us-03', 'devops-testcenter-kaspersky-us-02', 'hss508', 'hss509', 'devops-testcenter-hss-04', 'devops-testcenter-hss-05', 'devops-testcenter-hss-06', 'devops-testcenter-hss-07', 'hss133', 'devops-testcenter-kaspersky-nl-04', 'hss134', 'shss99', 'shss98', 'depot', 'shss90', 'hss399', 'devops-testcenter-kaspersky-nl-03', 'devops-testcenter-kaspersky-nl-02', 'devops-testcenter-kaspersky-de-04', 'devops-testcenter-kaspersky-de-05', 'devops-testcenter-kaspersky-de-03', 'devops-testcenter-kaspersky-gb-01', 'hss216', 'hss215', 'devops-testcenter-hss-08', 'devops-testcenter-kaspersky-us-09', 'hss478']

retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0

Executing run on ['hss479', 'devops-testcenter-kaspersky-us-08', 'hss472', 'hss473', 'hss470', 'hss471', 'hss476', 'hss477', 'hss474', 'hss475', 'dev-tvpn1', 'hss109']

retcode: 0
retcode: 0
retcode: 0

Executing run on ['vu-tv-us-ch-1', 'devops-testcenter-hss-01', 'devops-testcenter-hss-02']

retcode: 0
retcode: 0
retcode: 0

Executing run on ['devops-testcenter-hss-03', 'devops-testcenter-kaspersky-us-04', 'devops-testcenter-hss-us-99']

retcode: 0
retcode: 0

Executing run on ['hss179', 'hss178']

devops-testcenter-kaspersky-us-02: {}
hss216: {}
devops-testcenter-kaspersky-de-03: {}
devops-testcenter-kaspersky-de-05: {}
devops-testcenter-kaspersky-de-04: {}
hss399: {}
devops-testcenter-kaspersky-nl-02: {}
hss478: {}
shss90: {}
depot: {}
shss98: {}
shss99: {}
hss133: {}
hss509: {}
hss508: {}
hss402: {}
devops-testcenter-kaspersky-us-03: {}
hss505: {}
hss504: {}
hss507: {}
hss506: {}
hss501: {}
hss500: {}
hss503: {}
hss502: {}
hss1039: {}
devops-testcenter-kaspersky-gb-01: {}
devops-testcenter-kaspersky-gb-03: {}
hss181: {}
hss180: {}
hss183: {}
hss182: {}
devops-testcenter-kaspersky-nl-06: {}
devops-testcenter-kaspersky-nl-07: {}
hss187: {}
devops-testcenter-kaspersky-nl-05: {}
hss189: {}
hss188: {}

Executing run on ['hss176', 'hss175', 'hss173', 'hss172', 'hss171', 'hss170', 'testcenter', 'dev-intel1', 'hss252', 'hss253', 'hss250', 'hss398', 'hss393', 'hss392', 'hss391', 'hss397', 'hss396', 'hss395', 'hss494', 'hss495', 'hss496', 'hss497', 'hss490', 'hss491', 'hss492', 'hss493', 'hss498', 'hss499', 'hss817', 'devops-testcenter-kaspersky-jp-01', 'hss739', 'hss738', 'hss731', 'hss733', 'hss732', 'hss735', 'hss734', 'hss737']

hss475: {}
hss474: {}
hss477: {}
hss476: {}
hss470: {}
hss479: {}
hss109: {}
retcode: 0
dev-tvpn1: {}
devops-testcenter-kaspersky-us-08: {}
retcode: 0

Executing run on ['hss736', 'bsl3', 'bsl1', 'bsl6', 'bsl7', 'bsl4', 'bsl5', 'bsl8', 'bsl9', 'hss400', 'shss131']

retcode: 0
retcode: 0

Executing run on ['shss132', 'shss135']

devops-testcenter-kaspersky-us-04: {}
retcode: 0

Executing run on ['hss426', 'hss429']

hss178: {}
hss179: {}
retcode: 0

Executing run on ['hss428', 'hss80', 'hss81']

hss499: {}
hss498: {}
hss493: {}
hss492: {}
hss491: {}
hss490: {}
hss497: {}
hss496: {}
hss495: {}
hss494: {}
hss395: {}
hss396: {}
hss397: {}
hss391: {}
hss392: {}
hss393: {}
hss398: {}
hss250: {}
hss253: {}
hss252: {}
hss737: {}
hss734: {}
hss735: {}
hss732: {}
hss733: {}
hss731: {}
hss738: {}
hss739: {}
devops-testcenter-kaspersky-jp-01: {}
dev-intel1: {}
hss170: {}
hss171: {}
hss172: {}
hss173: {}
hss175: {}
hss176: {}

Executing run on ['hss83', 'hss748', 'hss85', 'hss86', 'hss744', 'hss89', 'hss746', 'hss747', 'hss740', 'hss741', 'hss742', 'hss743', 'hss190', 'intel-proxy-sfo1-1', 'hss1123', 'hss1120', 'intel-proxy-sfo1-2', 'hss1126', 'hss1127', 'hss758', 'hss291', 'hss1041', 'hss1040', 'hss1043', 'hss1042', 'hss1045', 'hss1044', 'hss299', 'hss106', 'hss510', 'hss104', 'hss512', 'hss102', 'hss514', 'hss100', 'hss101']

hss736: {}
retcode: 0
retcode: 0
bsl8: {}
retcode: 0
retcode: 0
bsl5: {}
bsl4: {}
bsl7: {}
bsl6: {}
bsl1: {}
hss400: {}
shss131: {}
retcode: 0

Executing run on ['hss519', 'hss518', 'devops-testcenter-kaspersky-us-10', 'devops-testcenter-kaspersky-us-11', 'devops-testcenter-kaspersky-us-12', 'devops-testcenter-kaspersky-us-13', 'hss745', 'devops-testcenter-kaspersky-dk-02', 'hss50', 'hss842', 'hss841', 'hss847', 'hss849', 'hss262']

retcode: 0
retcode: 0
retcode: 0
shss132: {}

Executing run on ['hss778', 'hss469', 'hss468', 'hss382']

hss428: {}
hss426: {}
hss81: {}
hss80: {}

Executing run on ['hss460', 'hss463', 'hss462', 'hss465']

retcode: 0
retcode: 0

Executing run on ['hss140', 'hss467']

hss101: {}
hss100: {}
hss514: {}
hss102: {}
hss758: {}
hss104: {}
hss510: {}
hss106: {}
hss299: {}
hss1045: {}
hss1043: {}
hss291: {}
hss512: {}
hss1126: {}
intel-proxy-sfo1-2: {}
hss1120: {}
hss1123: {}
intel-proxy-sfo1-1: {}
hss190: {}
hss743: {}
hss742: {}
hss741: {}
hss740: {}
hss747: {}
hss746: {}
hss89: {}
hss744: {}
hss86: {}
hss85: {}
hss748: {}
hss83: {}

Executing run on ['hss466', 'devops-testcenter-kaspersky-ua-09', 'devops-testcenter-kaspersky-ua-08', 'hss13', 'hss666', 'hss12', 'hss11', 'hss4', 'hss6', 'hss1', 'hss749', 'hss1162', 'hss1085', 'hss1084', 'hss1087', 'hss1086', 'hss1083', 'hss148', 'hss149', 'hss419', 'devops-testcenter-kaspersky-ca-01', 'hss414', 'hss415', 'hss416', 'hss417', 'hss146', 'hss411', 'hss412', 'hss413', 'hss283', 'hss1118']

hss518: {}
hss847: {}
hss841: {}
hss842: {}
devops-testcenter-kaspersky-dk-02: {}
devops-testcenter-kaspersky-us-13: {}
hss745: {}
hss519: {}
hss262: {}
hss50: {}
retcode: 0
hss849: {}

Executing run on ['hss1089', 'hss241', 'hss240', 'hss243', 'hss245', 'hss244', 'hss247', 'hss246', 'hss249', 'hss248', 'intel-tvpn-nyc3-4', 'intel-tvpn-nyc3-5']

hss382: {}
hss778: {}
hss468: {}
hss469: {}

Executing run on ['intel-tvpn-nyc3-2', 'intel-tvpn-nyc3-3', 'intel-tvpn-nyc3-1', 'hss483']

hss467: {}
hss462: {}
hss140: {}
hss460: {}

Executing run on ['hss481', 'hss480', 'hss487', 'hss486']

hss413: {}
hss412: {}
hss411: {}
hss146: {}
hss417: {}
hss416: {}
hss415: {}
hss414: {}
devops-testcenter-kaspersky-ca-01: {}
hss419: {}
hss149: {}
hss148: {}
hss11: {}
hss12: {}
hss13: {}
hss1083: {}
hss1086: {}
hss1087: {}
hss1084: {}
hss1085: {}
hss1162: {}
hss1: {}
hss6: {}
hss4: {}
hss666: {}
hss749: {}
devops-testcenter-kaspersky-ua-09: {}
hss466: {}
hss1118: {}
hss283: {}

Executing run on ['hss485', 'hss484', 'hss520', 'hss521', 'hss522', 'hss488', 'hss525', 'hss526', 'hss527', 'hss803', 'hss802', 'bsl10', 'hss237', 'hss230', 'hss238', 'hss239', 'hss333', 'hss332', 'vu-tv-au-3', 'salt', 'vu-tv-au-4', 'devops-testcenter-kaspersky-se-02', 'devops-testcenter-kaspersky-se-01', 'do-tv-us-ny-1', 'vu-tv-jp-9', 'hss453', 'hss454', 'hss455', 'hss457', 'hss458']

retcode: 0
hss1089: {}
retcode: 0
retcode: 0
retcode: 0
intel-tvpn-nyc3-5: {}
intel-tvpn-nyc3-4: {}
hss248: {}
hss249: {}
hss246: {}
hss247: {}
hss244: {}
hss245: {}
hss243: {}
hss240: {}
hss241: {}

Executing run on ['hss759', 'intel-proxy-nyc3-4', 'intel-proxy-nyc3-3', 'intel-proxy-nyc3-2', 'intel-proxy-nyc3-1', 'hss753', 'hss752', 'hss750', 'hss757', 'hss756', 'hss755', 'hss754', 'hss142', 'hss143', 'hss1056', 'hss1057']

intel-tvpn-nyc3-1: {}
retcode: 0
intel-tvpn-nyc3-3: {}
intel-tvpn-nyc3-2: {}
hss483: {}

Executing run on ['hss1054', 'hss1055', 'hss1052', 'hss1050', 'hss410']

hss486: {}
hss487: {}
hss480: {}
hss481: {}

Executing run on ['hss1058', 'hss1059', 'hss285', 'hss284']

hss332: {}
hss333: {}
hss239: {}
hss238: {}
hss230: {}
hss237: {}
hss457: {}
hss455: {}
hss454: {}
hss453: {}
hss802: {}
hss803: {}
hss527: {}
hss526: {}
hss525: {}
hss488: {}
hss522: {}
hss521: {}
hss520: {}
hss484: {}
hss485: {}
devops-testcenter-kaspersky-se-02: {}
vu-tv-au-4: {}
vu-tv-jp-9: {}
salt: {}
vu-tv-au-3: {}

Executing run on ['hss287', 'hss286', 'hss281', 'hss280', 'hss1119', 'hss282', 'hss1117', 'hss1116', 'hss1115', 'hss289', 'hss288', 'dev-mcafee1', 'hss115', 'hss114', 'hss116', 'hss111', 'hss110', 'hss113', 'hss119', 'hss118', 'devops-testcenter-kaspersky-ca-02', 'hss370', 'hss373', 'hss375', 'hss374', 'hss377']

hss143: {}
retcode: 0
hss142: {}
hss754: {}
hss756: {}
hss757: {}
hss750: {}
hss752: {}
hss753: {}
retcode: 0
intel-proxy-nyc3-1: {}
intel-proxy-nyc3-2: {}
intel-proxy-nyc3-3: {}
intel-proxy-nyc3-4: {}
hss759: {}
hss1057: {}
hss1056: {}

Executing run on ['hss376', 'hss379', 'hss378', 'hss278', 'devops-testcenter-kaspersky-dk-01', 'hss270', 'hss271', 'hss272', 'hss274', 'hss275', 'hss276', 'devops-testcenter-kaspersky-tr-01', 'devops-testcenter-kaspersky-tr-02', 'devops-testcenter-intel-04', 'devops-testcenter-intel-05', 'devops-testcenter-intel-06', 'devops-testcenter-intel-07']

hss410: {}
retcode: 0
retcode: 0
hss1050: {}
retcode: 0
hss1052: {}
hss1055: {}
hss1054: {}
retcode: 0
retcode: 0

Executing run on ['devops-testcenter-intel-01', 'devops-testcenter-intel-02', 'devops-testcenter-intel-03', 'hss349', 'devops-testcenter-intel-08', 'devops-testcenter-intel-09', 'hss304', 'hss305', 'hss306', 'hss307']

retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0

Executing run on ['hss300', 'hss302', 'intel-proxy-nyc3-5', 'hss308', 'hss309', 'hss1098']

hss1059: {}
hss1058: {}
hss284: {}
hss285: {}

Executing run on ['hss1099', 'intel-tvpn-sfo1-1', 'intel-tvpn-sfo1-2', 'hss1092']

retcode: 0
retcode: 0

Executing run on ['hss1093', 'hss1090']

hss377: {}
hss374: {}
hss375: {}
hss373: {}
hss370: {}
hss118: {}
hss119: {}
hss113: {}
hss110: {}
hss111: {}
hss116: {}
hss114: {}
hss115: {}
dev-mcafee1: {}
hss288: {}
hss289: {}
hss1115: {}
hss1116: {}
hss282: {}
hss1119: {}
hss280: {}
hss281: {}
hss286: {}
hss287: {}

Executing run on ['hss1091', 'hss1097', 'hss1094', 'hss409', 'devops-testcenter-kaspersky-fr-03', 'devops-testcenter-kaspersky-fr-02', 'devops-testcenter-kaspersky-fr-01', 'hss151', 'hss150', 'hss401', 'hss152', 'hss407', 'hss406', 'hss405', 'devops-testcenter-kaspersky-cz-05', 'devops-testcenter-kaspersky-cz-02', 'devops-testcenter-kaspersky-es-03', 'devops-testcenter-kaspersky-es-02', 'devops-testcenter-kaspersky-ie-06', 'devops-testcenter-kaspersky-ie-01', 'hss1122', 'hss530', 'hss297', 'hss129']

devops-testcenter-kaspersky-tr-02: {}
hss276: {}
hss275: {}
hss274: {}
retcode: 0
hss272: {}
hss271: {}
hss270: {}
hss278: {}
retcode: 0
devops-testcenter-kaspersky-tr-01: {}
hss378: {}
hss379: {}
hss376: {}
retcode: 0
retcode: 0

Executing run on ['hss292', 'hss122', 'hss290', 'devops-testcenter-kaspersky-hk-04', 'hss528', 'devops-testcenter-kaspersky-de-13', 'devops-testcenter-kaspersky-de-12', 'devops-testcenter-kaspersky-de-10', 'devops-testcenter-kaspersky-de-14', 'hss223', 'hss222', 'devops-testcenter-kaspersky-hk-03', 'hss224', 'hss229', 'hss228', 'hss511']

retcode: 0
retcode: 0
retcode: 0
retcode: 0
hss307: {}
hss306: {}
hss305: {}
hss304: {}

Executing run on ['hss107', 'hss513', 'hss515', 'hss489', 'hss523', 'hss517', 'hss516', 'hss766']

hss309: {}
hss308: {}
hss302: {}
hss300: {}
hss1098: {}
retcode: 0
intel-proxy-nyc3-5: {}

Executing run on ['hss767', 'hss764', 'hss765', 'hss762', 'hss763', 'hss760', 'hss761']

intel-tvpn-sfo1-2: {}
intel-tvpn-sfo1-1: {}

Executing run on ['hss108', 'hss446']

hss1090: {}
hss1093: {}

Executing run on ['hss445', 'hss444']

devops-testcenter-kaspersky-ie-01: {}
retcode: 0
devops-testcenter-kaspersky-es-02: {}
hss405: {}
hss129: {}
devops-testcenter-kaspersky-es-03: {}
hss297: {}
hss1122: {}
hss152: {}
hss1094: {}
hss406: {}
hss407: {}
hss1091: {}
hss401: {}
hss150: {}
hss151: {}
hss530: {}
devops-testcenter-kaspersky-fr-01: {}
devops-testcenter-kaspersky-fr-02: {}
devops-testcenter-kaspersky-fr-03: {}
hss409: {}

Executing run on ['hss443', 'do-kl-nl-amsterdam-2', 'hss768', 'hss440', 'devops-testcenter-kaspersky-sg-04', 'devops-testcenter-kaspersky-sg-01', 'hss141', 'devops-testcenter-kaspersky-sg-03', 'devops-testcenter-kaspersky-sg-02', 'hss1063', 'hss1062', 'hss1100', 'hss1101', 'do-tv-de-frankfurt-1', 'hss169', 'hss267', 'hss266', 'hss265', 'hss264', 'hss389', 'hss261']

retcode: 0
hss122: {}
retcode: 0
retcode: 0
hss228: {}
hss229: {}
retcode: 0
hss224: {}
hss290: {}
hss292: {}
hss222: {}
hss528: {}
devops-testcenter-kaspersky-de-10: {}
devops-testcenter-kaspersky-de-12: {}
retcode: 0
retcode: 0
hss223: {}
retcode: 0
hss511: {}

Executing run on ['hss260', 'hss384', 'hss386', 'hss387', 'hss380', 'hss381', 'hss269', 'hss383', 'hss79', 'hss74', 'hss303', 'hss313', 'hss312', 'hss310', 'hss315', 'hss314', 'hss319', 'devops-testcenter-intel-10', 'hss448']

hss523: {}
hss489: {}
retcode: 0
hss761: {}
hss760: {}
hss763: {}
hss762: {}
hss765: {}
hss764: {}
hss767: {}
hss516: {}
hss517: {}
hss515: {}
hss513: {}
hss107: {}

Executing run on ['hss723', 'hss726', 'hss724', 'hss769', 'shss144', 'hss437', 'hss439', 'hss95', 'hss331', 'hss403']

retcode: 0
hss108: {}
retcode: 0
hss446: {}
hss445: {}
hss440: {}
hss1101: {}
hss1100: {}
devops-testcenter-kaspersky-sg-01: {}
hss141: {}
hss266: {}
hss169: {}
hss261: {}
hss389: {}
hss267: {}
hss265: {}
hss1062: {}
hss1063: {}
hss448: {}
hss319: {}
hss303: {}
hss315: {}
hss310: {}
hss74: {}
hss312: {}
hss313: {}
hss383: {}
hss269: {}
hss381: {}
hss380: {}
hss387: {}
hss386: {}
hss384: {}
hss260: {}
hss314: {}
hss79: {}
hss439: {}
hss724: {}
hss726: {}
hss723: {}
hss403: {}
hss437: {}
hss331: {}
shss144: {}
hss95: {}
ahammond@salt:~$
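
The `grep -v` filtering used above can also be done with a small script that tallies each kind of result instead of hiding the successes. This is only a sketch, not part of Salt: the function name `classify_text_output` and its category labels are made up for illustration, and it assumes the `minion: value` line shape of `--out=text` seen in the output above.

```python
from collections import Counter


def classify_text_output(lines, expected="junk", default="NOPE"):
    """Tally per-minion results from `salt --out=text` output lines.

    Hypothetical helper: `expected` is the pillar value we hope to see,
    `default` is the fallback passed to pillar.get on the command line.
    """
    counts = Counter()
    empty = []  # minions that returned an empty dict
    for raw in lines:
        line = raw.strip()
        # Skip blanks, batch headers and the per-minion retcode lines.
        if not line or line.startswith(("Executing run on", "retcode:")):
            continue
        minion, sep, value = line.partition(": ")
        if not sep:
            continue
        if value == expected:
            counts["ok"] += 1
        elif value == "{}":
            counts["empty"] += 1
            empty.append(minion)
        elif value == default:
            counts["default"] += 1
        elif value.startswith("Minion did not return"):
            counts["no_return"] += 1
        else:
            counts["other"] += 1
    return counts, empty
```

Feeding it the captured output (for example `sys.stdin` when piping from `salt`) gives a count per category plus the list of minions that came back as `{}`, which makes it easier to compare runs.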

Versions Report

Master (We have applied patch https://github.com/saltstack/salt/pull/36024 to the master, but not any minions)

ahammond@salt:~$ sudo salt-call --local --versions-report
Salt Version:
           Salt: 2016.3.3

Dependency Versions:
           cffi: 1.5.2
       cherrypy: 2.3.0
       dateutil: 2.4.2
          gitdb: 0.6.4
      gitpython: 2.0.6
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 1.0.3
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.12 (default, Jul  1 2016, 15:12:24)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: 0.2.4
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: Ubuntu 16.04 xenial
        machine: x86_64
        release: 4.4.0-36-generic
         system: Linux
        version: Ubuntu 16.04 xenial

Minions

[root@hss4 ~]# salt-call --local --versions-report
Salt Version:
           Salt: 2016.3.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: 1.3.8
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: 1.4.3
       M2Crypto: 0.20.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.6.6 (r266:84292, Jun 18 2012, 14:18:47)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.5.0
           RAET: 0.6.3
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: centos 6.2 Final
        machine: x86_64
        release: 4.4.18-2.el6.x86_64
         system: Linux
        version: CentOS 6.2 Final

ahammond commented 8 years ago

@markahopper This one is a show-stopper for us.

ahammond commented 8 years ago

@white-hat

Ch3LL commented 8 years ago

@ahammond would you mind sharing the pillar that is causing this issue?

Also, is this behavior consistent across the same minions? Are your minions all the same version?

ahammond commented 8 years ago

Whoops. Sorry, updated description. Also, this pillar generally works (picking a minion at random):

ahammond@salt:~$ sudo salt hss4 pillar.get salt_map
hss4:
    ----------
    lookup:
        ----------
        junk:
            junk
ahammond@salt:~$ sudo salt hss4 pillar.get salt_map:lookup:junk
hss4:
    junk

Ch3LL commented 8 years ago

Okay, I've tried to set up a test case and I am not able to replicate this behavior. Maybe you can point me in the right direction as to whether my test case is wrong. Here is my test case:

5 minions:

[root@ip-10-63-90-123 ~]# salt-key
Accepted Keys:
ch3ll-master-salt
ch3ll-web1
ch3ll-web2
ch3ll-web3
ch3ll-web4

I'm wondering if 5 might not be enough. Do you see this behavior with a smaller group of minions? I can also try to add more if needed.

Using the pillar you posted in the example above and running pillar.get:

[root@ip-10-63-90-123 ~]# sudo salt --out=text -b 50 \* pillar.get salt_map:lookup:junk NOPE | grep -v junk

Executing run on ['ch3ll-web2', 'ch3ll-web3', 'ch3ll-web1', 'ch3ll-master-salt', 'ch3ll-web4']

retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0

As shown above, I never see anything similar to hss386: {}. In fact, when I completely remove the pillar from the minions and only target one in the pillar top file, I still do not see {}, as shown below:

[root@ip-10-63-90-123 ~]# cat /srv/pillar/top.sls                                                                     
base:
  'ch3ll-web3':
    - salt_map
[root@ip-10-63-90-123 ~]# sudo salt --out=text -b 50 \* pillar.get salt_map:lookup:junk NOPE | grep -v junk

Executing run on ['ch3ll-web2', 'ch3ll-web3', 'ch3ll-web1', 'ch3ll-master-salt', 'ch3ll-web4']

ch3ll-master-salt: NOPE
retcode: 0
retcode: 0
ch3ll-web2: NOPE
ch3ll-web1: NOPE
retcode: 0
retcode: 0
ch3ll-web4: NOPE
retcode: 0

Or with a smaller batch size:

[root@ip-10-63-90-123 ~]# sudo salt --out=text -b 2 \* pillar.get salt_map:lookup:junk NOPE | grep -v junk

Executing run on ['ch3ll-web2', 'ch3ll-web3']

retcode: 0
retcode: 0
ch3ll-web2: NOPE

Executing run on ['ch3ll-web1', 'ch3ll-master-salt']

ch3ll-master-salt: NOPE
retcode: 0
ch3ll-web1: NOPE
retcode: 0

Executing run on ['ch3ll-web4']

ch3ll-web4: NOPE
retcode: 0

Any information to help me replicate this?

ahammond commented 8 years ago

@Ch3LL So... the NOPE should never appear since every minion should be able to find salt_map:lookup:junk. Seems to me you've replicated at least that failure. Getting the empty dictionaries... now I don't know how to get that to replicate.

Ch3LL commented 8 years ago

Apologies, I don't think I was very clear on my test case. NOPE shows up in the test case where I only add the pillar to ch3ll-web3 and it is not applied to any of the other minions. I was trying to create an instance where I would see the empty {}, so this would not be a replication of the issue from what I can tell. From what I understand of your use case, the pillar should be applied to all minions.

When all the minions have that pillar applied, I see the following:

[root@ip-10-63-90-123 ~]# sudo salt --out=text -b 50 \* pillar.get salt_map:lookup:junk NOPE | grep -v junk

Executing run on ['ch3ll-web2', 'ch3ll-web3', 'ch3ll-web1', 'ch3ll-master-salt', 'ch3ll-web4']

retcode: 0
retcode: 0
retcode: 0
retcode: 0
retcode: 0

Do you have anything unusual in your pillar setup? Maybe gitfs or pillar caching, similar to this

Also, is there anything relevant in the debug logs on the master or minion side?

ahammond commented 8 years ago

This pillar comes from a git ext_pillar. We have other ext pillars. The code for the custom ext_pillars hasn't changed in over 4 months.

ahammond commented 8 years ago

Ok, following our switch back to 0mq, this doesn't appear to be replicating anymore.

Ch3LL commented 8 years ago

So it seems this is an issue with the tcp transport then? I will label it as a tcp transport issue. We will keep it open since this will still need to be fixed for users that are using the tcp transport.

ahammond commented 7 years ago

Nope, we're seeing it again using 0mq. I think this is more to do with batch mode's return than anything else. I think the tcp transport just made it more obvious. Here's some partial data from a batched run recently. I'll get a better example later today.

retcode: 0
hss392: Stopping Clam AntiVirus Daemon: [  OK  ]
hss374: Stopping Clam AntiVirus Daemon: [  OK  ]
retcode: 0
hss389: Stopping Clam AntiVirus Daemon: [  OK  ]
retcode: 0
retcode: 0
hss140: Stopping Clam AntiVirus Daemon: [FAILED]

Executing run on ['hss445', 'hss409', 'hss151', 'hss402', 'hss401', 'hss152']

retcode: 0
hss89: Stopping Clam AntiVirus Daemon: [  OK  ]

Executing run on ['hss407']

hss1094: {}
hss1097: {}
hss1091: {}
hss1090: {}
hss1093: {}
hss1092: {}
hss171: {}
hss1099: {}
hss1098: {}
hss319: {}
hss314: {}
hss315: {}
hss312: {}
hss494: {}
hss413: {}
hss395: {}
hss396: {}
hss397: {}
hss417: {}
hss414: {}
hss419: {}
hss398: {}
hss250: {}
hss169: {}
hss1083: {}
hss74: {}
hss1086: {}
hss1087: {}
hss1084: {}
hss1085: {}
hss1088: {}
hss1089: {}
hss79: {}
hss309: {}
hss308: {}
hss6: {}
hss303: {}
hss302: {}
hss307: {}
hss306: {}
hss305: {}
hss304: {}
hss383: {}
hss269: {}
hss381: {}
hss380: {}
hss387: {}
hss386: {}
hss176: {}
hss384: {}
hss178: {}
hss261: {}
hss262: {}
hss264: {}
hss265: {}
hss266: {}
hss466: {}
hss170: {}
hss465: {}
hss462: {}
hss463: {}
hss382: {}
hss172: {}
hss468: {}
hss469: {}
hss173: {}
hss175: {}
hss277: {}
hss276: {}
hss275: {}
hss274: {}
hss272: {}
hss271: {}
hss270: {}
hss260: {}
hss378: {}
hss379: {}
hss376: {}
hss377: {}
hss373: {}
hss1101: {}
hss1100: {}
hss849: {}
hss841: {}
hss412: {}
hss1118: {}
hss179: {}
hss393: {}
hss109: {}
hss1126: {}
hss448: {}

Executing run on ['hss406', 'hss405', 'hss148', 'hss481', 'hss480', 'hss487', 'hss520', 'hss521', 'hss523', 'hss769', 'hss803', 'hss802', 'hss437', 'hss13', 'hss12', 'hss11', 'hss439', 'hss1122', 'hss452', 'hss403', 'hss129', 'hss122', 'hss150', 'hss331', 'hss333', 'hss332']

hss817: {}
hss1117: {}
hss445: {}
hss152: {}
hss401: {}
hss402: {}
hss151: {}
hss409: {}
hss407: {}
hss332: {}
hss122: {}
hss331: {}
hss333: {}
hss129: {}
hss148: {}
hss439: {}
hss11: {}
hss12: {}
hss13: {}
hss452: {}
hss437: {}
hss1122: {}
hss802: {}
hss803: {}
hss523: {}
hss521: {}
hss520: {}
hss487: {}
hss480: {}
hss481: {}
hss405: {}
hss406: {}
hss150: {}
hss403: {}
hss769: {}
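
One way to look into the suspicion that this is tied to how batch mode prints returns is to measure how many batch headers pass between a minion being dispatched and its result line appearing. The sketch below does that for captured `--out=text` output. It is not part of Salt; the name `batch_lag` is hypothetical, and it assumes each `Executing run on [...]` line is a valid Python list literal, as in the output above. A large lag does not prove a cause on its own, it only shows which results were printed late.

```python
import ast

HEADER = "Executing run on "


def batch_lag(lines):
    """Map minion -> number of batch headers printed between its dispatch
    and the line showing its result (hypothetical helper)."""
    dispatched = {}  # minion -> index of the batch that dispatched it
    lag = {}
    current = 0
    for raw in lines:
        line = raw.strip()
        if line.startswith(HEADER):
            current += 1
            # The header carries a Python-style list of minion ids.
            for minion in ast.literal_eval(line[len(HEADER):]):
                dispatched[minion] = current
            continue
        minion, sep, _ = line.partition(": ")
        # Lines such as "retcode: 0" are ignored: "retcode" was never dispatched.
        if sep and minion in dispatched:
            lag[minion] = current - dispatched[minion]
    return lag
```

Minions whose dispatch header is missing from a partial capture are simply skipped, so the function can be run on excerpts like the one above.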

cachedout commented 7 years ago

While there are a few places where this could occur, the most common place where we'll see the client yield an empty dictionary like that is if it can't find the job in the master job cache. Do you ever see a warning message from the client that says something like jid does not exist or Returner unavailable?
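
To check for the warnings mentioned above, captured client output or a log file can be scanned for those two phrases. This is a rough sketch under stated assumptions: the exact message wording is taken from the comment above ("something like"), so the patterns may need adjusting, and the helper name `find_job_cache_warnings` is hypothetical.

```python
import re

# Phrases quoted in the comment above; the real messages may differ in wording.
JOB_CACHE_WARNINGS = re.compile(
    r"jid does not exist|Returner unavailable", re.IGNORECASE
)


def find_job_cache_warnings(lines):
    """Return the lines that look like job-cache lookup warnings."""
    return [line.rstrip("\n") for line in lines if JOB_CACHE_WARNINGS.search(line)]


# Example usage (the log path is an assumption about a default install):
#   with open("/var/log/salt/master") as handle:
#       for hit in find_job_cache_warnings(handle):
#           print(hit)
```

If any lines match during a batched run that also shows `{}` returns, that would support the job cache explanation. An empty result is not conclusive, since the message text here is only approximate.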

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.