saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0

Salt masters behind AWS ELB have flaky connection to minions #43368

Closed fzk-rec closed 5 years ago

fzk-rec commented 7 years ago

Description of Issue/Question

The connection from the salt-masters that run behind an AWS ELB to the salt-minions is flaky. Sometimes it works, most of the time it doesn't. I would like to know if there is some flaw in my setup that I am not seeing, or if Salt only works with something like HAProxy as a load balancer?

Or maybe Salt doesn't work behind an ELB at all?

Setup

I am running the following setup at AWS:

A salt-key -L on both masters yields the same result:

Accepted Keys:
WIN-AB3GO7BJ72I
WIN-EDMP9VB716B
Denied Keys:
Unaccepted Keys:
Rejected Keys:

So it looks like all is fine and everything should work. However, a test.ping is extremely flaky. Sometimes it works, but most of the time it doesn't. Most of the time neither master gets any return from the minion, and on the minion side I can see in the log that the minion never receives the message to execute 'test.ping' from the master.

Example 1: test.ping from Master1:

root@d7383ff8f8bf:/# salt 'WIN-EDMP9VB716B' test.ping
[ERROR   ] Exception raised when processing __virtual__ function for salt.loaded.int.cache.consul. Module will not be loaded: 'module' object has no attribute 'Consul'
[ERROR   ] An un-handled exception was caught by salt's global exception handler:
KeyError: 'redis.ls'
Traceback (most recent call last):
  File "/usr/bin/salt", line 10, in <module>
    salt_main()
  File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 476, in salt_main
    client.run()
  File "/usr/lib/python2.7/dist-packages/salt/cli/salt.py", line 173, in run
    for full_ret in cmd_func(**kwargs):
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 805, in cmd_cli
    **kwargs):
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 1597, in get_cli_event_returns
    connected_minions = salt.utils.minions.CkMinions(self.opts).connected_ids()
  File "/usr/lib/python2.7/dist-packages/salt/utils/minions.py", line 577, in connected_ids
    search = self.cache.ls('minions')
  File "/usr/lib/python2.7/dist-packages/salt/cache/__init__.py", line 244, in ls
    return self.modules[fun](bank, **self._kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/loader.py", line 1113, in __getitem__
    func = super(LazyLoader, self).__getitem__(item)
  File "/usr/lib/python2.7/dist-packages/salt/utils/lazy.py", line 101, in __getitem__
    raise KeyError(key)
KeyError: 'redis.ls'
Traceback (most recent call last):
  File "/usr/bin/salt", line 10, in <module>
    salt_main()
  File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 476, in salt_main
    client.run()
  File "/usr/lib/python2.7/dist-packages/salt/cli/salt.py", line 173, in run
    for full_ret in cmd_func(**kwargs):
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 805, in cmd_cli
    **kwargs):
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 1597, in get_cli_event_returns
    connected_minions = salt.utils.minions.CkMinions(self.opts).connected_ids()
  File "/usr/lib/python2.7/dist-packages/salt/utils/minions.py", line 577, in connected_ids
    search = self.cache.ls('minions')
  File "/usr/lib/python2.7/dist-packages/salt/cache/__init__.py", line 244, in ls
    return self.modules[fun](bank, **self._kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/loader.py", line 1113, in __getitem__
    func = super(LazyLoader, self).__getitem__(item)
  File "/usr/lib/python2.7/dist-packages/salt/utils/lazy.py", line 101, in __getitem__
    raise KeyError(key)
KeyError: 'redis.ls'

I am aware that the redis error will be fixed soon https://github.com/saltstack/salt/issues/43295

Example 2: test.ping from Master1, ~1 minute after Example 1:

root@d7383ff8f8bf:/# salt 'WIN-EDMP9VB716B' test.ping
WIN-EDMP9VB716B:
    True

Also during my tests, a test.ping from Master2 never succeeded.

Steps to Reproduce Issue

Versions Report

Salt Version:
           Salt: 2017.7.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: unknown
       dateutil: 2.4.2
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.12 (default, Nov 19 2016, 06:48:10)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: Ubuntu 16.04 xenial
         locale: ANSI_X3.4-1968
        machine: x86_64
        release: 4.9.43-17.38.amzn1.x86_64
         system: Linux
        version: Ubuntu 16.04 xenial
damon-atkins commented 7 years ago

What salt transport are you using? ZeroMQ? What is the Failover setting on the ELB?

fzk-rec commented 7 years ago

@damon-atkins I am using ZeroMQ.

What do you mean by the failover setting? I assume the ELB splits incoming traffic 50/50, or maybe based on CPU usage.

damon-atkins commented 7 years ago

Sorry, I should have said general settings, e.g. timeout, health check, etc.
Salt has two ports; I assume both connections from a single client would need to go to the same Salt master? I cannot tell you whether what you are trying will work or not. Have you read https://docs.saltstack.com/en/latest/topics/highavailability/index.html and https://docs.saltstack.com/en/latest/topics/tutorials/multimaster.html ?
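
For context, the two ports in question are the master's publish and return ports; the defaults in the master config look like this:

# /etc/salt/master (defaults)
publish_port: 4505   # master publishes jobs to minions over this port
ret_port: 4506       # minions send returns and fetch files over this port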

Ch3LL commented 7 years ago

Yes, as @damon-atkins pointed out, we already have the ability to load balance masters without a load balancer in front of them, via the docs he pointed you to.

I am not sure whether we support master failover the way you have it set up, in front of a load balancer, though. Ping @saltstack/team-core: do you know if this is currently possible within Salt?

fzk-rec commented 7 years ago

Thanks for your answers, guys. @damon-atkins: yes, I have read all the HA/multimaster docs I could find :) I think, however, you might be onto something. The AWS ELB is unable to provide session stickiness for TCP ports. So maybe you're right and the minions get sent to Master1 on port 4505 and to Master2 on port 4506. I guess that would cause problems.

@Ch3LL : The problem is that this view of 'load balancing' the masters is not sexy for cloud environments, since you always have to specify IPs or names, which can change in a cloud environment all the time. That is why we want to run the salt-masters behind an ELB, so that no minion ever needs to know the actual IP of a salt-master. We basically want to run 'Salt as a service' and provide the salt-api and the salt-masters to our internal services and consumers through one simple ELB DNS entry. This would give us massive benefits as we could:

I'm afraid that our setup won't work because of the missing session stickiness :( I assume there is no way to have the salt-masters share one ZeroMQ endpoint?

gtmanfred commented 7 years ago

I would be very surprised if the zeromq transport worked behind an ELB.

You might try the tcp transport layer, and see if that one works?

The better solution might be to use master_type: func, with a custom execution module that looks up your masters in the AWS API and returns a list of all the masters; the minion then opens an active-active connection to each one, instead of using failover.

https://docs.saltstack.com/en/latest/ref/configuration/minion.html#list-of-masters-syntax

Then you don't have to have the minions store the information statically, and you don't need a load balancer either.
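
To make that concrete, a minimal, untested sketch of such a discovery module could look like the following (the module name aws_masters, the Function tag, and the hard-coded region are assumptions; the minion would then be configured with master_type: func and master: aws_masters.get_masters):

# _modules/aws_masters.py -- hypothetical discovery module for master_type: func
import boto3

def get_masters():
    '''
    Return the private IPs of all running EC2 instances tagged
    Function=Saltmaster, so the minion opens an active-active
    connection to each of them.
    '''
    # Region name is an assumption; it could also be read from instance metadata.
    client = boto3.client('ec2', region_name='eu-west-1')
    response = client.describe_instances(
        Filters=[
            {'Name': 'tag:Function', 'Values': ['Saltmaster', 'saltmaster']},
            {'Name': 'instance-state-name', 'Values': ['running']},
        ]
    )
    masters = []
    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            if 'PrivateIpAddress' in instance:
                masters.append(instance['PrivateIpAddress'])
    return masters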

fzk-rec commented 7 years ago

@gtmanfred thanks for the tip about master_type: func! I didn't know about that, but I believe this could solve our problem. We could 'autodiscover' the salt-masters via an EC2 tag, simply add their IPs as a list to the minions, and have the salt-minion restart on a regular basis.

Just two questions:

I don't understand what you mean by 'try the tcp transport layer'. Where would I configure that? Is that a setting in the minion config?

gtmanfred commented 7 years ago

1) You would need to provide it to the minion in the minion's extension modules directory.

https://docs.saltstack.com/en/latest/ref/file_server/dynamic-modules.html

One problem is that you will need to sync this module before you can run the function as the startup master type, so you might need to have it in the file server for a masterless minion, call salt-call --local saltutil.sync_modules first, and then start the minion.
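
Roughly, the bootstrap order would be something like this (paths are assumptions for a Windows minion like the ones above, with the module sitting in the local file_roots _modules directory):

C:\salt\salt-call.bat --local saltutil.sync_modules
sc start salt-minion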

And, as a lesser-known feature, you could write a pip-installable module that registers salt.loader as part of its entry points, like this:

https://github.com/saltstack/salt/pull/31218

but pointing to module_dirs, and you should be able to pip install that on the minion. (There isn't good documentation on this, other than what is in the PR above.)

2) I believe that the module would have to handle errors itself, otherwise the minion would end up with an empty list without any masters.

And for the transport layer, you would need to configure both the master and the minion to use the tcp transport: https://docs.saltstack.com/en/latest/topics/transports/tcp.html
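
i.e. the same one-line setting in both config files:

# /etc/salt/master and C:\salt\conf\minion
transport: tcp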

fzk-rec commented 7 years ago

Cool beans! I will definitely look into that tomorrow. Thanks a lot!

Not sure if I want to use that experimental TCP feature :)

damon-atkins commented 7 years ago

I would suggest you will need scripts as part of installing Salt that work out where the master is and call salt-call --master=abc --local saltutil.sync_modules, as suggested by gtmanfred, before switching to master_type: func.

also look at https://docs.saltstack.com/en/latest/topics/cloud/aws.html

Update this issue when you work out the best solution.

fzk-rec commented 7 years ago

Ok, so I will have to work on some other topic and put this 'master behind ELB' on hold for a while, but this is the solution we will most likely use going forward:

I wrote a script that we will put on the salt-minion hosts and which will be triggered by a scheduled task like once per hour (maybe more frequently, we'll see).

from __future__ import print_function
import boto3
import urllib2
import logging
import yaml
import fileinput
import re

log = logging.getLogger('saltmaster-discovery')
hdlr = logging.FileHandler(r'C:\Deployment\saltmaster-discovery.log')
formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
hdlr.setFormatter(formatter)
log.addHandler(hdlr)
log.setLevel(logging.DEBUG)

# Get the region by stripping the trailing AZ letter from the availability zone
region = urllib2.urlopen('http://169.254.169.254/latest/meta-data/placement/availability-zone').read()[:-1]
client = boto3.client('ec2', region)

log.info('Initializing saltmaster discovery...')
try:
    response = client.describe_instances(
        Filters=[
            {
                'Name': 'tag-key',
                'Values': [
                    'Function',
                ]
            },
            {
                'Name': 'tag-value',
                'Values': [
                    'Saltmaster',
                    'saltmaster'
                ]
            }
        ]
    )
except Exception:
    log.error('AWS API call failed!')
    exit(1)

api_private_ips = []
for res in response["Reservations"]:
    for inst in res.get("Instances", []):
        try:
            api_private_ips.append(inst["PrivateIpAddress"])
        except KeyError:
            # Instance has no private IP (e.g. it is stopped or terminating); skip it
            continue

log.debug('List of private ips from the salt-masters: %s' % api_private_ips)

with open(r'C:\salt\conf\minion', 'r') as fp_:
    report = yaml.safe_load(fp_.read())
minion_private_ips = report['master']
log.debug('Current master value in the minion config: %s' % report['master'])

# Compare the items from the minion config with the list of available masters. If they are not equal, update the
# values in the minion config with the values from the API call and restart the minion

regex = re.compile('master: \[.*\]', re.IGNORECASE)
if set(minion_private_ips) == set(api_private_ips):
    log.info('The list of masters in the minion config is up to date')
else:
    log.info('The list of masters in the minion config will be updated')
    f = fileinput.FileInput(r'C:\salt\conf\minion', inplace=True, backup='.bak')
    for line in f:
        line = regex.sub('master: ' + str(api_private_ips), line)
        print(line, end='')
    f.close()
    # Trigger the salt-minion restart
    log.info('Restarting the salt-minion service')
    from subprocess import call
    call(r"C:\salt\salt-call.bat --local service.restart salt-minion", shell=True)

This script calls the AWS API to find all EC2 instances with the tag 'Function: saltmaster/Saltmaster'. Then it reads the salt-minion config (it searches for the line master: ['some ip', 'another ip']) and checks whether the entries match the list of private IPs that came back from the API call. If the lists don't match, we update the minion config with the list from the API call and then restart the minion.

The script is most certainly not perfect and not final (e.g. I want to implement a check to see if the minion is currently working on a salt job before restarting the service), but I just wanted to share what I've got so far.
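
For that job check, a rough sketch (assuming salt-call's JSON output, and that an empty saltutil.running result means the minion is idle; the helper name is my own) could be something like:

# Hypothetical helper: only restart the service when saltutil.running reports no jobs
import json
from subprocess import check_output, call

def minion_is_busy():
    out = check_output(
        r"C:\salt\salt-call.bat --local --out=json saltutil.running",
        shell=True,
    )
    # salt-call wraps local results under the 'local' key
    return bool(json.loads(out).get('local'))

if not minion_is_busy():
    call(r"C:\salt\salt-call.bat --local service.restart salt-minion", shell=True)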

Additional info: Amazon just introduced the Network Load Balancer. I tried it out quickly, but couldn't get it to work :(

gtmanfred commented 7 years ago

This looks awesome, thanks for sharing!

Daniel

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

ghost commented 4 years ago

Hello team,

I hope everyone is doing great.

I came across this thread while researching ways of implementing the very same architecture in our environment. Has this been addressed?

Our problem statement is: because our minions are actually our product customers' instances, which live in different networks/VPCs, we need the salt masters to be publicly reachable, and VPC peering doesn't work here due to its connection limits. That said, for security purposes we need to find an alternative that keeps the salt masters private but still reachable from the minions. Two things come to mind: a) iptables (not scalable); and b) AWS PrivateLink (which requires an NLB in front of the salt masters).

Since this was not available back in 2017, I thought of having a syndic master of masters behind the NLB alongside two other masters, but I am not sure if this would be possible, or if load balancing is even possible with a multimaster architecture. Is anyone able to shed some light here?

I really appreciate any help.

Thanks, Gabriel

max-arnold commented 9 months ago

Check out the new master cluster feature that is part of 3007rc1: https://github.com/saltstack/salt/blob/master/doc/topics/tutorials/master-cluster.rst

For more details see https://github.com/saltstack/salt-enhancement-proposals/blob/1433501a1417f78c895345a675e21a8b6382bb61/0000-master-cluster.md
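
For anyone landing here later, the per-master config from that tutorial is roughly the following (option names and paths should be double-checked against the 3007 docs; the IPs and the shared PKI path are placeholders):

# /etc/salt/master on each cluster node
cluster_id: my_cluster
cluster_peers:
  - 10.0.0.2
  - 10.0.0.3
cluster_pki_dir: /shared/pki    # PKI dir on a shared filesystem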