sqs_events: SQS messages processed before reactor engine is ready

cyrilgdn commented 6 years ago

Description of Issue/Question

We are using the sqs_events engine but we are losing messages. When Salt is starting, sqs_events engine is ready and starts to send events before the reactor engines is properly started.

I also test to sleep a bit at the beginning of the start function of the sqs_events engine. With this, messages are not lost at startup anymore but we are still losing messages when restarting so it seems that when Salt is stopping, events may be sent when reactors are already stopped.

Test setup

For testing purpose, I've just created a reactor which creates empty files with filenames retrieved for the SQS queue and I'm sending messages in the queue manually.

Master configuration (relevant part):

log_level: debug
engines:
- sqs_events:
    queue: salt-queue
sqs:
  message_format: json
sqs.region: eu-central-1
reactor:
- salt/engine/sqs:
  - salt://test_sqs_events.sls

states/test_sqs_events.sls

#!py
# vim:ft=python

import json
import os

def run():
    if not os.path.isdir("/tmp/test_salt"):
        os.mkdir("/tmp/test_salt")

    message = json.loads(data['message'])
    open("/tmp/test_salt/{}".format(message["filename"]), 'a').close()

    return {}

Steps to Reproduce Issue

The first test is to stop salt-master, then to run

for i in $(seq 10); do aws sqs send-message --queue-url https://eu-central-1.queue.amazonaws.com/ACCOUNT_ID/salt-queue --message-body "{\"filename\": $i}"; done

Then to start salt-master

No messages will be processed by reactors:

$ ls /tmp/test_salt
ls: cannot access '/tmp/test_salt': No such file or directory

Logs (relevant part)

[...]
2018-08-29 15:55:03,584 [salt.utils.lazy  :100 ][DEBUG   ][2694] LazyLoaded sqs_events.start
2018-08-29 15:55:03,584 [salt.engines     :53  ][INFO    ][2694] Starting Engine salt.engines.Engine(salt.loaded.int.engines.sqs_events)
2018-08-29 15:55:03,586 [salt.utils.process:418 ][DEBUG   ][2694] Started 'salt.engines.Engine(salt.loaded.int.engines.sqs_events)' with pid 2709
2018-08-29 15:55:03,589 [salt.utils.lazy  :100 ][DEBUG   ][2694] LazyLoaded reactor.start
2018-08-29 15:55:03,590 [salt.engines     :53  ][INFO    ][2694] Starting Engine salt.engines.Engine(salt.loaded.int.engines.reactor)
2018-08-29 15:55:03,592 [salt.utils.process:418 ][DEBUG   ][2694] Started 'salt.engines.Engine(salt.loaded.int.engines.reactor)' with pid 2710
2018-08-29 15:55:03,598 [salt.utils.lazy  :100 ][DEBUG   ][2709] LazyLoaded sqs_events.start
2018-08-29 15:55:03,600 [salt.utils.event :321 ][DEBUG   ][2709] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
2018-08-29 15:55:03,600 [salt.utils.event :324 ][DEBUG   ][2709] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
2018-08-29 15:55:03,631 [boto             :383 ][DEBUG   ][2709] Retrieving credentials from metadata server.
2018-08-29 15:55:03,621 [salt.utils.lazy  :100 ][DEBUG   ][2710] LazyLoaded reactor.start
2018-08-29 15:55:03,638 [salt.config      :2117][DEBUG   ][2710] Reading configuration from /etc/salt/master
2018-08-29 15:55:03,646 [boto             :404 ][DEBUG   ][2709] Retrieved credentials will expire in 5:06:27.353955 at: 2018-08-29T21:01:31Z
2018-08-29 15:55:03,724 [boto             :897 ][DEBUG   ][2709] Method: GET
2018-08-29 15:55:03,724 [boto             :898 ][DEBUG   ][2709] Path: /ACCOUNT_ID/salt-queue
2018-08-29 15:55:03,725 [boto             :899 ][DEBUG   ][2709] Data:
2018-08-29 15:55:03,726 [boto             :900 ][DEBUG   ][2709] Headers: {}
2018-08-29 15:55:03,726 [boto             :901 ][DEBUG   ][2709] Host: eu-central-1.queue.amazonaws.com
2018-08-29 15:55:03,727 [boto             :902 ][DEBUG   ][2709] Port: 443
2018-08-29 15:55:03,750 [boto             :1172][DEBUG   ][2709] <?xml version="1.0"?><ReceiveMessageResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><ReceiveMessageResult><Message><MessageId>9b3789ec-cba0-4e5e-a3e3-6ca0d4ceda13</MessageId><ReceiptHandle>...</ReceiptHandle><MD5OfBody>aea163445e8c67e6d22ded8aa8dab526</MD5OfBody><Body>{&quot;filename&quot;: 6}</Body></Message></ReceiveMessageResult><ResponseMetadata><RequestId>7b63582f-dbc4-5af7-be4d-0c2f1fd39f06</RequestId></ResponseMetadata></ReceiveMessageResponse>
2018-08-29 15:55:03,751 [salt.transport.ipc:254 ][DEBUG   ][2709] Initializing new IPCClient for path: /var/run/salt/master/master_event_pull.ipc
2018-08-29 15:55:03,763 [salt.utils.event :739 ][DEBUG   ][2709] Sending event: tag = salt/engine/sqs; data = {u'message': u'{"filename": 6}', u'_stamp': '2018-08-29T15:55:03.763372'}
[Idem with all messages]
2018-08-29 15:55:03,961 [salt.config      :2117][DEBUG   ][2710] Reading configuration from /etc/salt/master
[Reading all configuration files]
2018-08-29 15:55:06,083 [salt.loaded.int.grains.core:598 ][DEBUG   ][2710] Please install 'virt-what' to improve results of the 'virtual' grain.
2018-08-29 15:55:06,281 [salt.loaded.int.grains.extra:70  ][DEBUG   ][2710] Loading static grains from /etc/salt/grains
2018-08-29 15:55:06,497 [salt.utils.lazy  :100 ][DEBUG   ][2710] LazyLoaded jinja.render
2018-08-29 15:55:06,502 [salt.utils.lazy  :100 ][DEBUG   ][2710] LazyLoaded yaml.render
2018-08-29 15:55:06,504 [salt.utils.event :321 ][DEBUG   ][2710] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
2018-08-29 15:55:06,506 [salt.utils.event :324 ][DEBUG   ][2710] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
2018-08-29 15:55:06,506 [salt.transport.ipc:254 ][DEBUG   ][2710] Initializing new IPCClient for path: /var/run/salt/master/master_event_pub.ipc
2018-08-29 15:55:12,693 [salt.utils.reactor:109 ][DEBUG   ][2710] Gathering reactors for tag salt/auth
2018-08-29 15:55:22,707 [salt.utils.reactor:109 ][DEBUG   ][2710] Gathering reactors for tag salt/auth
[...]

Restart

If we test to restart salt-master during message sending:

for i in $(seq 20); do aws sqs send-message --queue-url https://eu-central-1.queue.amazonaws.com/ACCOUNT_ID/salt-queue --message-body "{\"filename\": $i}"; done

Then just after messages started to be sent:

systemctl restart salt-master

We can see that some messages have been lost:

$ ls -1 /tmp/test_salt/ | sort -n
1
2
3
15
16
17
18
19
20

Versions Report

(also tested on Salt 2016.11.X)

Salt Version:
           Salt: 2018.3.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.5.3
      docker-py: Not Installed
          gitdb: 2.0.0
      gitpython: 2.1.1
          ioflo: Not Installed
         Jinja2: 2.9.4
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: 1.3.7
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.13 (default, Nov 24 2017, 17:33:09)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: 2.0.1
        timelib: Not Installed
        Tornado: 4.4.3
            ZMQ: 4.2.1

System Versions:
           dist: debian 9.5 
         locale: UTF-8
        machine: x86_64
        release: 4.9.0-8-amd64
         system: Linux
        version: debian 9.5

Ch3LL commented 6 years ago

ping @saltstack/team-core any ideas on how to solve the issue where hte engine and events start sending before the reactor is up

gtmanfred commented 6 years ago

We would need to add the ability to order the starting of engines. To make sure the reactor engine is up before the sqs engine.

srinivassundar commented 5 years ago

hi team, any update on this issue ? i am facing the same problem. root@salt:~# salt --versions-report Salt Version: Salt: 2018.3.3

Dependency Versions: cffi: 1.11.5 cherrypy: unknown dateutil: 2.7.5 docker-py: Not Installed gitdb: 2.0.3 gitpython: 2.1.8 ioflo: Not Installed Jinja2: 2.10 libgit2: Not Installed libnacl: Not Installed M2Crypto: Not Installed Mako: 1.0.7 msgpack-pure: Not Installed msgpack-python: 0.5.6 mysql-python: Not Installed pycparser: 2.19 pycrypto: 2.6.1 pycryptodome: 3.7.2 pygit2: Not Installed Python: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) python-gnupg: 0.4.1 PyYAML: 3.12 PyZMQ: 16.0.2 RAET: Not Installed smmap: 2.0.3 timelib: Not Installed Tornado: 4.5.3 ZMQ: 4.2.5

System Versions: dist: Ubuntu 18.04 bionic locale: UTF-8 machine: x86_64 release: 4.15.0-1021-aws system: Linux version: Ubuntu 18.04 bionic

gtmanfred commented 5 years ago

afaik there is no work currently being done allowing ordering of engine starts.

But you could try directly specifying the reactor in the list of engines.

log_level: debug
engines:
  - reactor: {}
  - sqs_events:
    queue: salt-queue
sqs:
  message_format: json
sqs.region: eu-central-1
reactor:
- salt/engine/sqs:
  - salt://test_sqs_events.sls

And see if explicitly putting the reactor engine before the sqs events engine causes the reactor to be started first.

The way that the reactor is started now is if you have a reactor key in the master config, it adds the reactor engine to the list of engines, and starts them in order.

srinivassundar commented 5 years ago

@gtmanfred ordering the engine starts doesn't seem to help. is there any other solution?

thanks Srinivas

gtmanfred commented 5 years ago

nope, it would require a change to allow for ordering of the engines to start, and something to signify that an engine has finished starting before the next one is started.

and that would be a feature request if @saltstack/team-triage will tag it.

thanks

srinivassundar commented 5 years ago

thanks, Daniel,

it would be great if @saltstack/team-triage https://github.com/orgs/saltstack/teams/team-triage can tag it and provide a solution/fix at the earliest.

garethgreenaway commented 5 years ago

Since this is asking for new functionality, the ability to specify some dependencies in what order engines load, which doesn't currently exist I'll go ahead and label it as a feature request and approve it for a future feature addition.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

mchugh19 commented 4 years ago

Still valid

stale[bot] commented 4 years ago

Thank you for updating this issue. It is no longer marked as stale.

saltstack / salt