saltstack / salt


All of the returners (excluding mysql and local_cache) are not handling the "req" job id correctly #51809

Open arizvisa opened 5 years ago

arizvisa commented 5 years ago

Description of Issue/Question

Hiya. I'm the author of PR #51363 (etcd returner), and I noticed that the mysql returner specially handles the case where the job id is set to "req". This job id seems to be submitted by a minion when it is activated independently of a command from the master. (Relevant code: https://github.com/saltstack/salt/blob/develop/salt/returners/mysql.py#L294)

The returner handles this magic job-id by replacing it, whenever it is seen, with a new job id generated from returner.prep_jid. However, it appears that only the mysql and local_cache returners handle this job id explicitly; everything else does not tamper with it at all. Which returner is handling it correctly?
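
For reference, the handling in mysql.py boils down to roughly this pattern (a paraphrased sketch of the linked code, not the verbatim source; prep_jid and save_load are the returner module's own functions):

    def returner(ret):
        '''
        Store a job return in the backing database.
        '''
        # A standalone (minion-initiated) job arrives with the
        # placeholder jid 'req'; swap it for a freshly generated jid
        # before storing, so concurrent standalone jobs don't all
        # collide on the same key.
        if ret['jid'] == 'req':
            ret['jid'] = prep_jid(nocache=ret.get('nocache', False))
            save_load(ret['jid'], ret)
        # ... write ret to the database under ret['jid'] ...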

Is this something that a returner is actually not supposed to receive, i.e. the result of a bug or misconfiguration? Most of the logic that touches this magic job-id immediately replaces it with a call to prep_jid; however, there are a few cases where this logic is missed, which results in a returner receiving this 'req' job...

Setup

To view the job-id in use, set startup_states in a minion config to execute some state that takes a while (so that you can race against the state being applied), as sketched below. Configure any of the non-local returners so that you can manually inspect the job being created.
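
For example, a minimal minion config sketch (slow_state is a placeholder name for any SLS that runs long enough to observe):

    # /etc/salt/minion: run the listed SLS files when the minion starts
    startup_states: sls
    sls_list:
      - slow_state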

Steps to Reproduce Issue

While the state is being applied, view the active jobs with salt-run jobs.active; you'll see that the job for the state started by the minion has the job-id "req".

Versions Report

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: 1.11.5
       cherrypy: Not Installed
       dateutil: 2.8.0
      docker-py: Not Installed
          gitdb: 2.0.3
      gitpython: 2.1.11
          ioflo: 1.7.5
         Jinja2: 2.10
        libgit2: 0.27.8
        libnacl: 1.6.1
       M2Crypto: Not Installed
           Mako: 1.0.7
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: 3.7.0
         pygit2: 0.27.4
         Python: 2.7.15 (default, Oct 15 2018, 15:26:09)
   python-gnupg: Not Installed
         PyYAML: 4.2
          PyZMQ: 17.0.0
           RAET: 0.6.8
          smmap: 2.0.3
        timelib: Not Installed
        Tornado: 5.0.2
            ZMQ: 4.1.6

System Versions:
           dist: fedora 29 Twenty Nine
         locale: ANSI_X3.4-1968
        machine: x86_64
        release: 4.14.96-coreos
         system: Linux
        version: Fedora 29 Twenty Nine
arizvisa commented 5 years ago

It looks like returners might get req as the job instead of a real job id because of the following code. But again, I'm really unfamiliar with this beastly code-base, so I'm not 100% sure:

https://github.com/saltstack/salt/blob/develop/salt/daemons/masterapi.py#L780

    def _return(self, load):
        '''
        Handle the return data sent from the minions
        '''
...
        if load['jid'] == 'req':
            # The minion is returning a standalone job, request a jobid
            prep_fstr = '{0}.prep_jid'.format(self.opts['master_job_cache'])
            load['jid'] = self.mminion.returners[prep_fstr](nocache=load.get('nocache', False))

            # save the load, since we don't have it
            saveload_fstr = '{0}.save_load'.format(self.opts['master_job_cache'])
            self.mminion.returners[saveload_fstr](load['jid'], load)
...
        if not self.opts['job_cache'] or self.opts.get('ext_job_cache'):    # XXX: is this right?
            return

        fstr = '{0}.update_endtime'.format(self.opts['master_job_cache'])
        if (self.opts.get('job_cache_store_endtime')
                and fstr in self.mminion.returners):
            self.mminion.returners[fstr](load['jid'], endtime)

        fstr = '{0}.returner'.format(self.opts['master_job_cache'])
        self.mminion.returners[fstr](load)

You can see the conditional at XXX, where it checks whether the caches don't exist. However, it looks like the test for both job_cache and ext_job_cache was intended to be wrapped in parentheses, because as written this line says: if there is _not_ a job cache, or there _is_ an external job cache, skip the following code, which calls the returner() function and uses the master_job_cache.

Is this intentional, or was it really supposed to be the following, which would result in the returner still being called when an external job cache is configured?

    if not (self.opts['job_cache'] or self.opts.get('ext_job_cache')):
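
To make the difference concrete, here is a plain-Python illustration with stand-in values (job_cache enabled and an external job cache configured, as a real master config might have):

    # not A or B   -> true whenever ext_job_cache is set, so the
    #                 master_job_cache returner below gets skipped
    # not (A or B) -> true only when neither cache is configured
    job_cache, ext_job_cache = True, 'etcd'
    print(not job_cache or bool(ext_job_cache))    # True  -> skipped
    print(not (job_cache or bool(ext_job_cache)))  # False -> returner runs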
Ch3LL commented 5 years ago

it looks like adding req as the jid was on purpose. See this PR: https://github.com/saltstack/salt/pull/16392

and per this comment: https://github.com/saltstack/salt/pull/16392#issuecomment-58903622, it looks like we need to add that code handling the req return to all the returners.

arizvisa commented 5 years ago

Ok. Because yeah, if that explicit check doesn't exist, the load data ends up being overwritten, since all jids spawned from a minion are 'req'. I'll update my PR then.

Thanks!

arizvisa commented 5 years ago

Ftr, that unfortunately means that all the returners (other than mysql and local_cache) are busted. Maybe not totally busted, actually, since there's no error (just silence), but essentially any jobs spawned by minions will cease to exist...

Should I create a new issue pointing that out, or perhaps re-open this and change the title?

Ch3LL commented 5 years ago

lets re-title this and track here :)

arizvisa commented 5 years ago

Okay, cool. Re-titled it.

As another question, is there some way to see the 'req' jobs that are currently running in a salt environment? Although returners should re-map 'req' to a new job id, I've been unable to figure out how to view that new job id without actually watching the event queue. If we go back to the prior logic where 'req' is not re-mapped, then you can view the job... but if more than one minion makes a 'req', the previous results are overwritten.

arizvisa commented 5 years ago

@Ch3LL, do you mind re-tagging this since it isn't a question anymore?

arizvisa commented 5 years ago

Ftr, PR #51363 fixes this issue just for the etcd returner. So once/if that gets merged, there's one less returner to fix.

Ch3LL commented 5 years ago

> if we go to the prior logic where 'req' is not re-mapped, then you can view the job

How were you previously viewing it when it was working?

arizvisa commented 5 years ago

By using 'req' as the job id. However anytime another client started another job, it would overwrite the previous job data for 'req'.

Ch3LL commented 5 years ago

when you say overwrite, do you mean it's overwriting the req job in etcd, where the job returns are stored?

arizvisa commented 5 years ago

yes. it's overwriting because the job-id is 'req', so in any returner that lets you use a non-integral job id, the entry will be overwritten if the job id is being used as a unique key. offhand i think cassandra does this, but there are probably others.

arizvisa commented 5 years ago

what might work as a workaround would be to introduce minion-specific job ids. That way job ids can be unique, which matches the assumption the sql-based returners make when creating an index on the job id.

Something like req-${MINION_ID}-${TIMESTAMP} or similar, since it seems that once a job id hits the returner, salt doesn't care about it anymore (not sure about this). But this way the job id would still be searchable via returner.list_jobs (and friends); see the sketch below.
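
A minimal sketch of that scheme (make_standalone_jid is a hypothetical helper; the timestamp format just mirrors the one Salt's jid generator uses):

    from datetime import datetime

    def make_standalone_jid(minion_id):
        # Hypothetical helper for the proposed req-${MINION_ID}-${TIMESTAMP}
        # scheme: unique per minion and per instant, so concurrent
        # standalone jobs no longer collide on the single key 'req'.
        return 'req-{0}-{1}'.format(
            minion_id, datetime.now().strftime('%Y%m%d%H%M%S%f'))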

This'll definitely need some discussion though, as I'm really only familiar with the returner interfaces from my work on etcd. Pretty much the most complete returners are the sql-based ones, the local_cache one, the write-only ones (email-based and such), and the etcd one (afaict, and whenever it gets merged).

Ch3LL commented 5 years ago

thanks for clarifying. wanted to make sure i was following correctly.

ping @saltstack/team-core any ideas on how we could approach this?

waynew commented 5 years ago

Apparently I need to dig into returner code, because it's a place I'm not familiar with at all right now 😛

arizvisa commented 5 years ago

@waynew, you may want to check my PR for etcd support in the returners. I'd be willing to bet that it's the most complete and documented of the returner implementations.

https://github.com/saltstack/salt/pull/51363

Here's a link to the specific file in my submitted branch: https://github.com/arizvisa/saltstack-salt/blob/etcd-returner-is-busted/salt/returners/etcd_return.py

arizvisa commented 5 years ago

If you use the etcd returner PR, it's very easy to monitor what a client is doing, as etcd makes it simple to watch its key space via its client tooling. You can monitor "/$jid/$id/$num" (or similar; I haven't messed with this in a long time), where your jid is "req" or a result from "prep_jid", and your "num" is an incrementing number for the state the client is processing.
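
For instance, a watch loop with python-etcd (the client library Salt's etcd v2 support wraps) could look like this; the /salt/jobs prefix is an assumption and should match your configured returner root:

    import etcd  # python-etcd

    client = etcd.Client(host='127.0.0.1', port=2379)
    # Recursively watch the returner's key space and print each job
    # return as it lands; keys look like <root>/<jid>/<minion-id>/...
    for event in client.eternal_watch('/salt/jobs', recursive=True):
        print(event.key, event.value)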

Otherwise you'll need to use state.event to watch what's happening as jobs get returned, and keep an eye out for "$jid", which can be "req" or a real jid, depending on the issue I mentioned.

Or you can also just plant logs in whichever (non-local) returner you're using.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

arizvisa commented 4 years ago

This is referenced by #52775 and seems to be the only mention of the special handling of the "req" jid by the returners. Each returner has to explicitly handle this job, so this should probably be left open.

Relevant documentation for how the etcd2 returner deals with this special case can be found in the docs for the salt.returners.etcd_return module merged in PR #51363.

stale[bot] commented 4 years ago

Thank you for updating this issue. It is no longer marked as stale.