saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0
14.09k stars 5.47k forks source link

Schedule -> Orchestrate results in traceback #20065

Closed iggy closed 9 years ago

iggy commented 9 years ago

We have some orchestrate jobs that are triggered by Salt's scheduler. They previously worked on 2014.1. We are now testing 2015.2 and they are failing with some interesting tracebacks that I think are related.

2015-01-26 17:13:03,288 [salt.utils.schedule                        ][ERROR   ][14175] Unhandled exception running state.orchestrate
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/utils/schedule.py", line 502, in handle_func
    ret['return'] = self.functions[func](*args)
  File "/usr/lib/python2.7/dist-packages/salt/runners/state.py", line 116, in orchestrate
    __jid_event__.fire_event({'data': ret, 'outputter': 'highstate'}, 'progress')
NameError: global name '__jid_event__' is not defined
2015-01-26 17:13:03,288 [salt.utils.schedule                        ][ERROR   ][14172] Unhandled exception running state.orchestrate
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/utils/schedule.py", line 502, in handle_func
    ret['return'] = self.functions[func](*args)
  File "/usr/lib/python2.7/dist-packages/salt/runners/state.py", line 116, in orchestrate
    __jid_event__.fire_event({'data': ret, 'outputter': 'highstate'}, 'progress')
NameError: global name '__jid_event__' is not defined

and

2015-01-26 19:27:36,701 [salt.utils.schedule                        ][ERROR   ][20041] Unhandled exception running state.orchestrate
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/utils/schedule.py", line 502, in handle_func
    ret['return'] = self.functions[func](*args)
  File "/usr/lib/python2.7/dist-packages/salt/runners/state.py", line 114, in orchestrate
    pillar=pillar)
  File "/usr/lib/python2.7/dist-packages/salt/modules/state.py", line 593, in sls
    opts = _get_opts(kwargs.get('localconfig'))
  File "/usr/lib/python2.7/dist-packages/salt/modules/state.py", line 152, in _get_opts
    opts = copy.deepcopy(__opts__)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python2.7/copy.py", line 182, in deepcopy
    rv = reductor(2)
  File "/usr/lib/python2.7/multiprocessing/sharedctypes.py", line 213, in __reduce__
    assert_spawning(self)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 51, in assert_spawning
    ' through inheritance' % type(self).__name__
RuntimeError: SynchronizedString objects should only be shared between processes through inheritance
bjackson@dev-salt01:~$ salt-master --versions
           Salt: 2015.2.0-1345-gd7a0405
         Python: 2.7.3 (default, Mar 13 2014, 11:03:55)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: Not Installed
       pycrypto: 2.6
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 13.1.0
           RAET: Not Installed
            ZMQ: 3.2.3
           Mako: Not Installed

Let me know if you need any more info.

basepi commented 9 years ago

Likely related to the runner rewrite currently in progress by @cachedout and @jacksontj. We just need to make sure orchestrate is back to working after that is finished.

iggy commented 9 years ago

/etc/salt/master.d/63-scheduler.conf

schedule:
  orchestrate-db-backups:
    function: state.orchestrate
    args:
      - backups.db.orchestrate
    hours: 2

/srv/salt/backups/db/orchestrate

coalesce_db:
  salt.function:
    - name: postgres.psql_query
    - tgt: 'tags:db'
    - tgt_type: grain
    - arg:
      - "select pg_start_backup('foo');"

db_snapshot:
  salt.function:
    - name: cmd.run
    - tgt: 'tags:master'
    - tgt_type: grain
    - arg:
      - /deploy/cloud.py snapshot --tags db -- dev

unfreeze_db:
  salt.function:
    - name: postgres.psql_query
    - tgt: 'tags:db'
    - tgt_type: grain
    - arg:
      - "select pg_stop_backup();"

Some command output that basepi asked for:

sudo salt-run state.orch backups.db.orchestrate  
[ERROR   ] {'ret': {'dev-arcee01.c.oncenter-XXX': []}, 'out': 'highstate'}
[ERROR   ] {'ret': {'dev-arcee01.c.oncenter-XXX': []}, 'out': 'highstate'}
dev-salt01.c.oncenter-XXX:
----------
          ID: coalesce_db
    Function: salt.function
        Name: postgres.psql_query
      Result: False
     Comment: Running function postgres.psql_query failed on minions: dev-arcee01.c.oncenter-XXX Function postgres.psql_query ran on dev-arcee01.c.oncenter-XXX.
     Started: 21:42:31.215361
    Duration: 578.537 ms
     Changes:   
              dev-arcee01.c.oncenter-XXX:
                  Data failed to compile:
----------
          ID: db_snapshot
    Function: salt.function
        Name: cmd.run
      Result: True
     Comment: Function ran successfully. Function cmd.run ran on dev-salt01.c.oncenter-XXX.
     Started: 21:42:31.794156
    Duration: 26485.923 ms
     Changes:   
              dev-salt01.c.oncenter-XXX:
                  Namespace(auth_host_name='localhost', auth_host_port=[8080, 8090], disks=None, env='dev', func=<function snapshot at 0x2a16230>, logging_level='ERROR', noauth_local_webserver=False, pretend=False, tags=['db'])
              ['db']
              list: XXX us-central1-a name eq .*-db.*
              {u'items': [{u'status': u'READY', u'sizeGb': u'50', u'kind': u'compute#disk', u'name': u'dev-arcee01--db-d2', u'zone': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a', u'type': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a/diskTypes/pd-standard', u'creationTimestamp': u'2015-01-23T09:25:27.659-08:00', u'id': u'17231275283433604350', u'selfLink': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a/disks/dev-arcee01--db-d2'}], u'kind': u'compute#diskList', u'id': u'projects/XXX/zones/us-central1-a/disks', u'selfLink': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a/disks'}
              [u'dev-arcee01--db-d2']
              Snapshotting dev-arcee01--db-d2
              {u'status': u'DONE', u'kind': u'compute#operation', u'name': u'operation-1422308552572-50d9504097c60-58b4ed6e-d672913c', u'zone': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a', u'startTime': u'2015-01-26T13:42:33.026-08:00', u'insertTime': u'2015-01-26T13:42:32.672-08:00', u'targetId': u'17231275283433604350', u'targetLink': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a/disks/dev-arcee01--db-d2', u'operationType': u'createSnapshot', u'progress': 100, u'endTime': u'2015-01-26T13:42:56.419-08:00', u'id': u'1864162870866048885', u'selfLink': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a/operations/operation-1422308552572-50d9504097c60-58b4ed6e-d672913c', u'user': u'XXX'}
----------
          ID: unfreeze_db
    Function: salt.function
        Name: postgres.psql_query
      Result: False
     Comment: Running function postgres.psql_query failed on minions: dev-arcee01.c.oncenter-XXX Function postgres.psql_query ran on dev-arcee01.c.oncenter-XXX.
     Started: 21:42:58.280271
    Duration: 159.585 ms
     Changes:   
              dev-arcee01.c.oncenter-XXX:
                  Data failed to compile:

Summary
------------
Succeeded: 1 (changed=1)
Failed:    2
------------
Total states run:     3
[ERROR   ] Invalid outputter  specified, fall back to nested
dev-salt01.c.oncenter-XXX:
    ----------
    salt_|-coalesce_db_|-postgres.psql_query_|-function:
        ----------
        __run_num__:
            0
        changes:
            ----------
            out:
                highstate
            ret:
                ----------
                dev-arcee01.c.oncenter-XXX:
        comment:
            Running function postgres.psql_query failed on minions: dev-arcee01.c.oncenter-XXX Function postgres.psql_query ran on dev-arcee01.c.oncenter-XXX.
        duration:
            578.537
        name:
            postgres.psql_query
        result:
            False
        start_time:
            21:42:31.215361
    salt_|-db_snapshot_|-cmd.run_|-function:
        ----------
        __run_num__:
            1
        changes:
            ----------
            out:
                highstate
            ret:
                ----------
                dev-salt01.c.oncenter-XXX:
                    Namespace(auth_host_name='localhost', auth_host_port=[8080, 8090], disks=None, env='dev', func=<function snapshot at 0x2a16230>, logging_level='ERROR', noauth_local_webserver=False, pretend=False, tags=['db'])
                    ['db']
                    list: XXX name eq .*-db.*
                    {u'items': [{u'status': u'READY', u'sizeGb': u'50', u'kind': u'compute#disk', u'name': u'dev-arcee01--db-d2', u'zone': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a', u'type': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a/diskTypes/pd-standard', u'creationTimestamp': u'2015-01-23T09:25:27.659-08:00', u'id': u'17231275283433604350', u'selfLink': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a/disks/dev-arcee01--db-d2'}], u'kind': u'compute#diskList', u'id': u'projects/XXX/zones/us-central1-a/disks', u'selfLink': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a/disks'}                                                                                                                                                                                                              
                    [u'dev-arcee01--db-d2']
                    Snapshotting dev-arcee01--db-d2
                    {u'status': u'DONE', u'kind': u'compute#operation', u'name': u'operation-1422308552572-50d9504097c60-58b4ed6e-d672913c', u'zone': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a', u'startTime': u'2015-01-26T13:42:33.026-08:00', u'insertTime': u'2015-01-26T13:42:32.672-08:00', u'targetId': u'17231275283433604350', u'targetLink': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a/disks/dev-arcee01--db-d2', u'operationType': u'createSnapshot', u'progress': 100, u'endTime': u'2015-01-26T13:42:56.419-08:00', u'id': u'1864162870866048885', u'selfLink': u'https://www.googleapis.com/compute/v1/projects/XXX/zones/us-central1-a/operations/operation-1422308552572-50d9504097c60-58b4ed6e-d672913c', u'user': XXX@oncenter.com'}                                                                                                                                                                               
        comment:
            Function ran successfully. Function cmd.run ran on dev-salt01.c.oncenter-XXX.
        duration:
            26485.923
        name:
            cmd.run
        result:
            True
        start_time:
            21:42:31.794156
    salt_|-unfreeze_db_|-postgres.psql_query_|-function:
        ----------
        __run_num__:
            2
        changes:
            ----------
            out:
                highstate
            ret:
                ----------
                dev-arcee01.c.oncenter-XXX:
        comment:
            Running function postgres.psql_query failed on minions: dev-arcee01.c.oncenter-XXX Function postgres.psql_query ran on dev-arcee01.c.oncenter-XXX.
        duration:
            159.585
        name:
            postgres.psql_query
        result:
            False
        start_time:
            21:42:58.280271
iggy commented 9 years ago

Okay. Well, just let me know if you guys need anything tested. This is just in dev and doesn't completely destroy our workflow so I can live with it for now.

jacksontj commented 9 years ago

Yea, this should be related. The work is going on in #19729

ghost commented 9 years ago

Okay.

iggy commented 9 years ago

19729 is closed now (and I've updated). I'm still getting this traceback.

2015-01-28 21:53:04,213 [salt.utils.schedule                        ][ERROR   ][879] Unhandled exception running state.orchestrate
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/utils/schedule.py", line 502, in handle_func
    ret['return'] = self.functions[func](*args)
  File "/usr/lib/python2.7/dist-packages/salt/runners/state.py", line 114, in orchestrate
    pillar=pillar)
  File "/usr/lib/python2.7/dist-packages/salt/modules/state.py", line 593, in sls
    opts = _get_opts(kwargs.get('localconfig'))
  File "/usr/lib/python2.7/dist-packages/salt/modules/state.py", line 152, in _get_opts
    opts = copy.deepcopy(__opts__)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python2.7/copy.py", line 182, in deepcopy
    rv = reductor(2)
  File "/usr/lib/python2.7/multiprocessing/sharedctypes.py", line 213, in __reduce__
    assert_spawning(self)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 51, in assert_spawning
    ' through inheritance' % type(self).__name__
RuntimeError: SynchronizedString objects should only be shared between processes through inheritance
$ salt-master --versions
           Salt: 2015.2.0-1629-g89d47fe
         Python: 2.7.3 (default, Mar 13 2014, 11:03:55)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: Not Installed
       pycrypto: 2.6
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 13.1.0
           RAET: Not Installed
            ZMQ: 3.2.3
           Mako: Not Installed
jacksontj commented 9 years ago

20233 should fix this.

iggy commented 9 years ago

Now there's another issue. Not sure if it's related...

2015-01-30 16:43:44,269 [salt.utils.schedule                        ][ERROR   ][5248] Unhandled exception running state.orchestrate
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/utils/schedule.py", line 502, in handle_func
    ret['return'] = self.functions[func](*args)
  File "/usr/lib/python2.7/dist-packages/salt/runners/state.py", line 116, in orchestrate
    __jid_event__.fire_event({'data': ret, 'outputter': 'highstate'}, 'progress')
NameError: global name '__jid_event__' is not defined
jacksontj commented 9 years ago

Related to the runner re-work, I can reproduce-- should have a fix in a bit here.

rallytime commented 9 years ago

Thanks @jacksontj!

iggy commented 9 years ago

After going back and looking, this was actually there before, I just didn't notice.

20233 did fix the other problem though, so thanks for that.

And in general, thanks for taking care of this issue quickly.

jacksontj commented 9 years ago

Yea, this was a regression from @cachedout 's initial implementation, but it would seem that we don't have any test coverage on the master side scheduler-- since its completely busted ;)

jacksontj commented 9 years ago

@iggy I pushed the fix into my other PR (#20233). Works for me locally, just want to let the test suite have a crack at it to see if I introduced any regressions.

Long story short the refactor to this event stuff requires some magic before you can call the function. So I added an API to get a "functions" (dict of name -> func) like interface to runners/wheels.

iggy commented 9 years ago

Looks good. All my error messages are gone, and all my jobs are running as expected.

jacksontj commented 9 years ago

@iggy It has been merged, we should be able to close this out :)