saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

[BUG] schedule splay and maxrunning are ignored when using an inverted range #65336

Open pckroon opened 1 year ago

pckroon commented 1 year ago

Description The goal is to run a highstate on my minions during the night, so I made a schedule for this in my pillar data. However, when too many minions run highstates at once (in particular the apt-get update step), my apt-cacher server has a bad day and the jobs hang. My workaround is to add a splay and maxrunning to the salt schedule, but the next morning all jobs are hung anyway, and they all started at the same second.

Setup

```yaml
schedule:
  highstate:
    function: state.highstate
    minutes: 60
    splay:
      start: 300
      end: 3600
    maxrunning: 5
    jid_include: True
    run_on_start: False
    range:
      invert: True
      start: '05:00'
      end:   '18:00'
```

The salt-master is on an LXC VM; the minions are on-prem machines, all running Debian 12 (bookworm, per the versions reports below) using the onedir packages.

Expected behavior Not all jobs firing at the exact same time

Versions Report

(master 3006.3) `salt --versions-report`:

```yaml
Salt Version:
  Salt: 3006.3

Python Version:
  Python: 3.10.13 (main, Sep 6 2023, 02:11:27) [GCC 11.2.0]

Dependency Versions:
  cffi: 1.14.6
  cherrypy: unknown
  dateutil: 2.8.1
  docker-py: Not Installed
  gitdb: Not Installed
  gitpython: Not Installed
  Jinja2: 3.1.2
  libgit2: Not Installed
  looseversion: 1.0.2
  M2Crypto: Not Installed
  Mako: Not Installed
  msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  packaging: 22.0
  pycparser: 2.21
  pycrypto: Not Installed
  pycryptodome: 3.9.8
  pygit2: Not Installed
  python-gnupg: 0.4.8
  PyYAML: 6.0.1
  PyZMQ: 23.2.0
  relenv: 0.13.10
  smmap: Not Installed
  timelib: 0.2.4
  Tornado: 4.5.3
  ZMQ: 4.3.4

System Versions:
  dist: debian 12 bookworm
  locale: utf-8
  machine: x86_64
  release: 5.10.0-11-amd64
  system: Linux
  version: Debian GNU/Linux 12 bookworm
```

(minion 3006.2) `salt minion1 test.versions`:

```yaml
Salt Version:
  Salt: 3006.2

Python Version:
  Python: 3.10.12 (main, Aug 3 2023, 21:47:10) [GCC 11.2.0]

Dependency Versions:
  cffi: 1.14.6
  cherrypy: 18.6.1
  dateutil: 2.8.1
  docker-py: Not Installed
  gitdb: Not Installed
  gitpython: Not Installed
  Jinja2: 3.1.2
  libgit2: Not Installed
  looseversion: 1.0.2
  M2Crypto: Not Installed
  Mako: Not Installed
  msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  packaging: 22.0
  pycparser: 2.21
  pycrypto: Not Installed
  pycryptodome: 3.9.8
  pygit2: Not Installed
  python-gnupg: 0.4.8
  PyYAML: 6.0.1
  PyZMQ: 23.2.0
  relenv: 0.13.3
  smmap: Not Installed
  timelib: 0.2.4
  Tornado: 4.5.3
  ZMQ: 4.3.4

System Versions:
  dist: debian 12 bookworm
  locale: utf-8
  machine: x86_64
  release: 6.1.0-11-amd64
  system: Linux
  version: Debian GNU/Linux 12 bookworm
```

(minion 3006.3) `salt minion2 test.versions`:

```yaml
Salt Version:
  Salt: 3006.3

Python Version:
  Python: 3.10.13 (main, Sep 6 2023, 02:11:27) [GCC 11.2.0]

Dependency Versions:
  cffi: 1.14.6
  cherrypy: 18.6.1
  dateutil: 2.8.1
  docker-py: Not Installed
  gitdb: Not Installed
  gitpython: Not Installed
  Jinja2: 3.1.2
  libgit2: Not Installed
  looseversion: 1.0.2
  M2Crypto: Not Installed
  Mako: Not Installed
  msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  packaging: 22.0
  pycparser: 2.21
  pycrypto: Not Installed
  pycryptodome: 3.9.8
  pygit2: Not Installed
  python-gnupg: 0.4.8
  PyYAML: 6.0.1
  PyZMQ: 23.2.0
  relenv: 0.13.10
  smmap: Not Installed
  timelib: 0.2.4
  Tornado: 4.5.3
  ZMQ: 4.3.4

System Versions:
  dist: debian 12 bookworm
  locale: utf-8
  machine: x86_64
  release: 6.1.0-11-amd64
  system: Linux
  version: Debian GNU/Linux 12 bookworm
```

Additional context Yes, I know the real problem is my apt cache server hanging apt update jobs.

EDIT: formatting

whytewolf commented 1 year ago

this looks like a minion schedule, in which case maxrunning wouldn't help anyway. maxrunning controls how many parallel jobs run on each minion, not how many minions run the schedule. minions only care about what they know.

but looking at the setup: you have each minion building up a set of scheduled runs, each run firing between 300 and 3600 seconds after it is asked to run. every hour it adds another one, up to the total of 5 each minion is allowed to run at the same time, although the state system will only let 1 run anyway. [this is what those settings mean.] change maxrunning to 1.
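For reference, the pillar schedule with that fix applied would look like this (a sketch based on the setup from the report; only maxrunning changes, comments are mine):

```yaml
schedule:
  highstate:
    function: state.highstate
    minutes: 60
    splay:
      start: 300         # random delay of 300-3600 seconds per run
      end: 3600
    maxrunning: 1        # at most one highstate job per minion at a time
    jid_include: True
    run_on_start: False
    range:
      invert: True       # run only outside the 05:00-18:00 window
      start: '05:00'
      end:   '18:00'
```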

can you get a log with trace level logging from the minion? it should show when the job ran as well as when it was expected to run. right now only the splay is in question; maxrunning, again, is about parallel jobs on the same minion.

pckroon commented 1 year ago

Thanks for the quick reply, and for clarifying the meaning of maxrunning. I've set it to 1 for now. I also set `log_level_logfile: trace` for my favourite minion and restarted the service. I'll get back tomorrow with the outcome.

pckroon commented 1 year ago
```
2023-10-04 13:03:28,141 [salt.utils.event :823 ][DEBUG   ][2805748] Sending event: tag = /salt/minion/minion_schedule_add_complete; data = {'complete': True, 'schedule': {'highstate': {'function': 'state.highstate', 'minutes': 60, 'splay': {'start': 300, 'end': 3600}, 'maxrunning': 1, 'jid_include': True, 'run_on_start': False, 'range': {'invert': True, 'start': '05:00', 'end': '18:00'}}, '__mine_interval': {'function': 'mine.update', 'minutes': 60, 'jid_include': True, 'maxrunning': 2, 'run_on_start': True, 'return_job': False, 'enabled': True}, 'enabled': True}, '_stamp': '2023-10-04T11:03:28.141687'}
<snip>
2023-10-04 18:00:00,823 [salt.utils.schedule:32  ][TRACE   ][2806535] ==== evaluating schedule now None =====
2023-10-04 18:00:00,824 [salt.utils.schedule:1762][DEBUG   ][2806535] schedule: Job highstate was scheduled with jid_include, adding to cache (jid_include defaults to True)
2023-10-04 18:00:00,824 [salt.utils.schedule:1768][DEBUG   ][2806535] schedule: Job highstate was scheduled with a max number of 1
2023-10-04 18:00:00,824 [salt.utils.schedule:1792][INFO    ][2806535] Running scheduled job: highstate (runtime missed by 12308.98696 seconds) with jid 20231004160000824656
2023-10-04 18:00:00,830 [salt.utils.process:1145][DEBUG   ][2806535] Subprocess Schedule(name=highstate, jid=20231004160000824656) added
<snip>
2023-10-04 18:00:01,825 [salt.utils.schedule:1630][DEBUG   ][2806535] schedule.handle_func: Adding splay of 545 seconds to next run.
<snip>
```

This is followed by 15 seconds of sls rendering and library loading, before running dpkg-query and...

```
2023-10-04 18:00:16,861 [salt.loaded.int.module.cmdmod:438 ][INFO    ][2850891] Executing command apt-get in directory '/root'
```

So as far as I can tell the splay is completely ignored. At 18:05:45 there is also no run, nor any informative logs.

I'm also not sure why salt reports `runtime missed by 12308.98696 seconds`.

whytewolf commented 1 year ago

check `salt-call schedule.show_next_fire_time highstate` in the off-peak hours. it might be scheduling it in the middle of the interval and then delaying until the next open time, which would account for the runtime being missed.

can you also add this setting, just to make sure:

```yaml
run_after_skip_range: False
```
pckroon commented 1 year ago
```
root@salt:~# salt minion1 schedule.show_next_fire_time highstate
minion1:
    ----------
    next_fire_time:
        2023-10-06T04:32:57
    result:
        True
```

This looks sane (the timezone on the master is off, hence the 2-hour difference).

After adding `run_after_skip_range: False` and syncing the pillar data:

```
    next_fire_time:
        2023-10-06T13:14:32
```

I'll let you know after the weekend whether run_after_skip_range changed anything.
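For completeness, pushing the changed pillar out to the minion and confirming the scheduler picked it up can be done with standard salt commands (a sketch; `minion1` as used in this thread):

```shell
# re-render pillar on the minion so the changed schedule option takes effect
salt minion1 saltutil.refresh_pillar

# inspect the minion's current view of its schedule
salt minion1 schedule.list

# and re-check when the job will fire next
salt minion1 schedule.show_next_fire_time highstate
```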

pckroon commented 1 year ago

Unfortunately that doesn't seem to have made a difference: `2023-10-08 18:00:00,825 [salt.utils.schedule:1792][INFO ][2806535] Running scheduled job: highstate (runtime missed by 43031.999353 seconds) with jid 20231008160000825078`, followed by an immediate scheduling and execution of a highstate.

pckroon commented 1 year ago

@whytewolf Anything else I could try?

whytewolf commented 1 year ago

so, what is happening is that the splay is being recognized and used, as is invert. the issue is that when the time the schedule wants to run falls inside the blackout period, it holds onto that schedule item until the next available non-blackout time. that is why the missed time is so high: it missed the time it should have run, so it waits until it can run.

the only way i see around this, given the reason for your schedule in the first place, would be to schedule an orchestration on the master and use batching to limit the number of minions running at a time. this way you can ensure only a select few minions are running, because the master only tells a select few minions to run at a time.
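As a quick way to see the batching behavior before wiring up an orchestration, the same limit can be applied ad hoc from the master CLI (a sketch; `workstations` is the nodegroup from this thread):

```shell
# run a highstate on at most 3 minions at a time
salt -N workstations --batch-size 3 state.highstate
```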

pckroon commented 1 year ago

Thanks. I'll give the orchestrate route a go. I made the following orchestration sls, which seems to be working as intended:

```yaml
# /srv/salt/_orch/highstate.sls
highstate:
  salt.state:
    - tgt: workstations
    - tgt_type: nodegroup
    - highstate: True
    - batch: 3
```

And I'll hook this up to a cron job on the master to run `salt-run state.orchestrate _orch.highstate` at a convenient time. Is there a better way to get this to run regularly?
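One alternative to cron: the master has its own scheduler, which can run runner functions such as state.orchestrate directly. A sketch of scheduling the orchestration in the master config (assumptions: the file path and job ID are illustrative, and the `cron:` option requires the croniter library to be installed):

```yaml
# /etc/salt/master.d/schedule.conf  (illustrative location)
schedule:
  nightly_highstate:
    function: state.orchestrate   # runner function, executed on the master
    args:
      - _orch.highstate
    cron: '0 19 * * *'            # every day at 19:00, master-local time
```

This keeps the scheduling inside salt (visible via `salt-run` and the event bus) rather than in an external crontab, at the cost of needing a master restart or schedule update when the timing changes.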