praiskup / resalloc

Allocator and manager for (expensive) resources
GNU General Public License v2.0

Agent Spawner daemon #123

Closed: praiskup closed this issue 11 months ago

praiskup commented 1 year ago

There are use cases like in OpenScanHub/Kobo, where workers/resources are self-standing privileged "agents" that decide what to do about themselves (compete with other agents WRT taking jobs, have privileges to modify a shared database, etc.). This is a different use case than Copr has, where the workers/resources are just non-privileged dummy VMs controlled from the outside via SSH.

In such Agent-like use cases, it's typically possible to guess the ideal number of workers we should have allocated (by introspecting the queue, the currently running tasks, etc.). This number should correspond to the number of tickets taken from the Resalloc system.

To help with maintaining such "agent-like" resources, we could abstract this problem into an "AgentSpawner" daemon doing this loop:

tickets = []

while True:
    N = ask_the_outside_world_for_the_ideal_number_of_tickets()
    todo = N - len(tickets)
    if todo > 0:
        StartNewAgents(todo)
    elif todo < 0:
        TryToStopAgents(-todo)
    blocking_sleep_period()

def StartNewAgents(to_be_started):
    for i in range(to_be_started):
        ticket = get_ticket(tags)
        tickets.append(ticket)
        data = wait_for_ticket(ticket)
        # configure worker keys, other tokens, etc., and register the
        # worker in the OSH database, e.g.:
        call_TAKE_hook(data)

def TryToStopAgents(to_be_stopped):
    stopped = 0
    # iterate over a copy, since we remove items from "tickets" below
    for ticket in list(tickets):
        if stopped >= to_be_stopped:
            break

        data = check_ticket(ticket)
        if call_RELEASE_hook(data):
            # safe to remove the worker, because the release hook
            # succeeded (atomic operation)
            close_ticket(ticket)
            tickets.remove(ticket)
            stopped += 1

Configuration could look like /etc/resalloc-agents.yaml:

osh_workers:
    tags: ["x86_64"]
    cmd_converge_to: /usr/local/take-ideal-number-of-OSH-workers
    cmd_take: /usr/local/take-worker
    cmd_try_release: /usr/local/try-to-release-worker
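
For illustration, the cmd_converge_to hook would simply print the ideal number of workers to stdout. A minimal sketch of such a script, assuming hypothetical hub-introspection helpers (count_queued_tasks() and count_running_tasks() are placeholders, not an existing OSH/Kobo API):

#!/usr/bin/python3
# Hypothetical /usr/local/take-ideal-number-of-OSH-workers sketch.
# count_queued_tasks() and count_running_tasks() are placeholders for
# whatever hub/queue introspection OSH provides.
import sys

MAX_WORKERS = 20  # illustrative upper limit

def main():
    wanted = count_queued_tasks() + count_running_tasks()
    print(min(wanted, MAX_WORKERS))
    return 0

if __name__ == "__main__":
    sys.exit(main())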
praiskup commented 1 year ago

@siteshwar @kdudka what do you think about this?

kdudka commented 1 year ago

I think this should work. The tricky part from the user's point of view will be to implement the RELEASE_hook in a reliable, race-free way, as we discussed yesterday.

praiskup commented 1 year ago

Indeed. I'm not familiar enough with the OSH Worker's logic to give useful guidance, but the whole point is to implement a script that just attempts to stop the OSH worker daemon (so it doesn't take new jobs), or fails. Answering "am I actually doing something" shouldn't be a dilemma for the OSH Worker.

kdudka commented 1 year ago

@siteshwar I think the RELEASE_hook could work like this:

  1. set max_load on the specific worker (identified by hostname) to 0 in order to make sure that it does not pick tasks any more
  2. check whether the worker has any tasks assigned:
     a. If yes, fail the hook.
     b. If no, decommission the worker in OSH.
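
A rough sketch of such a hook, assuming it gets the worker's hostname as its first argument; set_worker_max_load(), worker_has_assigned_tasks() and decommission_worker() are placeholder names, not existing OSH/Kobo calls:

#!/usr/bin/python3
# Hypothetical try-to-release hook.  Exit status 0 means the worker was
# safely decommissioned (the spawner may close the ticket), non-zero means
# the worker is still busy and the hook should be retried later.
# set_worker_max_load(), worker_has_assigned_tasks() and
# decommission_worker() are placeholders for the real OSH hub API.
import sys

def main():
    hostname = sys.argv[1]
    # 1. make sure the worker stops picking new tasks
    set_worker_max_load(hostname, 0)
    # 2. anything still assigned?  Then fail, and let the spawner retry.
    if worker_has_assigned_tasks(hostname):
        return 1
    # nothing assigned => safe to decommission the worker in OSH
    decommission_worker(hostname)
    return 0

if __name__ == "__main__":
    sys.exit(main())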
siteshwar commented 1 year ago

@siteshwar I think the RELEASE_hook could work like this:

  1. set max_load on the specific worker (identified by hostname) to 0 in order to make sure that it does not pick tasks any more

Can this be set as soon as the worker reports a running task to the hub? So that we could ensure it never gets another task.

kdudka commented 1 year ago

I am not sure, to be honest. A long time ago I needed to set max_load to 2 in order to make tasks with sub-tasks actually work in OSH. But it could have been due to a kobo bug that has been fixed since then. You can give it a try and, if it works reliably for VersionDiffBuild and ErrataDiffBuild tasks with 1 or 2 sub-tasks, I am fine with that.

FrostyX commented 1 year ago

I only have comments about the implementation details.

Configuration could look like /etc/resalloc-agents.yaml

I think the configuration could be part of pools.yaml because even though agents behave differently, there is still a pool of them. Having this in pools.yaml should also give us some features (e.g. multiple pools of agents) for free, and not re-invent the same config options that we already parse.

Distinguishing a pool of agents from our current workers could be done by either

Also, it may be a good idea to have this as a part of our standard "loop". It should allow us to have a combination of cmd_new to provision a new agent, cmd_take to use and re-use it, cmd_try_release (we may use the existing cmd_release) to release it for another ticket, and cmd_delete.
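
To illustrate the idea, an agent pool in pools.yaml might then look roughly like this (only a sketch: cmd_converge_to, cmd_take and cmd_try_release would be the new agent-specific hooks from this issue, and the script paths are illustrative):

osh_agents:
    max: 20
    tags: ["x86_64"]
    # standard pool hooks
    cmd_new: /usr/local/start-osh-worker
    cmd_delete: /usr/local/delete-osh-worker
    # agent-specific hooks proposed in this issue (names illustrative)
    cmd_converge_to: /usr/local/take-ideal-number-of-OSH-workers
    cmd_take: /usr/local/take-worker
    cmd_try_release: /usr/local/try-to-release-worker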

All of this may be obvious, and you already have it figured out. But just in case ...

praiskup commented 1 year ago

I think the configuration could be part of pools.yaml

I thought this could be a separate config, and actually even a separate package/daemon, because this would rather be a "client" helper thing (not necessarily running on the same host). I'm also a bit afraid of a logical mixup of "tickets" with "resources" (a resource is started by a ticket which is taken by another resource == agent). Hmm, worth considering anyway, thank you for bringing this up.

praiskup commented 1 year ago

Just for the record, a dummy proof-of-concept (which could evolve into a real patch, if considered useful) is in #125.

praiskup commented 11 months ago

PR #125 was merged and released. Closing.