Users should be able to administer workers across provisioner boundaries

gregarndt commented 7 years ago

Users do not have a general way of administering worker types and workers across provisioner boundaries (some do not even have an actual provisioner).

For AWS provisioned workers, users can see some data about that worker type such as pending counts and instances running for a given worker type. However information specific to each of those specific instances, as well as some general information for a worker type (failure rate, etc) is not available.

Also, given that this UI is specific to the AWS provisioner, this means that any worker will need to have a provisioner that could provide this information as well as another UI to display them.

This RFC is suggesting a few things (all through one common UI):

users should be able to display worker types (along with relevant information) across all provisioner IDs known to taskcluster
users should be able to drill down to display details about a specific worker (uptime, last claim, failure rate)
users should be able to perform actions against those specific workers (reboot, kill, disable)
workers that are disabled will remain alive but no tasks will be returned by taskcluster when the worker calls claimWork. Optionally, the worker could respond to a specific 4xx status code returned by taskcluster to know that it has been disabled. Disabling workers specifically allows the machine to remain alive but not accepting jobs so a user can perform diagnostics/debugging on the machine
rebooting a machine is particularly useful for physical machines that are not killed, but rather need to be rebooted because some way they got stuck. This might mean the machine is powered off/on again in the case of a machine hooked up to a PDU we can speak to, or it might mean it's actually rebooted through some API

gregarndt commented 7 years ago

Copied from https://github.com/taskcluster/taskcluster-rfcs/issues/71#issuecomment-312365779

We discussed a variant of this proposal today in SF. Here's what we determined:

The model is three nested REST entities: provisioner, worker type, and worker

declareProvisioner
    declares a provisionerId (do not expire)
    declares "actions" that can be taken on workers
    may later declare policies for managing those
declareWorkerType
    declares a worker type, with docs and a payload schema
    expires can be set to weeks for "production" workerTypes, an hour for test workerTypes
declareWorker
    declares a worker id, worker type, expires, boot time, and worker metadata (worker.extra)
    expiration is extended by claimWork; an expired worker is deleted

This data is queried as follows:

list/get provisioner (just returning the declared data)
delete provisioner (administrative operation)
list/get worker type (by provisionerId; get returns declared data, worker count, and pending count)
list/get worker (by worker type; get returns declared data, including worker.extra)
get worker last claims (returns a best-effort list of recent taskIds claimed and when they were claimed)

Things that are still undefined:

scopes required for declaration
how actions are defined
    defined at the provisioner level or the worker type level?
    schema for definition?
the stuff above regarding slugids (@jonasfj and @petemoore are discussing this now)

Short-term implementation of this would involve the queue gathering data from claimWork, so that we don't need to modify any workers. Most of these fields will be null.

and from https://github.com/taskcluster/taskcluster-rfcs/issues/71#issuecomment-312368878

This will require putting some data in postgres, but this is not #65. The tables we will need are something like (with * indicating primary key):

Provisioners

provisionerId *

WorkerTypes

provisionerId *
workerType *
docs
payload schema
expires

Workers

workerGroup *
workerId *
provisionerId *
workerType *
extra
boot time
expires
created (set on first declaration)
last claims

jhford commented 7 years ago

Related to this discussion is my comment from https://github.com/taskcluster/taskcluster-rfcs/issues/71#issuecomment-313388095.

Aside from a registry of provisioners, I don't agree with this RFC. Besides registering of provisioners, the information stored within is mostly a duplicate of information stored in the various provisioners, the queue or could better be derived from other sources. Because it's a best effort service and because it does things like TTL on registration, the information in it cannot be trusted to make any real assertions.

Instead, I wonder if a better way would be to log all the tasks created for a given ProvisionerId/WorkerType in the Queue to a datastore. We could store information about which worker claims a given task in the same datastore. If we have a registry of provisioners, we can take the provisionerId from the task definition and make calls to the corresponding provisioner when that information is needed.

The advantage to doing this is that there's no registration. This means that the information is available for every workertype, every worker automatically and without the complexity of registration. It also means that we can trust this information a lot more, not least because it's based on what's actually happening, not an opt-in-best-effort-expiring-registration.

As well, the various (future) provisioners should already know about which workerTypes they have, and for those which do not use a provisioner, a simple "fake provisioner" could be built which does the best-effort-opt-in-registration model.

jhford commented 7 years ago

We had a meeting today and afterwards some of us continued to chat in IRC. We started to come to a consensus about a possible path forward which we agreed on and would enable @helfi92 to start working on the dashboard.

This proposal is divided into two stages, creatively named stage 1 and stage 2. I've taken the liberty to make the proposal a little more concrete. Stage 1 is a concrete proposal we should look at implementing shortly if we agree on it. Stage 2 is one possible way that we could solve some of the requirement to have payload schema and documentation more available.

Stage 1

We will begin inferring currently known provisioners, worker types, workers from claimWork and reclaimTask. When a call to claimWork/reclaimTask is made, the Queue would update a last-used field in a datastore for each of:

provisionerId
provisionerId:workerType
provisionerId:workerType:workerGroup
provisionerId:workerType:workerGroup:workerId

We would then add the following endpoints to the Queue API. These endpoints would only list those resources which had last been seen in the wild within the last 5 days:

GET /provisioners/ to list out provisionerIds that have been seen. Output would be a simple list of strings
GET /provisioners/:provisionerId/worker-types to list out worker types for a given provisioner. Output would be a list of strings
GET /provisioners/:provisionerId/worker-types/:workerType/workers to list out a list of workerGroup/workerId string.

Maybe we could make the format of the lists be lists of objects in the form: {id, lastSeen} so the dashboard has even more information.

Nothing else would be added to the API, since all the information is able to be inferred accurately from the queue endpoints.

Stage 2

This would involve allowing creation of supplementary metadata for a given provisionerId:workerType. Adding this supplementary metadata would be seen as purely optional. Those worker types which have metadata would be considered 'stickier' because of the added effort, we would include them in the apis noted in Stage 1 above for 30 days instead of only 5 days.

A point of clarification is that a worker type would not be included in the stage 1 endpoint responses unless it had been seen in the wild within the last 30 days. This means that even a worker type with metadata would not be returned in the list unless it had been seen in the last 30 days. The purpose of this is to ensure that worker types which have gone extinct don't clutter up our dashboard.

The following endpoints would be added:

PUT /provisioners/:provisionerId/worker-types/:workerTypes/metadata. This would add metadata for a provisioner/workerType regardless of whether it has been seen. with a document in a format which has not been defined, but could be like:
```
{
payloadSchema,
documentation,
}
```
DELETE /provisioners/:provisionerId/worker-types/:workerTypes/metadata
GET /provisioners/:provisionerId/worker-types/:workerTypes/metadata

jonasfj commented 7 years ago

Those worker types which have metadata would be considered 'stickier' because of the added effort,

This is really nice.. We have dummy workertypes when we test worker implementations. So having them disappear is nice :)

djmitche commented 7 years ago

I agree. I kinda want less than 5 days for dummy workerTypes, but we can debate time periods once stage 1 is underway. At the moment I'm most concerned to hear objections to stage 1.

petemoore commented 7 years ago

I like it!

I wonder if we should start a separate component for this, since the queue already has quite a large scope of responsibility:

storing tasks
deciding on work assignment (which tasks for which workers)
artifact storage and retrieval
task scheduling based on task dependencies

Not sure what we'd call this new component. The only (bad) names I could come up with are:

Audit
Auditor
Register
Bookkeeper

(I'm thinking simple names like we have for Queue, Index, Scheduler, Auth)

petemoore commented 7 years ago

Related to stage 2, I found this 2+ year old bug: Bug 1164203 - generic-worker: Upload usage.js to references.t.n/workers///v1/ which documents some early discussion on the topic of registering payload schemas.

jonasfj commented 7 years ago

Since we want to infer most of this from the claimWork end-point, I think it's natural that it lives in the queue. If we have cyclic dependent services we're often better off without micro services.

djmitche commented 7 years ago

The list of workerTypes is, in essence, the list of queues -- so that at least makes a lot of sense in the queue service!

petemoore commented 7 years ago

So technically it would be pretty simple for something to listen on the queue's pulse exchanges to do all this work, which would separate out the logic to a different component, which could easily be dropped/replaced/adapted without the queue being affected. But if I'm outnumbered, fine! My concern is we could be building a very big service that becomes difficult to untangle later (i.e. a couple of years down the line) due to the numerous responsibilities it has.

djmitche commented 7 years ago

The "Decided" tag here represents stage 1. We can re-visit, or make a new PR, for stage 2.

helfi92 commented 7 years ago

RFC #86 has been created for Stage 2. Let's continue the discussion for that there.

djmitche commented 6 years ago

This RFC is stored as rfcs/0082-Users-should-be-able-to-administer-workers-across-provisioner-boundaries.md

taskcluster / taskcluster-rfcs

Users should be able to administer workers across provisioner boundaries #82

Stage 1

Stage 2