Allow listing of client-local allocations / tasks / task groups.

stswidwinski commented 2 years ago

Proposal

When using a localhost http client to communicate with the local Nomad node, I can expect a subset of RPC calls to not leave the localhost (as defined here: nomad/rpc.go at main · hashicorp/nomad · GitHub). This allows me to get information about a particular allocation (by its ID) present on the Nomad node such as resource usage and similar.

However, this set of calls does not allow me to discover the allocations that are known to the local node at the time of the query, forcing discovery to go through the Nomad Server (either a replica if we allow stale data or a master if we do not as defined in the routing policy: nomad/rpc.go at 3f67b5b8ebd78673e2f431f7822f60af53a6efea · hashicorp/nomad · GitHub).

The proposal is to add a node-local query which surfaces the currently known allocations (alongside a set of basic identifiers such as task, task group, job , namespace name and basic information such as status of the allocation) to the caller without forwarding the query off-box.

This would be a logical equivalent of https://www.nomadproject.io/api-docs/nodes#list-node-allocations for a given node served from the node-local state.

NOTE: For initial discussion on the topic please see https://discuss.hashicorp.com/t/nomad-client-allocation-discovery/44357

Use-cases

It is not uncommon for leaf-nodes of a distributed system (here: Nomad client servers) to perform auxiliary operations such as monitoring, logging and metric aggregation. Such operations naturally should keep their scope constrained to the local node (host) and not spill outside of it since otherwise

The number of leafs in the system (here: Nomad clients) increase the load on the root (here: Scheduler(s)), which may result in a DDoS in case of synchronization (more concretely: if each node queries the scheduler for node-local data and all of the nodes happen to synchronize, one can effectively self-DDoS)
Client-server network partitions cause node-local operations to fail with no recourse (even though, theoretically, node-local operations don't have to be affected)
Leaf nodes (Nomad clients) may become partitioned from the scheduler(s), yet not from logging or metrics aggregation systems causing impaired visibility into the actual state of partitioned nodes.

Attempted Solutions

The "best" solutions that I can muster using the existing APIs breaks the abstraction of an API either by peeking into data that is clearly meant to be "private" or "unstructured":

One may peer into the nomad data dir to list the directories within the alloc dir for IDs present there. However, this will only work for allocations which have made it thus far and for which Nomad Client has managed to create disk-bound artefacts. The permissions on the alloc dir is also 0611 which clearly shows that this is not meant for public consumption.
One may also abuse the metrics endpoint which contains most of this information within labels without promising the structure or the persistence of the settings. Here is a quick-and-dirty jq query which accomplishes the task:

curl -s --cacert <CA_CERT_PATH> --cert <CERT_PATH> --key <KEY_PATH>  https://localhost:4646/v1/metrics  | jq '.Gauges[].Labels | if type == "object" and has("task") then .alloc_id,.task,.namespace else null end' | grep -v null
...
"10999fc9-c255-7728-1ba6-b3358c6ef240"
"workload"
"workload-testing"
"c4371047-b100-4ec8-4e17-b1c986d0b661"
"auxiliary-workload"
"auxiliary-workload-testing"
...

With a little bit more creativity we can sort, uniq and produce a bound list of task, job, namespace, alloc_id tuples. The problem with this approach is that local state may contain non-running allocations which will not cause emission of metrics leading to inaccuracies.

stswidwinski commented 2 years ago

@jrasell, thanks for tagging and triaging! I'm curious if you have a sense of when the roadmapping / triaging / discussion would happen (weeks / months)?

If accepted by the team, I'm happy to provide a PR to facilitate the change and save myself from working around the issue at hand. Let me know what you think

tgross commented 1 year ago

I've got this on my radar and wanted to jot down some thoughts on how this could be implemented.

We currently have a few existing HTTP APIs that can get information about specific allocations from the client directly without going to the server. These are all under the /v1/client API and including Read Allocation Stats, Read File, GC Allocation, etc. When these requests hit a client they get routed to the ClientAllocRequest handler which maps them to client RPCs handled in client/alloc_endpoint.go.

Those client RPC handlers have access to the entire client state and a cache of ACL permissions. For example, a handler could do the following to get a list of all the live allocrunners (untested, handwavy, and probably broken code with request/response types that don't exist yet):

func (a *Allocations) ListAllocationStates(args *SomeRequest, reply *SomeResponse) error {

    aclObj, err := a.c.ResolveToken(args.AuthToken)
    if err != nil {
        return err
    }

    a.c.allocLock.RLock()
    defer a.c.allocLock.RUnlock()

    allocStates := []*state.State{}
    for _, ar := range a.c.allocs {
        if aclObj == nil && aclObj.AllowNamespaceOperation(
            ar.Alloc().Namespace, acl.NamespaceCapabilityReadJob) {
            allocStates = append(allocStates, ar.AllocState())
        }
    }

    reply.AllocStates = allocStates
    return nil
}

Then tasks on the same host with the appropriate Workload Identity could hit the Task API socket to read the allocation states in the namespaces they have access to.

So this is technically fairly straightforward, we just need to design the space of operations we want to expose. Do you have any specific thoughts on that @stswidwinski? ListAllocationStates and/or ListAllocations seems like it gets us a lot just by itself, and then you could follow up with the other existing local operations.

stswidwinski commented 1 year ago

Thanks for a reply and putting some thought into this!

So this is technically fairly straightforward, we just need to design the space of operations we want to expose.

This is good to hear (and seems to square up with my own understanding of the implementaion)!

Do you have any specific thoughts on that @stswidwinski? ListAllocationStates and/or ListAllocations seems like it gets us a lot just by itself, and then you could follow up with the other existing local operations.

I think ListAllocations which returns some small set of information about the allocations alongside more targeted ReadAllocation (more-in-depth but more targetted) would be a great start. The more bike-sheddy part of the conversation would likely revolve around the exact fields to include in which type of a request.

Off-the-top-of-my-head and without thinking about this very in-depth, it seems that the the set of identifiers for each object assigned to the agent (maybe a tuple of job_name, namespace_name, task_name or something similar) would lend itself to the list RPC while everything else would fit into the more specific RPC (allocated resources etc.). I could see the state being part of either of the two RPCs and forms a great example of something to bike-shed.

I have had relatively good experience in other contexts with APIs which allow the requester to specify the level of detail that the response should include. For instance, the request could add includeAllocationState which would cause the server to populate a series of optional fields which are otherwise omitted. I haven't seen significant precedence for this within Nomad, so I'm not sure that this will be useful, especially in the first iteration, but throwing it out there just in case it's useful.

EDIT: Since writing this comment, I have found some existing precedent for the describe include* style flags. See https://developer.hashicorp.com/nomad/api-docs/allocations#task_states for an example.

Then tasks on the same host with the appropriate Workload Identity could hit the Task API socket to read the allocation states in the namespaces they have access to.

I think it's also interesting to think about the handling of non-workload identities / tokens. For instance, if a localhost entity has access to a given token (say one which yields global list and read access), would we respect it at the client level (perhaps with a TTL and periodic invalidation) or would that cause a round trip to the scheduler?

I am not sure if you include this case within the Workload Identity model above (and the ACL cache). Feel free to tell me that this is silly and it's all the same if that's the case!

tgross commented 1 year ago

Leaving a cross-reference to #18077 here, which is similar inasmuch as it would require all the same plumbing in the client agent to make work.

hashicorp / nomad