Debug pages for agent and server

amoore877 commented 4 years ago

It would be nice to have debug pages for Server and Agent, perhaps showing:

running status
uptime
some basic statistics (number of registrations in DB / number of SVIDs in memory, for instance)
current SPIFFE ID and expiration timestamp
perhaps parent CA (if applicable) SPIFFE ID and expiration?

This issue can be used to discuss what might be wanted on a debug page, which can be developed / expanded over multiple PRs.

evan2645 commented 4 years ago

I would be super hesitant to build something like this directly into SPIRE Server and Agent. My main concerns are around access control, attack surface, etc.

That being said, there have been a handful of use cases in the past for some sort of socket-based management API on the agent. This API would be authenticated, but the workload accessing it should be able to obtain an SVID from the workload API anyways. So, one solution which might address my above concern is to introduce new authenticated APIs on the server and agent, and if a status page or something similar is desirable, to run a small webserver adjacent to the server/agent that knows how to speak to these APIs.

One question I might have is, who would be accessing these status pages, and from where? Would they be authenticated? How is that managed? Etc.

amoore877 commented 4 years ago

The very basics of this (running status, uptime) seem to be addressed by the existing health endpoint. If we're concerned about opening a new attack surface, perhaps the rest of the information could be rolled into that as custom information?

As far as who would be accessing: An implied part of this is "from where"; ideally, debug / health info is available theoretically from "anywhere" (barring network or host security restrictions). The other suggested information is not particularly sensitive, though the DB registration count (if implemented) and other potentially expensive operations would have to be something periodically updated rather than live. If there is concern about DoS via the call in general, all of the new information could be periodic. It is then much the same as spamming the existing health endpoint today.

While Server instances by their nature should probably be deployed to dedicated/restricted hosts and thus perhaps only accessed by a maintaining team, it is also information that integrating services or teams or bootstrap steps may have an interest in. During some other alert or issue, being able to quickly make a single call for either sanity checking or commonly vital information is helpful. This is particularly the case for Agents. If someone is bringing up a new service or host or owns an existing one and SVIDs are no longer being issued, it would be helpful to have this information available both for the maintaining team to quickly get more information and better yet for the initial finder to triage themselves (possibly avoiding a need to contact maintainers). As scale increases it becomes more difficult to be able to assert the health and state of every single Agent instance so adding a method to get more information, particularly for users, makes the Agent more maintainable as a whole.

As far as authentication, it's probably clear at this point I believe debug and health information should be unauthenticated, though of course as mentioned above that means care has to be taken with what information is actually available and what the cost is of handling the request. If we're trying to see health or debug info for a SPIRE component, there's a good chance we don't have a SPIRE identity to go with the call, and maintaining other authentication methods for this within SPIRE feels out of scope.

azdagron commented 4 years ago

I think there is room for a local-only view of debug information. We'd have to be careful to not provide any sensitive information over said view. If this was over UDS, we could restrict this info (or maybe just provide extra info) to root callers. This kind of information would likely be stuff folks could already glean from logs (albeit more conveniently), which should be scrutinized against logging sensitive information.

Like evan, I think that opening this up to remote callers has additional security restrictions that, for me at least, outweigh the benefits. The agent, which carries the private keys of workloads, shouldn't add attack surfaces without careful scrutiny and obvious gain. Agents, for example, don't currently have any remote-accessible API surface. Adding one just to get remotely what you can get via a local process doesn't feel worth it.

If we added a local API, there would be nothing to stop operators who are willing to accept the risk from building their own sidecars for obtaining SPIRE health information in a way that is best ingested by them.

amoore877 commented 4 years ago

Agreed, the host-local approach would be best. I think we'd have to have a listing of info deemed acceptable to show on the debug page, and what items would qualify as "only root should see this", to determine whether there should be both a root and non-root page, or just one of those.
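One way to picture the root vs. non-root split is a single listing in which each item is flagged, with the view filtered by caller UID. This is a sketch under assumptions: the `Field` type and the example items are hypothetical, which items (if any) would be root-only is undecided in this thread, and obtaining the caller's UID from the socket is OS-specific and left out here.

```go
package main

import "fmt"

// Field is one item on the debug view. RootOnly marks items that
// only a root caller should see (which items those are is undecided).
type Field struct {
	Name     string
	Value    string
	RootOnly bool
}

// VisibleFields returns the subset of fields the caller may see.
// callerUID is assumed to come from the socket's peer credentials.
func VisibleFields(all []Field, callerUID uint32) []Field {
	visible := make([]Field, 0, len(all))
	for _, f := range all {
		if f.RootOnly && callerUID != 0 {
			continue // hide root-only items from non-root callers
		}
		visible = append(visible, f)
	}
	return visible
}

func main() {
	fields := []Field{
		{Name: "uptime", Value: "3h"},
		{Name: "svid_count", Value: "12"},
		// Placeholder for anything deemed "only root should see this".
		{Name: "example_root_only_item", Value: "...", RootOnly: true},
	}
	fmt.Println("non-root sees:", len(VisibleFields(fields, 1000)))
	fmt.Println("root sees:", len(VisibleFields(fields, 0)))
}
```

If the listing ends up with no root-only items, the same structure collapses to a single page.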

azdagron commented 4 years ago

To scope this, I think we should start with just the local-only API to return the debug information in lieu of returning an actual debug HTML view. The latter can be revisited later.

MarcosDY commented 4 years ago

I created a PR with a proposal for debug endpoints. It consists of two protos:

Server
Agent

In both protos, my idea was to provide a summary of the current state of the server/agent. Both are open to any local caller (in the case of the agent, we may request an SVID in order to allow access to debug). Maybe we can go further and request an admin SVID (or a new byte for it) if we are worried about security.

As for use cases for the agent, I can think of:

user wants to know SVIDs are propagated without reading all logs
user wants agent's SPIFFE ID (and when it rotates)
user wants the server's SPIFFE ID (and when upstream rotates)

And for the server, it is more like a summary of how many entries / federations it has.

@amoore877 do you have some use cases we must satisfy with our debug endpoint? and do you think actual protos can satisfy what you wanted?

amoore877 commented 4 years ago

The proto looks good to me (small comment below in first case).

> @amoore877 do you have some use cases we must satisfy with our debug endpoint? and do you think actual protos can satisfy what you wanted?

An organization has a service that monitors host health fleetwide, and spire agent's health is considered a part of overall host health. It should be able to tell that: 1) the agent is up, 2) the agent has been able to sync with server, 3) the agent is continually able to sync with server (not expired or in danger of expiry). I think the uptime, expires_at, and svid_count parts satisfy that. I suggest a "last successful sync with server timestamp" may also be helpful.
Maintainers of spire in an org want to know how many registrations and nodes are part of the ecosystem without needing to do the slightly awkward CLI + pass in socket param operation + consolidating to a true count with terminal piping magic. I think that's satisfied in the server proto.
Maintainers of spire in an org want to allow some level of self-service for a large variety of host owners to triage their spire agents before escalating to the team. I think the proto provides sufficient basic information to allow this.

Related, but should be a different PR, is an overall uptime metric as described in #1231.

> maybe we can go further and request an admin SVID (or a new byte for it) if we are worried about security.

I don't know if this would be necessary, at least for the present information. I could see the argument from an aggressive security standpoint for a config to disable the endpoint, and/or a config to require auth or not. If there are concerns about the information exposed, there should be a discussion on how the information could be used by an attacker who is already on the host to more easily break or exploit something.

spiffe / spire

Debug pages for agent and server #1320