Propose a Kick Server API Schema for Monitoring Service Status

freeformflow commented 9 years ago

As services go about their activities they can curl requests to the kick server:

During startup, they can update their progress and what's happening.
When they're online and ready, they should tell the kick server.
And when they fail, a curl request issued from the ExecStop directive can signal that the service has failed.

Please propose a schema the kick server can use to collect this information from services.
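
For illustration, the kind of request a service might send could look roughly like this. It is a minimal Python sketch standing in for the curl call; the `/status` path, the JSON field names, and the `build_status_request` helper are all assumptions, not an agreed schema.

```python
import json
from urllib.request import Request


def build_status_request(base_url, service, status, message=""):
    """Build (but do not send) a POST request reporting a service's status.

    The URL layout and field names are hypothetical placeholders.
    """
    body = json.dumps(
        {"service": service, "status": status, "message": message}
    ).encode("utf-8")
    return Request(
        f"{base_url}/status",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would actually send it; building and sending are kept separate here so the payload can be inspected first.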

PandaWhisperer commented 9 years ago

As discussed on IRC, it would be helpful to get more details from @dyoder.

Specifically:

1. What kind of status information do we want to store? Just basic "stopped", "starting", "running", "shutting_down"?
2. What's the idea with applications vs. microservices? I may need a refresher on that topic. Unfortunately the relevant Wiki page isn't done yet. This is important for how the API is structured, however (i.e. do we need nested resources?).
3. @PandaPup brought up that we don't want to store this state on the kick server, but rather pass it through to the API server. That would mean the kick server would need to know its API server. Or it could cache the state, and the API server would ask for that information.
4. Is there a possibility we might want to use this facility to relay more fine-grained, application-specific status information in the future? I'm thinking stuff like load, requests handled, database size and so on. I.e. could this become a general facility for monitoring cluster health down the line, perhaps with a dashboard like New Relic?

If I think of anything else, I'll add it here.

freeformflow commented 9 years ago

@PandaWhisperer

1. I think there should be a distinction between "stopped" and "failed" if we can detect that.
2. I haven't written down my thoughts on the wiki about this, so I will draft them here. The CAM model stands for Cluster, Application, Microservice (sometimes we just call this a service). We wish to gracefully handle the non-trivial task of describing a generic cloud deployment. Other attempts at this have gotten lost in non-human-readable amounts of configuration and are not very approachable. CAM offers a much more lightweight solution without loss of generality.
   - Each component of the model is the child of the one named before it. Applications are children of Clusters and Microservices are children of Applications. Each level is self-describing and observes a scoped configuration. This avoids collisions and unexpected configuration effects.
   - The Cluster is the platform we're running on: the physical machines as configured on the cloud platform. For now that's CoreOS on EC2, but the CAM model doesn't require it. It could be any Linux cluster.
   - The Application is the user's software they wish to deploy. It is represented by a repository on the user's local machine and can be deployed to the cluster(s) of the user's choosing. A given instance of the Application is a child of the Cluster, but there may be instances on many Clusters simultaneously.
   - The Microservice is the smallest unit of the deployment. Microservices are represented by mixins in the launch/ directory of the Application's repository. Microservices are deployed as containers on the Cluster. Microservices are children of the Application in the sense that they are all deployed together and share a "global" configuration scoped at the Application level. We care about their deployment status as a group, but we don't restrict how they are connected. The Kick server allows us to establish an arbitrary network topology among containers, with each Microservice self-describing how it would like to be connected to the container network. At the moment we are using Docker containers, but the CAM model just requires that Microservices run in Linux containers, of which there are several competing offerings.
3. My statement was in reference to Issue #40, where Dan suggested we shouldn't make the Huxley API ask for information from the Kick server. Instead of a "pull", he prefers a "push" where information is sent to the central API as soon as it becomes available. We can configure the Kick server with the URL of the Huxley API server. We'll just need to modify panda-cluster as part of Alpha 03. It shouldn't be that big of a deal. For this ticket, you only need to get a single service to respond to the kick server reliably. Issue #70 is actually more related to the problem of shipping information to the Huxley API.
4. We are definitely tending toward more complexity, and you've laid a good foundation by rewriting the kick server in PBX, but we should keep the Kick server as simple as possible to get through Alpha 03. Let's stick to the states listed above and the ability to pass along an arbitrary string as an accompanying message.
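
To make the "fixed set of states plus an arbitrary message" idea concrete, here is a minimal Python sketch of how an incoming update could be validated. The state names come from this thread ("stopped", "starting", "running", "shutting_down", plus the proposed "failed"); the `status` and `message` field names and the `parse_status_update` helper are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class ServiceState(str, Enum):
    """States discussed in this thread; "failed" is the proposed addition."""
    STOPPED = "stopped"
    STARTING = "starting"
    RUNNING = "running"
    SHUTTING_DOWN = "shutting_down"
    FAILED = "failed"


@dataclass(frozen=True)
class StatusUpdate:
    state: ServiceState
    message: str = ""  # free-form text accompanying the state


def parse_status_update(payload: dict) -> StatusUpdate:
    """Validate a decoded JSON payload (hypothetical field names).

    Raises KeyError if "status" is missing, ValueError if the state is
    unknown, and TypeError if the message is not a string.
    """
    state = ServiceState(payload["status"])
    message = payload.get("message", "")
    if not isinstance(message, str):
        raise TypeError("message must be a string")
    return StatusUpdate(state, message)
```

Keeping the states in one enumeration means an unknown value is rejected at the edge rather than stored and interpreted later.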

PandaWhisperer commented 9 years ago

Thanks @PandaPup for the detailed information.

I think no. 2 is a great draft for that missing wiki page. You can almost just copy and paste it there, at least for starters.

Regarding no. 3, I read the referenced ticket and comment, and I don't see @dyoder expressing a specific preference for pushing the information; he just mentioned it as an alternative. I suppose this would be a good time to think about the tradeoffs involved in each approach.

For instance, what if the API server goes down or is unreachable for some period of time? We could implement a retry strategy, but if too many status updates are sent in a given time, it might bog the kick server down. And in a way, we'd still end up caching the information until we can send it off.
Also, while it might be possible to configure the kick server with a reference to the API server when it's first set up, what if that changes? Is that possible? I.e. can a cluster be reassigned to a different API server? If yes, we'd have to plan for that contingency.
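
One way to picture the push tradeoff is a forwarder that keeps only the latest undelivered status per service while the API server is unreachable. This is a minimal Python sketch under that assumption; `StatusForwarder` and the `send` callable are hypothetical, and coalescing deliberately drops intermediate states, which may or may not be acceptable.

```python
from typing import Callable, Dict


class StatusForwarder:
    """Push updates upstream, holding only the newest one per service on failure."""

    def __init__(self, send: Callable[[str, dict], bool]):
        # send(service, update) returns True when the API server accepted it.
        self._send = send
        self._pending: Dict[str, dict] = {}

    def report(self, service: str, update: dict) -> None:
        # A newer update replaces any older undelivered one for this service,
        # so the backlog is bounded by the number of services.
        self._pending[service] = update
        self.flush()

    def flush(self) -> None:
        for service in list(self._pending):
            if self._send(service, self._pending[service]):
                del self._pending[service]
            else:
                break  # upstream looks unreachable; keep the rest for later

    def pending_count(self) -> int:
        return len(self._pending)
```

Calling `flush()` on a timer would act as the retry strategy. As noted above, this is still a form of caching on the kick server, just a bounded one.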

Finally, I understand that a cluster may run more than one application. This would be reflected in the structure of the URLs, such as /application/:app_name/service/:service_name. Does each microservice know the name of the application it is a part of?

PandaWhisperer commented 9 years ago

As for the API design, I'm currently thinking of doing nested resources, as mentioned above. I.e. a URL scheme of /application/:app_name/service/:service_name, where each service is part of an application. We'll just have RESTful actions on each resource that the services can use to update their status.

Now, I just realized that the status will change over time, so perhaps we'll want to store that history as well. In that case we could have /application/:app_name/service/:service_name/status, which we could POST to in order to add a new status. The server would just add the status to the given service object, automatically adding a timestamp. Statuses would be stored in an array in the order in which they are received (i.e. chronological).
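
As a rough sketch of that idea, the server-side bookkeeping might look like the following in Python. The `StatusStore` class, its method names, and the entry fields are assumptions; the only parts taken from the proposal are the server-assigned timestamp and the append-in-arrival-order array.

```python
from datetime import datetime, timezone


class StatusStore:
    """In-memory status history, keyed by (application, service)."""

    def __init__(self, clock=None):
        # The clock is injectable so timestamps can be controlled in tests.
        self._clock = clock or (lambda: datetime.now(timezone.utc))
        self._history = {}

    def add_status(self, app, service, status, message=""):
        entry = {
            "status": status,
            "message": message,
            "timestamp": self._clock().isoformat(),  # added by the server
        }
        # Entries are appended in the order received.
        self._history.setdefault((app, service), []).append(entry)
        return entry

    def history(self, app, service):
        # Return a copy so callers cannot mutate the stored list.
        return list(self._history.get((app, service), []))

    def current(self, app, service):
        entries = self._history.get((app, service))
        return entries[-1] if entries else None
```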

This is a bit heavy, however, with all those nested resources. Alternatively, we could flatten the hierarchy by rolling all services into the application they belong to, so we'd have /application/:app_name/status which we'd POST to, but now we'd have to include the service name along with the status, and let the kick server sort it out.
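
To compare the two shapes side by side, here is a small Python sketch that resolves a status POST under either URL scheme to the same (application, service, status) triple. The path patterns follow the URLs proposed above; the `service` and `status` body fields and the `resolve_status_post` helper are assumptions.

```python
import re

# Nested form: the service is named in the URL.
NESTED = re.compile(
    r"^/application/(?P<app>[^/]+)/service/(?P<service>[^/]+)/status$"
)
# Flat form: the service must be named in the request body instead.
FLAT = re.compile(r"^/application/(?P<app>[^/]+)/status$")


def resolve_status_post(path, body):
    """Return (app, service, status) for a POST to either URL form."""
    match = NESTED.match(path)
    if match:
        return match["app"], match["service"], body["status"]
    match = FLAT.match(path)
    if match:
        if "service" not in body:
            raise ValueError("flat URL form requires a 'service' field")
        return match["app"], body["service"], body["status"]
    raise LookupError(f"no route for {path}")
```

Either form carries the same information; the flat one simply moves the service name from the path into the payload, which is the "let the kick server sort it out" part.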

I'd like @dyoder to weigh in on this proposition if possible.

dyoder commented 9 years ago

I've placed a detailed description of what we need to implement on our internal wiki. Ideally, that will turn into a bunch of individual tickets.

pandastrike / huxley

Propose a Kick Server API Schema for Monitoring Service Status #67