The basic CRUD API is complete. CLI support is now needed and is being tracked under joyent/go-conch#23.
Requirements have shifted, rendering this design inappropriate. The project is shelved until a new design can be developed.
Conch needs a centralized system for controlling various processes inside the Conch environment, ranging from centralized management of production datacenter inventory validation to localized management of Conch relays and livesys images inside an integration facility.
In the most general terms, the orchestration system should be a purpose-built workflow engine. It must provide a way to specify policies, focused on build operations, that determine the order of operations for the Conch relay and livesys applications. The validation system must be used to determine if those policies have completed successfully.
The user interface, whether HTML or CLI, is not described in this RFD and will be specified at a later point.
Concepts
Automation
Automation is the execution of individual tasks, with the goal of simplifying or standardizing work that was previously often done manually.
Workflow
A workflow is a set of pre-defined automations launched by a trigger condition. In the most general case, steps in a workflow may be optional or required and it is usually possible to nest workflows.
Workflow Engine
A workflow engine is a system that evaluates trigger conditions, triggers individual workflows, and processes the results. Usually workflow engines are purpose-built for the needs of a specific application.
Orchestration
Orchestration is concerned with bringing workflows together into processes or policies, with the goal of streamlining and reusing those processes. Automation provides the building blocks upon which orchestration processes are built.
Design
Workflows
Orchestration operations are tied to a 'workflow', which itself is tied to a hardware product. Every device is keyed to a hardware product, as well. This combination allows for workflow steps and validations to be linked to specific hardware revisions.
This is particularly useful when a newer version of a server design varies wildly from an older one. Both systems can be built and validated because their workflow, and the validations necessary to green-light the device, are tied to the hardware product specification.
Workflow Steps
Workflow steps are an ordered list of string names and validations. The string names are opaque to the orchestration system but signal different operations to the downstream clients.
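As a rough illustration of the structure described above, the following Python sketch models a workflow as an ordered list of steps tied to a hardware product. The class names, field names (including `validation_plan_id`), and the `step_after` helper are assumptions made for this example and are not part of the Conch schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkflowStep:
    # Opaque to the engine; only downstream clients interpret this name.
    name: str
    # Hypothetical field: identifies the validation to run for this step.
    validation_plan_id: Optional[str] = None


@dataclass
class Workflow:
    name: str
    # Ties the workflow, and therefore its validations, to a hardware product.
    hardware_product_id: str
    steps: List[WorkflowStep] = field(default_factory=list)

    def step_after(self, current: Optional[str]) -> Optional[WorkflowStep]:
        """Return the step following `current`, or the first step when
        `current` is None. Returns None once the list is exhausted."""
        if current is None:
            return self.steps[0] if self.steps else None
        names = [step.name for step in self.steps]
        position = names.index(current)  # raises ValueError for unknown names
        if position + 1 < len(self.steps):
            return self.steps[position + 1]
        return None


# Illustrative data only; step names and ids are made up.
build = Workflow(
    name="build",
    hardware_product_id="hp-example",
    steps=[WorkflowStep("burnin", "plan-burnin"), WorkflowStep("firmware")],
)
```

Because the engine only compares step names for ordering, clients remain free to attach any meaning they like to a name such as "burnin".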
Status
Two types of status exist, one for the execution of an entire workflow, and another for the execution of an individual step.
Workflow Status
Workflow status records the state of the execution of an entire workflow for a specific device. Most of the time, a device's workflow will either be "ongoing" or "completed". However, special circumstances may arise with a need to interrupt that flow.
If an external entity (probably a human) determines that a workflow must be stopped, the workflow status for that device will be set to 'abort'. The engine cannot reach out to client devices and forcibly halt execution, so 'abort' indicates that the workflow must cease when the current step finishes. When the device completes its current step, the workflow status for the device will be set to 'stopped'.
Similarly, an external entity may determine that the extraordinary circumstances have passed and a workflow may continue from a previously aborted state. In this case, the workflow status for that device will be set to 'resume'. When the device resumes work, the workflow status for that device will be set to 'ongoing'.
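The status changes described above can be read as a small state machine. The sketch below is one possible reading, not the project's implementation: the status names come from the text, while the function names and the decision to reject every transition not mentioned in the text are assumptions.

```python
# Transitions named in the text. Rejecting everything else is an assumption;
# the design does not say how unlisted transitions should be handled.
ALLOWED_TRANSITIONS = {
    "ongoing": {"abort", "completed"},
    "abort": {"stopped"},
    "stopped": {"resume"},
    "resume": {"ongoing"},
    "completed": set(),
}


def transition(current: str, new: str) -> str:
    """Return `new` if the move from `current` is allowed, else raise."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move workflow from {current!r} to {new!r}")
    return new


def on_step_finished(status: str) -> str:
    """Status after the device reports that its current step has finished.
    An aborted workflow becomes 'stopped'; other statuses are unchanged."""
    return transition(status, "stopped") if status == "abort" else status


def on_work_resumed(status: str) -> str:
    """Status after the device picks its work back up following 'resume'."""
    return transition(status, "ongoing") if status == "resume" else status
```

Note that 'abort' and 'resume' are requests recorded by an external entity, while 'stopped' and 'ongoing' are recorded when the device acts on them.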
Workflow Step Status
Workflow step status records the state of a particular workflow step for a particular device. From a client's perspective, status records are write-once. However, a status record must also contain the results of the appropriate validation. As such, the backend must be able to update an existing status record.
When a workflow step status is received with a status of 'complete' and 'data' is present, the validation subsystem will be called, using the validation plan id. The validation system call will be passed the relevant device id, hardware product id, validation plan id, and the data from the workflow step status. The result will be written back into the workflow step status record, indicating pass/fail status and a link to the full validation result.
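A minimal sketch of that flow follows, assuming the validation subsystem can be represented as a callable that returns a pass/fail flag and a link to the full result. The `StepStatus` fields, the `record_step_status` function, and the callable's signature are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical shape of the validation call: returns (passed, result_link).
Validator = Callable[[str, str, str, Dict[str, Any]], Tuple[bool, str]]


@dataclass
class StepStatus:
    device_id: str
    hardware_product_id: str
    validation_plan_id: Optional[str]
    state: str
    data: Optional[Dict[str, Any]] = None
    # Filled in by the backend, never by the client.
    validation_status: Optional[str] = None
    validation_result_link: Optional[str] = None


def record_step_status(status: StepStatus, validate: Validator) -> StepStatus:
    """Run validation only for a 'complete' status that carries data, then
    write the outcome back onto the same record."""
    should_validate = (
        status.state == "complete"
        and status.data is not None
        and status.validation_plan_id is not None
    )
    if should_validate:
        passed, link = validate(
            status.device_id,
            status.hardware_product_id,
            status.validation_plan_id,
            status.data,
        )
        status.validation_status = "pass" if passed else "fail"
        status.validation_result_link = link
    return status
```

This is the one place where the backend updates a record the client has already written, which is why the record is write-once only from the client's side.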
By default, if validation fails for a workflow step, the client will receive a failure indicator when requesting its next step. If 'retry' is set on the workflow step, the client will be told to run that step again. Any retry-able step must also set a maximum number of retries. Clients will not be allowed to proceed further once that maximum has been reached.
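The retry rule can be sketched as a single decision function. This is an illustrative reading only: the `StepPolicy` type, the `next_action` name, and the returned strings are assumptions, and the sketch treats an exhausted retry budget the same as a plain failure.

```python
from dataclasses import dataclass


@dataclass
class StepPolicy:
    # Mirrors the 'retry' setting and retry maximum described in the text.
    retry: bool = False
    max_retries: int = 0


def next_action(validation_passed: bool, policy: StepPolicy, retries_used: int) -> str:
    """Decide what a client is told when it asks for its next step.

    Returns 'proceed', 'retry', or 'fail' (hypothetical labels).
    """
    if validation_passed:
        return "proceed"
    if policy.retry and retries_used < policy.max_retries:
        return "retry"
    # Either the step is not retry-able or the maximum has been reached.
    return "fail"
```

For example, a step with `StepPolicy(retry=True, max_retries=2)` is retried after its first and second failed attempts at a retry, and the client is refused any further progress after that.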
It is possible for the validation system to fail internally, providing neither a fail nor success indicator. When this occurs, step retries must not happen and the step must be marked as failed. It should be possible to re-execute the validations of a step in this state, as long as the failed step is the most recent one. When a revalidation occurs, a new status record must be written containing the new validation result.
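One way to picture the internal-failure case is an append-only history of status records. In the sketch below, the "error" validation label, the record fields, and both function names are assumptions. The text only requires that the step be marked failed, that no retry happen, and that revalidation write a new record.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class StepStatusRecord:
    step_name: str
    state: str              # e.g. "complete" or "failed"
    validation_status: str  # "pass", "fail", or the hypothetical "error"


def validate_safely(run_validation: Callable[[], bool]) -> str:
    """Map a validation run to 'pass' or 'fail', or to 'error' when the
    validation system itself breaks and yields neither result."""
    try:
        return "pass" if run_validation() else "fail"
    except Exception:
        return "error"


def revalidate(
    history: List[StepStatusRecord],
    step_name: str,
    run_validation: Callable[[], bool],
) -> StepStatusRecord:
    """Re-run validation for the most recent step and append a new record.
    Earlier records are left untouched."""
    latest = history[-1] if history else None
    if latest is None or latest.step_name != step_name:
        raise ValueError("only the most recent step may be revalidated")
    if latest.validation_status != "error":
        raise ValueError("step did not fail because of a validation error")
    outcome = validate_safely(run_validation)
    # Assumption: a passing revalidation marks the step complete.
    state = "complete" if outcome == "pass" else "failed"
    record = StepStatusRecord(step_name, state, outcome)
    history.append(record)
    return record
```

Keeping the original "error" record in the history preserves the evidence that the validation system failed, separate from the later result.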
Schema
workflow
workflow_status
enum workflow_status_enum:
workflow_step
workflow_step_status
enum workflow_step_state_enum:
enum validation_status_enum:
Implementation
The orchestration API will exist within the existing Conch Mojo API codebase and feature an independent user interface. A standalone CLI will be developed in Go.
Auth
Authentication
The API will be divided into two segments, isolating authentication concerns. For endpoints used by automation, authentication will occur via HTTP Signatures, utilizing RSA keys generated and managed by a CLI tool. Users must be allowed multiple RSA keys and it should be possible to bake an RSA key into the orchestration CLI.
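To make the HTTP Signatures idea more concrete, the sketch below builds the string that a client would sign and the resulting Authorization header, assuming the draft-cavage HTTP Signatures scheme with the `rsa-sha256` algorithm. The RSA signing step itself is deliberately left out, since it depends on the key tooling. The function names, the key id, and the request path are hypothetical, and the source does not state which headers Conch would require.

```python
import base64
from typing import Dict, List


def signing_string(
    method: str, path: str, headers: Dict[str, str], signed_headers: List[str]
) -> str:
    """Build the newline-joined string that is fed to the RSA signer.
    Header names are lowercased; '(request-target)' is method plus path."""
    lowered = {name.lower(): value for name, value in headers.items()}
    lines = []
    for name in signed_headers:
        if name == "(request-target)":
            lines.append(f"(request-target): {method.lower()} {path}")
        else:
            lines.append(f"{name.lower()}: {lowered[name.lower()]}")
    return "\n".join(lines)


def authorization_header(
    key_id: str, signed_headers: List[str], signature: bytes
) -> str:
    """Format the Authorization header from an already computed signature."""
    encoded = base64.b64encode(signature).decode("ascii")
    header_list = " ".join(signed_headers)
    return (
        f'Signature keyId="{key_id}",algorithm="rsa-sha256",'
        f'headers="{header_list}",signature="{encoded}"'
    )


# Illustrative request; the host and path are made up.
example = signing_string(
    "GET",
    "/workflow/1",
    {"Host": "conch.example.com", "Date": "Thu, 05 Jan 2017 21:31:40 GMT"},
    ["(request-target)", "host", "date"],
)
```

Because the key id travels in the header, a server can hold several public keys per user and select the right one for verification, which fits the requirement that users be allowed multiple RSA keys.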
API endpoints used by user interfaces will use the same authentication as the Conch API server.
Authorization
Authorization will be managed by the existing concept of roles within the Conch database. For automated clients, permissions will be based on roles within the GLOBAL workspace. For human clients, permissions will be split. Administrators in the GLOBAL workspace will be able to see all workflows, workflow statuses, and devices. Otherwise, the user will see items based on the device list for their particular workspace.
Only Administrators in the GLOBAL workspace can create or modify workflows.
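The visibility rules for human clients can be sketched as follows. This is an assumption-laden illustration: roles are modelled as a mapping from workspace name to role string, the "admin" and "rw" role names are hypothetical, and devices are grouped by workspace in a plain dictionary rather than read from the Conch database.

```python
from typing import Dict, Set

GLOBAL = "GLOBAL"


def is_global_admin(roles: Dict[str, str]) -> bool:
    """True when the user holds the administrator role in GLOBAL."""
    return roles.get(GLOBAL) == "admin"


def visible_devices(
    roles: Dict[str, str], devices_by_workspace: Dict[str, Set[str]]
) -> Set[str]:
    """GLOBAL administrators see every device; everyone else sees only the
    devices belonging to the workspaces they hold a role in."""
    if is_global_admin(roles):
        workspaces = list(devices_by_workspace)
    else:
        workspaces = list(roles)
    visible: Set[str] = set()
    for workspace in workspaces:
        visible |= devices_by_workspace.get(workspace, set())
    return visible


def can_modify_workflows(roles: Dict[str, str]) -> bool:
    """Creating or modifying workflows is restricted to GLOBAL administrators."""
    return is_global_admin(roles)
```

Workflows and workflow statuses would presumably be filtered the same way, by following each record back to a device the user is allowed to see.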