Workflow proposal and application improvements

dhiguero commented 3 years ago

Hi everyone, following the last community call, I am adding this issue to reflect an alternative point of view with regard to the presented workflow proposal. The key elements that we feel could improve the next release of OAM are:

Application lifecycle hooks
Workflow
MutatingTraits / TemplateTrait / RenderTraits (Not the best name, but captures the spirit)

While Application lifecycle hooks and MutatingTraits can be extracted to their own proposals, we feel at this point that adding both to this document may facilitate the broader discussion about the workflow.

Application lifecycle hooks

Following the approach Kubernetes takes at the pod level, where parts of the lifecycle have hooks defined for PostStart and PreStop events, there may be a benefit in adding this type of integration mechanism at the application level.

This can be used to trigger processes that help with common tasks performed on the application, improving the composition of applications. The initial list of proposed hooks could be:

BeforeStart: Before rendering the components execute a workflow. For example, this can be used to prepare storage.
AfterStart: After the application has started (status ok at application level), execute a workflow. For example, register the new application for internal discovery.
OnUpdate: After a new revision is applied execute a workflow. For example, to migrate the database schemas.
OnError: In case of a component crashing, execute a workflow. For example, to trigger a notification.
AfterFinalization: After the application has been removed successfully, execute a workflow. For example, to cleanup external resources.

Workflow

We propose a workflow to capture a simple process consisting of several tasks executed in order. The workflow would be a new top-level entity similar to an application. In this alternative, the workflow is not embedded in the application because we think there may be use cases where the workflow has tasks affecting several applications. Moreover, a workflow in itself could represent an ordered execution of individual tasks with the expectation that they will finish (e.g., a processing pipeline), which is not the case for an application.

Some use cases that we consider:

Using a workflow to trigger periodic tasks: Assume a scenario where some maintenance tasks need to be executed periodically. For example, perform a backup of a set of tables.
Preload database schema/migrations: In this case the workflow could have a set of tasks that connect to the database and set the initial schema or perform a migration to a newer schema if required.
Capture processing pipelines: While the expectation of an application/service in its current form is that once deployed it needs to keep running, adding a workflow where a Task is expected to finish opens up the possibility of using workflows to define the logic of a processing pipeline. For example, consider an image processing pipeline where each task may apply a different filter/process to the dataset.

To illustrate how this entity could be used, the following example describes a simple workflow to trigger a migration of a database.

kind: Workflow
metadata:
  name: update-db
spec:
  tasks:
    - type: wait-for-db
      description: Waiting for db to be up
      properties:
         connection_string: ....
    - type: update-schema
      description: Updating schema to new version
      properties:
         connection_string: ....
         update_scripts: ...
    - type: migrate-data
      description: Migrating data to new tables
      properties:
         connection_string: ....

We also consider the possibility of having a task that deploys an application, referencing it either by a URL or by another type of identifier.

  tasks:
    - type: launch-qa-environment
      properties:
         app: <reference to the application to be deployed>

With respect to task definition, an approach similar to the proposed one could be used, with WorkflowStepDefinition renamed to TaskDefinition.
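To make the "tasks executed in order" semantics concrete, here is a minimal sketch of a runner for the `update-db` example. The `Task` struct mirrors the YAML fields shown above; the `TaskFunc` handler type, the handler registry, and the stop-on-first-failure rule are assumptions for illustration, not something the proposal specifies.

```go
package main

import "fmt"

// Task mirrors one entry under spec.tasks in the YAML example.
type Task struct {
	Type        string
	Description string
	Properties  map[string]string
}

// TaskFunc is a hypothetical handler registered for a task type.
type TaskFunc func(props map[string]string) error

// Run executes the tasks strictly in order.
// Assumption: an unknown type or a failing task stops the workflow.
func Run(tasks []Task, handlers map[string]TaskFunc) error {
	for i, t := range tasks {
		h, ok := handlers[t.Type]
		if !ok {
			return fmt.Errorf("task %d: no handler for type %q", i, t.Type)
		}
		fmt.Println(t.Description)
		if err := h(t.Properties); err != nil {
			return fmt.Errorf("task %d (%s): %w", i, t.Type, err)
		}
	}
	return nil
}

func main() {
	// Placeholder handler; a real one would talk to the database.
	noop := func(props map[string]string) error { return nil }
	handlers := map[string]TaskFunc{
		"wait-for-db":   noop,
		"update-schema": noop,
		"migrate-data":  noop,
	}
	tasks := []Task{
		{Type: "wait-for-db", Description: "Waiting for db to be up"},
		{Type: "update-schema", Description: "Updating schema to new version"},
		{Type: "migrate-data", Description: "Migrating data to new tables"},
	}
	if err := Run(tasks, handlers); err != nil {
		fmt.Println("workflow failed:", err)
	}
}
```

Because every task is expected to finish, the runner returns once the last task completes, which is the main behavioral difference from an always-on application.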

Other examples

Image processing workflow

Use case: Retrieve a dataset and process it in a pipeline.

```
kind: Workflow
metadata:
  name: image-processing
spec:
  tasks:
    - type: retrieve-dataset
      description: Retrieving dataset from source
      properties:
         from: ....
         to: pvc-XYZ
    - type: apply-filter
      description: Reducing noise
      properties:
         filter: A
         dataset: pvc-XYZ
    - type: apply-filter
      description: Edge detection
      properties:
         filter: B
         dataset: pvc-XYZ
```

Blue-Green version rollout (case 2 in workflow_policy proposal)

Use case: Deploy a new version of the app in two clusters.

```
kind: Workflow
metadata:
  name: blue-green-rollout
spec:
  tasks:
    - type: deploy-application
      description: Rendering new version
      properties:
         app:
         dry-run: true
         render_output:
    - type: deploy-application
      description: Deploying new version
      properties:
         app:
         partition: 50%
         successThreshold: ...
    - type: apply-filter
      description: Promotion
      properties:
         manualApproval: true
         rollbackIfNotApproved: true
```

Multi-cluster deployment (case 1 in workflow_policy proposal)

Use case: Deploy an application in two clusters.

```
kind: Workflow
metadata:
  name: remote-deployment
spec:
  tasks:
    - type: deploy-application
      description: Rendering new version
      properties:
         app:
         dry-run: true
         render_output:
    - type: remote-deploy
      description: Deploying on cluster A
      properties:
         app:
         clusterSelector: east
         replicas: 70%
    - type: remote-deploy
      description: Deploying on cluster B
      properties:
         app:
         clusterSelector: west
         replicas: 30%
```

Orchestrating apps (case 3 in workflow_policy proposal)

Use case: Wait for an application to be deployed before starting the other.

```
kind: Workflow
metadata:
  name: dependent-applications
spec:
  tasks:
    - type: deploy-application
      description: Deploying database
      properties:
         app:
    - type: hook-wait
      description: Waiting for database to be up
      properties:
         app:
         hook: AfterStart
    - type: deploy-application
      description: Deploying on wordpress
      properties:
         app:
         patch:
           to:
             field: spec.containers[0].envFrom[0].secretRef.name
           valueFrom:
             apiVersion: database.example.org/v1alpha1
             kind: MySQLInstance
             name: my-db
             field: status.connSecret
```

MutatingTraits / TemplateTrait / RenderTraits

One of the major problems with trait usage is that execution is delegated to the associated operators. When several traits are applied to the same component, several operators may end up competing to modify the same target elements (e.g., the generated deployment), which can translate into undesired reboots of the application and uncertain "startup" times.

As some of the elements in the proposed workflow may be used to tackle this issue, an alternative approach could be to add a new type of trait that is applied in order: each trait is provided with the current rendering of the component and must return the modified one. Consider a simple interface such as:

// Apply the trait to an incoming entity or list of entities separated by ---
Apply(input string)(*string, error)

In this approach, the runtime first processes this type of trait, applying one after the other in order, before rendering any of the components. Once all of those traits are applied, the rendered objects are created in Kubernetes or on the target runtime. Notice that this also remediates conflicts in scenarios where two traits try to change the same parameter, because the trait applied last determines the final value.

apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-example-app
  ...
spec:
  components:
    - name: backend
      type: company/test-backend # test-backend is referenced from other namespace
      properties:
        debug: "true" # debug is a parameter defined in the test-backend component.
      renderTraits:
        - type: logging-export # assumes it adds a new sidecar in Kubernetes
        - type: scaler # will modify the replication factor
          properties:
            replicas: 4
        - type: add-sidecar-B # adds another sidecar
        - type: add-sidecar-C # adds another sidecar

In the previous example, the component backend will be modified by adding a sidecar to export the logs, then the replication factor will change to 4. After that, two new sidecars will be added before rendering the component into the runtime.
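The ordering described above can be sketched as a small pipeline around the proposed `Apply` signature. Everything other than that signature is an assumption made for illustration: the `RenderTrait` interface name, the toy `appendLine` and `setReplicas` traits, and the flat key/value manifest that stands in for a real rendered component.

```go
package main

import (
	"fmt"
	"strings"
)

// RenderTrait wraps the Apply signature proposed above (the interface name is hypothetical).
type RenderTrait interface {
	Apply(input string) (*string, error)
}

// appendLine is a toy trait that appends one line, standing in for "add a sidecar".
type appendLine struct{ line string }

func (a appendLine) Apply(input string) (*string, error) {
	out := strings.TrimRight(input, "\n") + "\n" + a.line + "\n"
	return &out, nil
}

// setReplicas is a toy trait that rewrites a top-level "replicas:" line.
type setReplicas struct{ n int }

func (s setReplicas) Apply(input string) (*string, error) {
	lines := strings.Split(input, "\n")
	found := false
	for i, l := range lines {
		if strings.HasPrefix(l, "replicas:") {
			lines[i] = fmt.Sprintf("replicas: %d", s.n)
			found = true
		}
	}
	if !found {
		return nil, fmt.Errorf("no replicas field in input")
	}
	out := strings.Join(lines, "\n")
	return &out, nil
}

// ApplyAll threads the manifest through every trait in declaration order,
// so each trait sees the output of the previous one.
func ApplyAll(manifest string, traits []RenderTrait) (string, error) {
	current := manifest
	for i, t := range traits {
		out, err := t.Apply(current)
		if err != nil {
			return "", fmt.Errorf("trait %d: %w", i, err)
		}
		if out == nil {
			return "", fmt.Errorf("trait %d returned no output", i)
		}
		current = *out
	}
	return current, nil
}

func main() {
	rendered, err := ApplyAll("name: backend\nreplicas: 1\n", []RenderTrait{
		appendLine{"sidecar: logging-export"},
		setReplicas{4},
		appendLine{"sidecar: B"},
		appendLine{"sidecar: C"},
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(rendered)
}
```

Only the final string would be handed to the runtime, so the component is created once instead of being patched repeatedly by competing operators.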

With respect to how implementable this approach is: if I remember correctly, Pulumi uses this internally and defines a gRPC interface that enables operators to register in the runtime and be invoked whenever the trait is applied.

Open Questions

Data sharing in the workflow: There are two types of data that we may be interested in. On the simplest side, it may be useful to include a method to pass content through variables, following an approach similar to the one found in GitHub Actions. On the other side of the spectrum, there is also the need to share data files. For that we could either rely on the recommendation of using a type of persistent storage that is mounted on each task, or try to capture this extra complexity in the design.

wonderflow commented 3 years ago

Very good proposal! I have several questions:

How would the Application lifecycle hooks work? Could you give some YAML examples? I'm really interested in them.
In KubeVela we embed the Workflow section into the Application, so users can use the Application as a unified UI object and we can avoid conflicts. For example: 1) different workflows can conflict with each other; 2) an Application without a workflow has its own reconcile logic, which could conflict with an outside workflow.
About the traits (MutatingTraits / TemplateTrait / RenderTraits), I think we should not expose these complex concepts to users.
- In KubeVela, we define a patch inside the TraitDefinition to solve the RenderTraits problems. The controller gathers all patch traits and applies their patches to the workload: https://kubevela.io/docs/platform-engineers/cue/patch-trait

hongchaodeng commented 3 years ago

Hi @dhiguero. Thanks for raising the issue.

I have looked through your proposal. These things have been encountered, raised, and thought through before. The end result is the Workflow design. Let me share more thoughts on these items:

Application lifecycle hooks...

The problem is that there is no unanimous consensus on Application Start/Stop/Health/Error. Instead, this will be handled in Workflow -- workflow steps can be extended to provide customized hooks. Here is an example:

workflow:
- type: BeforeApplyHook
  properties:
    container: ...
    web: ...
- type: apply-components
- type: CheckHealthHook
  properties:
    onHealthy: ...
    onFailure: ...

We propose a workflow to capture a simple process consisting on several tasks executed in order.

We share similar problems and solutions, but we shouldn't add a Workflow API, because the entire concept should focus on the Application. When a user delivers apps, the Application is the single entry point and delivery object. Everything revolves around delivering the Application.

Additionally, if we provide a Workflow API, how is it different from other projects like Tekton or Argo? We shouldn't reinvent the wheel and build another Workflow API. We should just glue Argo/Tekton APIs into KubeVela via the Application concept.

Using a workflow to trigger periodic tasks

This could be defined as a Component.

Preload database schema/migrations

This could be defined as a Trait, which under the hood could be run as a job. We are planning to mark special traits that won't be applied by the generic applier. It will look like:

kind: Application
spec:
  components:
  - name: web
    traits:
    - type: db-migration
      # special mark: don't apply it in generic applier
      disableGenericApply: true

workflow:
# Apply the db-migration trait first. Then apply other components.
- type: apply-single-trait
  properties:
    name: web
    type: db-migration
- type: generic-apply # this will apply all components and traits except the disabled ones.

Capture processing pipelines:

This could be defined as a Component. We can just make it a Tekton pipeline component.

MutatingTraits / TemplateTrait / RenderTraits

This is actually defined in the CUE template. YAML is not well suited to describing a scripting process; CUE is the best tool to handle these tasks.

dhiguero commented 3 years ago

Thanks @wonderflow, @hongchaodeng for your comments. I will try to answer the different questions.

With respect to application lifecycle hooks, it could look like:

apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-example-app
  ...
spec:
  lifecycle:
    beforeStart:
      task:
        type: task-name
        description: Execute XYZ before starting the app
        properties:
          ...
    afterStart:
      ...
    onUpdate:
      ...
    onError:
      ...
    afterFinalization:
      ...

I think that from the point of view of the spec, it is actually possible to define what we mean by those lifecycle states so that there is a consensus with respect to the expectation of what will be the behavior in different runtimes.

With regard to the workflow, I think it deserves an entity of its own, hence this proposal, which offers a point of view different from the one already presented in KubeVela. It is true that similar entities already exist in the Kubernetes landscape. However, I think this is a proposal for the OAM spec, which is container/runtime agnostic, so adding a workflow entity to OAM will benefit other target runtimes/frameworks and will provide a standardized expectation of what a workflow is from the point of view of OAM. In terms of particular implementations, I agree we should reuse other available technologies, but that is a discussion at the implementation level for a particular target runtime.

Related to using components as tasks, the main issue I found is the expectation that a component inside an application is always on, rather than a job-like execution. Similarly, using a trait does not guarantee when the effect will take place, as traits are not linked to or executed at a specific lifecycle event of the application entity.

About Mutating/Template/Render Traits, my proposal is not to embed scripting, but to enable some traits to be applied in a deterministic order. As with the workflow, I think it is useful to expose this element to users, as it has a specific meaning/expectation. KubeVela may have a specific Trait that is treated differently, but this is at the runtime level. That trait will not necessarily be available on other runtimes, and I think this type of trait covers more than patching operations. For example, imagine a render trait that preconfigures the number of replicas based on the history of the component and forecasted requests. That type of approach requires logic that I think is outside the scope of patching and CUE.
