serverlessworkflow / specification

Contains the official specification for the Serverless Workflow Domain Specific Language. It provides detailed guidelines and standards for defining, executing, and managing workflows in serverless environments, ensuring consistency and interoperability across implementations.
http://serverlessworkflow.io
Apache License 2.0

Add a new `containers` property to the workflow resources #998

Open cdavernas opened 2 months ago

cdavernas commented 2 months ago

What Would You Like to See Added?

Introduce a new containers property to the workflow's resources.

Why Is This Needed?

Currently, both shell and script processes are expected to run within the workflow's environment or platform, which can vary significantly from one runtime to another.

For example:

This flexibility is crucial, as we should not impose constraints on any runtime, especially regarding their architectural choices. The diversity in implementations for shell and script tasks can lead to undesired side effects or even total incompatibility. Consider the following scenarios:

One potential solution is to "jail" these tasks by requiring authors to specify a specific environment for each. While this approach is acceptable, it introduces new challenges, such as limiting the ability to run multiple commands or scripts successively within the same container—similar to the capabilities offered by GitHub Actions. Achieving functionality akin to GitHub Actions would necessitate significant and perhaps unnecessary refactoring of the domain-specific language (DSL).

Proposed Solution

I propose allowing authors to define top-level, reusable containers that can be assigned to script and shell tasks. This approach would:

Additionally, this proposal would address issue #874 by enabling authors to specify which versions of a script they wish to run, without being restricted to specific ones, as suggested in our weekly discussions.

Example Implementation

The following YAML illustrates the proposed structure:

document:
  dsl: '1.0.0-alpha1'
  namespace: default
  name: jail-shell-and-script-tasks
  version: '0.1.0'
use:
  containers:
    alpine:
      image: alpine:latest
do:
  - greetTheWorld:
      run:
        shell:
          command: 'echo "Hello, World!"'
        on: alpine  # Using a predefined container
  - greetTheWorldASecondTimeInSameContainer:
      run:
        shell:
          command: 'echo "Hello again, World!"'
        on: alpine  # Using the same predefined container as before
  - greetTheWorldUsingPython:
      run:
        script:
          code: >
            print('Hello, World!')
        on:
          image: python:3.10  # Using a container defined in-line

Conclusion

With this proposal, script and shell tasks would effectively become shorthand for container tasks and could potentially be transformed into cataloged custom functions. This change would lead to a more consistent and flexible approach for executing tasks within workflows.

cdavernas commented 2 months ago

@JBBianchi @ricardozanini @fjtirado @matthias-pichler What do you guys think?

ricardozanini commented 2 months ago

If I'm not mistaken, we discussed this possibility in a meeting or thread, didn't we? That said, +1 to this proposal.

Additionally, we could state that a runtime may have a default container definition to fall back on when none is defined, with the recommendation that runtimes make the default versions and architecture clear in their documentation.
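A minimal sketch of what relying on such a default could look like, assuming the task simply omits `on` (the fallback image named in the comment is purely illustrative):

do:
  - listKernelInfo:
      run:
        shell:
          command: 'uname -a'
        # no 'on' specified: the runtime falls back to its own default
        # container definition (e.g. alpine:latest), documented per runtime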

fjtirado commented 2 months ago

I guess I am an old guy: I prefer to call an embedded script within the memory address space of the process running the workflow definition (I am also pragmatic; there are plenty of libraries to run embedded JavaScript and Python in most languages). One of the problems I foresee with containers is that, if there are a lot of script calls, the performance cost will be prohibitive. There are also implications for shared memory, which won't be available when running the scripts in an external container. I do not think we really have to choose between embedded and "external" execution (for scripts; for shell, the invocation has to be external to the process running the workflow). I propose letting the user make the choice: if there is no `on` after the script, we can assume it is executed embedded. The problem of the script language version is, I think, manageable; we can enforce a minimum version of the script language to be implemented by the runtime (so users who want a newer version or a different script language use the container approach).

matthias-pichler commented 2 months ago

Enhance reusability, minimizing performance and cost impacts for most use cases.

@cdavernas so you are proposing that these scripts are also executed on the same container instance? Or just the same image?

I am just thinking that a frequently run workflow which runs a script step in a container, then waits a week, and then wants to reuse the same container instance would make it very cumbersome/expensive to keep so many container snapshots around.

embedded script within the memory adress space of the process running the workflow definition

@fjtirado how would you in this case prevent a user script from reading restricted env vars or fetching from local IPs? Examples:

I think ultimately the workflow DSL offers a way to describe control (i.e. task) flow as well as data flow. I would abstain from introducing another "out of band" data flow via e.g. files or shared memory. This would severely limit how a runtime could be implemented and make stateless implementations basically impossible. In my opinion, tasks should only communicate via the input and output defined in the DSL. This would allow implementations to be stateless and isolated, which is super important for multi-tenant implementations.

@cdavernas regarding your initial proposal: I think it would be great to make script & shell calls more deterministic by specifying a container image 👍 If we allow reuse, I think it should be limited to the container image, not the container instance.

I also see a second benefit: not only does it allow for consistent versions of e.g. Python, but for JavaScript, runtimes like node, bun, or deno could be specified and thus their internal modules used. Without the container property, every runtime could use a different JS runtime and authors would have to rewrite their script code when switching runtimes.
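As an illustration (the image tag and module usage here are assumptions, not part of the proposal text), pinning the JS runtime could look like this:

do:
  - printHostname:
      run:
        script:
          language: js
          code: |
            // node's built-in modules are guaranteed to exist because the image pins node
            const os = require('node:os');
            console.log(os.hostname());
        on:
          image: node:22  # same node version on every workflow runtime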

cdavernas commented 2 months ago

@cdavernas so you are proposing that these scripts are also executed on the same container instance? Or just the same image?

I was thinking of the container instance, but you raise a valid point, though IMHO it relates more to the runtime implementation than to the DSL (how long containers should be "kept" alive and whatnot).

Allowing container reuse, however, enables complex flows that could be performed in a non-sequential way (e.g. git clone, do something unrelated such as consuming an event, then git checkout without having to restart a whole container and re-clone), and would also mitigate @fjtirado's (rightful) concern about the performance cost.

Maybe we ought to think about additional properties to address your concern, such as a TTL or similar?
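Purely as a hypothetical illustration of such a property (neither the name `ttl` nor its shape is part of the proposal), a reusable container definition could carry its own lifetime hint:

use:
  containers:
    alpine:
      image: alpine:latest
      ttl: PT10M  # hypothetical: keep the instance alive for at most 10 minutes between tasks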

I think ultimately the workflow DSL offers a way to describe control (i.e. task) flow as well as data flow. I would abstain from introducing another "out of band" data flow via e.g. files or shared memory. This would severely limit how a runtime could be implemented and make stateless implementations basically impossible. In my opinion tasks should only communicate via input output defined in the DSL. This would allow to make implementations stateless and isolated, which is super important for multi tenant implementations.

@matthias-pichler I totally agree with you! @fjtirado as seductive as it may sound, the idea of in-process/embedded shell/script execution would, at best, break the portability/cross-compatibility of workflows, which is IMHO enough to dismiss it altogether.

matthias-pichler commented 2 months ago

@cdavernas I think the GitHub actions comparison is a good starting point:

GitHub Actions is specifically tailored for CI/CD, where repositories need to be checked out and thus a lot of file system access has to happen. It lets authors write multi-step workflows and even offers reusable components, but ultimately all "steps" run on the same (container) instance. Another such example is the Common Workflow Language (aka CWL). It is targeted more towards life sciences handling big data: inputs and outputs are mostly handled as files, and workflows are supposed to run in one long-running process.

In the middle ground there are things like GitLab CI: every "job" (aka "step") runs in its own container. So the "build" job would run before the "test" job, but in completely separate containers, and both would need to check out the repository (yes, it is slower than GH Actions). The "build" job would need to specify artifacts that are passed to downstream jobs. Another similar thing is Dagster, which writes intermediate results to S3 and passes them between jobs; each job has to load from and write back to S3, which enables restarting from a checkpoint.

And then lastly there are the more application-level workflow engines, e.g. AWS Step Functions. Every step is completely stateless and independent of the others. For performance, tasks that need the same files would have to be combined into one "longer" running task. Data flow only happens through the input & output of states. Other things in this category are Netflix Conductor, Uber Cadence, and things like Pipedream or Temporal, because they all have the concept of workers & dispatchers/orchestrators. I would call this category the "general purpose" workflow engines while the others are "special purpose". I see Serverless Workflow firmly in this category, and thus don't think CI/CD-specific performance for CI/CD tasks should be a main design goal when one would need to design specifically for it.

cdavernas commented 2 months ago

I think the GitHub actions comparison is a good starting point:

It is an AWESOME, near perfect comparison indeed.

And you are perfectly right: while the idea of reusing a container sounds appealing, it does go against concepts such as atomicity and statelessness. Plus, it would, as you pointed out previously, raise multiple infrastructural concerns.

Consequently, I propose we proceed with "reusable" container configurations (image, env, ...) and forget about "reusable" container instances.
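To make the distinction concrete, here is a sketch (image, property names, and values are illustrative) where two tasks share the same reusable configuration but each gets a fresh container instance:

use:
  containers:
    build-env:
      image: alpine/git:latest  # illustrative image with git preinstalled
      env:
        GIT_TERMINAL_PROMPT: '0'  # illustrative environment variable
do:
  - cloneRepo:
      run:
        shell:
          command: 'git clone https://github.com/serverlessworkflow/specification.git'
        on: build-env  # fresh instance created from the shared configuration
  - printKernel:
      run:
        shell:
          command: 'uname -a'
        on: build-env  # another fresh instance; no state is carried over from cloneRepo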


On a side note, you raise a very interesting point when speaking about artifacts. We had a long conversation with @JBBianchi about it last week, which concluded with the following observations:

matthias-pichler commented 2 months ago

On another note: I think we could also utilize use.containers to build a container

document:
  dsl: '1.0.0-alpha1'
  namespace: default
  name: jail-shell-and-script-tasks
  version: '0.1.0'
use:
  containers:
    alpine:
      image: alpine:latest
    node:
      build:
        dockerfile: https://github.com/nodejs/docker-node/blob/0c0069246367ac5ac0fc6bca141fb04faaca2f4b/22/bookworm/Dockerfile

The content of `build` would be inspired by the `build` property in docker-compose.
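For reference, docker-compose's build property also supports keys such as context and args, so a fuller (purely illustrative) form might look like this; the build argument is an assumption, not taken from the docker-node Dockerfile:

use:
  containers:
    node:
      build:
        context: https://github.com/nodejs/docker-node.git  # illustrative remote build context
        dockerfile: 22/bookworm/Dockerfile
        args:
          NODE_ENV: production  # illustrative build argument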

cdavernas commented 2 months ago

On another note: I think we could also utilize use.containers to build a container

@matthias-pichler Thank you for the suggestion, but I have some concerns about the idea, mainly because this property would only be relevant for a local Docker installation. In a remote environment, or on Kubernetes, simply building the image wouldn't be enough — we'd also need to push it to a registry, which would require additional configuration and possibly authentication. Moreover, the feature you're proposing can already be handled quite easily using a shell script or similar approach.

fjtirado commented 2 months ago

@cdavernas I do not think running multi-platform languages like Python or JS in embedded mode will break portability, as long as we define a compliance level for the script language (meaning that we support embedded scripting only for a particular set of languages, not an arbitrary one).
Regarding @matthias-pichler's concern about sharing global variables (effectively creating a parallel hierarchy), I agree, but I do not see how this would be prevented if the same container is reused, because the writer of the script might be tempted to share container state anyway. In summary, I do not see any problem with the container proposal, but I do feel it can live together with embedded support for popular scripting languages like Python and JS.
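To illustrate that coexistence under @fjtirado's suggestion (the semantics of omitting `on`, and the image tag, are assumptions of his proposal, not the spec's), the choice could be left to the author:

do:
  - transformEmbedded:
      run:
        script:
          language: js
          code: console.log('runs embedded, inside the workflow runtime process')
        # no 'on': embedded execution, limited to the spec-defined compliance level
  - transformIsolated:
      run:
        script:
          language: python
          code: print('runs in an isolated container')
        on:
          image: python:3.12  # newer or other versions go through the container approach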

matthias-pichler commented 2 months ago

I do not think running multi platform languages like python or js in embedded mode will break portability

The following script will only work on Unix-like systems. If someone runs the runtime on a Windows machine, it will fail even though Python is a "portable" language:

import os

os_info = os.uname()

Allowing authors to specify image: ubuntu:latest would make it work everywhere, regardless of the host operating system.
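For example, under the proposal the snippet above could be pinned to a Linux image (a python-based image and tag are used here as an assumption, so that the interpreter is available):

do:
  - printUname:
      run:
        script:
          language: python
          code: |
            import os
            # os.uname() only exists on Unix-like systems
            print(os.uname())
        on:
          image: python:3.10-slim  # always runs on Linux, regardless of the host OS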

if the same container is reused

Well, we could update the specification to say that the container will not be reused, which is what I am proposing. More concretely, we should decide what the behavior of the following is:

document: {}
do:
  - jsWrite:
      run:
        script:
          language: js
          code: |
            const fs = require('node:fs');
            const path = require('node:path');

            fs.writeFileSync(path.join(__dirname, 'file.txt'), 'Hello, world!');
  - jsRead:
      run:
        script:
          language: js
          code: |
            const fs = require('node:fs');
            const path = require('node:path');

            const content = fs.readFileSync(path.join(__dirname, 'file.txt'), 'utf8');
            console.log(content);
  - pythonRead:
      run:
        script:
          language: python
          code: |
            with open('file.txt', 'r') as f:
                content = f.read()
                print(content)

Do we say:

  1. Both reads will always fail because state is not preserved (what I recommend, because it is the most secure & scalable option and offers fewer ways to write weird dual-data-flow workflows)
  2. Both reads MUST succeed, effectively forcing a workflow to be executed on a single instance and never be distributed
  3. Leave it undefined, reducing portability

Option 2 would have far-reaching implications, for example:

document: {}
do:
  - jsWrite:
      run:
        script:
          language: js
          code: |
            const fs = require('node:fs');
            const path = require('node:path');

            fs.writeFileSync(path.join(__dirname, 'file.txt'), 'Hello, world!');
  - delay:
      wait:
        days: 365
  - jsRead:
      run:
        script:
          language: js
          code: |
            const fs = require('node:fs');
            const path = require('node:path');

            const content = fs.readFileSync(path.join(__dirname, 'file.txt'), 'utf8');
            console.log(content);

This would require the runtime to keep (potentially thousands of) containers around for a year.

matthias-pichler commented 2 months ago

On summary, I do not see any problem with the container proposal, but I do feel it can live together with embedded support for popular scripting languages like python and js.

I don't like embedded support because it exposes such a big security problem:

document: {}
do:
  - imds: # gets credentials from the EC2 instance metadata service
      run:
        script:
          language: python
          code: |
            import requests

            response = requests.get("http://169.254.169.254/latest/meta-data/iam/security-credentials/")
  - env: # leaks host secrets in a Lambda function
      run:
        script:
          language: js
          code: |
            console.log(process.env.AWS_SECRET_ACCESS_KEY)
  - container: # breaks out of the container
      run:
        script:
          language: python
          code: |
            import socket

            # Connect to the Docker socket
            sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            sock.connect("/var/run/docker.sock")

            # Send a request to list containers
            request = b"{\"method\": \"ContainerList\", \"id\": 1}\n"
            sock.sendall(request)
fjtirado commented 2 months ago

I think two irreconcilable schools of thought are colliding here ;) I come from the old 8-bit days, and for me using a different container for every script invocation is like using a nuclear bomb to kill three soldiers armed with arrows. But conceptually you are right.

matthias-pichler commented 2 months ago

Maybe haha. But I think security is job 0, and we should consider this when running author-provided code, especially in multi-tenant environments. Isolated container instances are one way, but I also know of some platforms (e.g. Atlassian, Treality, ...) that compile JS & Python provided by users to WebAssembly, which also offers some security boundaries. But even they wrap them in individual Lambda functions for added security.

github-actions[bot] commented 1 day ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.