insitro / redun

Yet another redundant workflow engine
https://insitro.github.io/redun/
Apache License 2.0

Running redun pipelines locally with Docker #30

Closed: ricomnl closed this issue 2 years ago

ricomnl commented 2 years ago

Being able to run workflows locally with Docker is, in my opinion, a huge advantage of Nextflow over other workflow tools. I saw there is a mode (debug=True) for running local Docker containers in redun as well, but it relies on the S3 scratch space.

I think we can easily add a Docker executor, which would allow folks to run pipelines in a fully cloud-agnostic way, using local Docker containers for tasks. There are two options:

1) Add a Docker executor and use volume mounts to mount local folders (I'm assuming we can process files without staging them).

2) Fewer changes: with the AWS Batch executor, one can use the debug=True flag to run the pipeline in local Docker containers. To overcome the S3 dependency, one can use a locally hosted minio. The only change we'd need to make is to add the endpoint_url parameter to the boto S3 client in two places in file.py:

Change https://github.com/insitro/redun/blob/main/redun/file.py#L462 to:

[...]
client = _local.s3_raw = boto3.client("s3", endpoint_url=endpoint_url)

Change https://github.com/insitro/redun/blob/main/redun/file.py#L448 to:

[...]
client = _local.s3 = s3fs.S3FileSystem(anon=False, client_kwargs={"endpoint_url": endpoint_url})
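For local testing, the endpoint could then point at a minio container, e.g. (just a sketch; the port and credentials are example values):

# Run a local S3-compatible minio server (credentials are examples).
docker run -p 9000:9000 \
    -e MINIO_ROOT_USER=minio \
    -e MINIO_ROOT_PASSWORD=minio123 \
    minio/minio server /data

and then set endpoint_url="http://localhost:9000", with matching AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in the environment so boto3/s3fs can authenticate against minio.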

Let me know what you think. I'm happy to draft a quick PR to make it happen.

mattrasmus commented 2 years ago

Hi @ricomnl thanks for posting this.

We have been planning to break out the debug=True mode of AWSBatchExecutor into a distinct DockerExecutor, so basically option 1. We're also motivated to do this to help people test other upcoming executors, like k8s. We would use volume mounts to access a local path as the scratch path.

As for what can be done currently: technically, s3_scratch_path doesn't need to be S3, since all file IO goes through File internally. So you could use a local path as the scratch path. We also allow configuring volume mounts in @task, but unfortunately not in redun.ini (although that's an easy addition). I haven't tested this, but if you made your scratch path the same path on the host and in the container, the file staging might just work as is. However, a new docker executor that does this kind of config by default would likely be much easier for users.
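For example, something like this might work (again, untested; the executor name and paths here are placeholders):

[executors.batch_local]
type = aws_batch
image = my_image
queue = unused  # presumably ignored when debug = True
s3_scratch = /tmp/redun_scratch
debug = True

with the same path mounted into the container on the task:

@task(
    executor="batch_local",
    # Same path on host and in the container, so staged paths line up.
    volumes=[("/tmp/redun_scratch", "/tmp/redun_scratch")],
)
def my_task(data: File) -> File:
    ...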

Thanks for the ideas; contributions are welcome.

ricomnl commented 2 years ago

Thanks for the feedback @mattrasmus! Do you already have something sketched out for the DockerExecutor? If not, I'm happy to contribute that feature. I'm planning to add a GoogleCloudLifeSciencesExecutor in the near future as well so this would be an easier start.

ricomnl commented 2 years ago

Regarding your suggestion for the current hack: using matching host and container paths combined with the volume mounts gets further than I had gotten before, but eventually it reaches the submit_command() function, which does some copying via S3 and fails: https://github.com/insitro/redun/blob/main/redun/executors/aws_batch.py#L566

My setup is simple: I'm inside the 05_aws_batch example folder and am running:

redun run workflow.py count_colors_by_script --data data.tsv --output-path $(pwd)/redun

with the following redun.ini:

[...]

[executors.docker]
type = aws_batch
image = redun_example
queue = test
s3_scratch = /Users/ricomeinl/Desktop/projects/redun/examples/05_aws_batch/redun
job_name_prefix = redun-example
debug = True

And the function is adapted to:

[...]

@task(
    executor="docker",
    volumes=[(
        "/Users/ricomeinl/Desktop/projects/redun/examples/05_aws_batch/redun",
        "/Users/ricomeinl/Desktop/projects/redun/examples/05_aws_batch/redun",
    )],
)
def count_colors_by_script(data: File, output_path: str) -> Dict[str, File]:
    """
    Count colors using a multi-line script.
    """
    # Here, we use the same script as in 04_script, but now we do File staging
    # to and from the local scratch path.
    output = File(output_path + "/color-counts.tsv")
    log_file = File(output_path + "/color-counts.log")

    return script(
        f"""
        echo 'sorting colors...' >> log.txt
        cut -f3 data | sort > colors.sorted

        echo 'counting colors...' >> log.txt
        uniq -c colors.sorted | sort -nr > color-counts.txt
        """,
        executor="docker",
        inputs=[data.stage("data")],
        outputs={
            "colors-counts": output.stage("color-counts.txt"),
            "log": log_file.stage("log.txt"),
        },
    )

mattrasmus commented 2 years ago

Just wanted to share an update. We are preparing a new docker executor that will support this behavior (i.e., using a local scratch dir accessible by volume mount). This also simplifies implementing other executors, like a k8s-based one in development. I can share more soon, when we're ready to post it to the public repo.

ricomnl commented 2 years ago

Sounds great! Thanks for the update @mattrasmus.

mattrasmus commented 2 years ago

@ricomnl I just pushed an update that includes DockerExecutor.

The docs are here: https://insitro.github.io/redun/config.html#docker-executor

There is a quick example of how it works here: https://github.com/insitro/redun/tree/main/examples/docker
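For a quick taste, the config looks roughly like this (see the linked docs for the exact set of options; the scratch key here names the local scratch directory that gets volume-mounted in place of S3):

[executors.docker]
type = docker
image = redun_example
scratch = /tmp/redun_scratch

Tasks then just set executor = "docker" and file staging goes through the local scratch path instead of S3.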

Let me know if this addresses your need for this ticket.

ricomnl commented 2 years ago

Awesome, great stuff! Thanks @mattrasmus