Closed: @ricomnl closed this issue 2 years ago.
Hi @ricomnl, thanks for posting this.

We have been planning to break out the `debug=True` mode of `AWSBatchExecutor` into a distinct executor, `DockerExecutor`. So basically option 1. We're motivated to do this to help people test other upcoming executors, such as k8s. We would use volume mounts to access a local path as the scratch path.

As for what can be done currently: technically, `s3_scratch_path` doesn't need to be S3, since all file IO goes through `File` internally, so you could use a local path as the scratch path. We also allow configuring volume mounts in `@task`, but unfortunately not in `redun.ini` (although that's an easy addition). I haven't tested this, but if you make your scratch path the same path on the host and in the container, the file staging might just work as is. However, a new `docker` executor that applies this kind of config by default would likely be much easier for users.

Thanks for the ideas; contributions are welcome.
Thanks for the feedback @mattrasmus! Do you already have something sketched out for the `DockerExecutor`? If not, I'm happy to contribute that feature. I'm planning to add a `GoogleCloudLifeSciencesExecutor` in the near future as well, so this would be an easier start.

Regarding your suggestion for the current hack: using matching host and container paths combined with the volume mounts gets further than what I had gotten before, but eventually it enters the `submit_command()` function, which does some copying via S3 and fails:
https://github.com/insitro/redun/blob/main/redun/executors/aws_batch.py#L566
My setup is simple: I'm inside the `05_aws_batch` example folder and am running:

```shell
redun run workflow.py count_colors_by_script --data data.tsv --output-path $(pwd)/redun
```

with the following `redun.ini`:
```ini
[...]

[executors.docker]
type = aws_batch
image = redun_example
queue = test
s3_scratch = /Users/ricomeinl/Desktop/projects/redun/examples/05_aws_batch/redun
job_name_prefix = redun-example
debug = True
```
And the function is adapted to:
```python
[...]

@task(
    executor='docker',
    volumes=[(
        "/Users/ricomeinl/Desktop/projects/redun/examples/05_aws_batch/redun",
        "/Users/ricomeinl/Desktop/projects/redun/examples/05_aws_batch/redun"
    )])
def count_colors_by_script(data: File, output_path: str) -> Dict[str, File]:
    """
    Count colors using a multi-line script.
    """
    # Here, we use the same script as in 04_script, but now we do File staging
    # to and from S3.
    output = File(output_path + "color-counts.tsv")
    log_file = File(output_path + "color-counts.log")
    return script(
        f"""
        echo 'sorting colors...' >> log.txt
        cut -f3 data | sort > colors.sorted
        echo 'counting colors...' >> log.txt
        uniq -c colors.sorted | sort -nr > color-counts.txt
        """,
        executor="docker",
        inputs=[data.stage("data")],
        outputs={
            "colors-counts": output.stage("color-counts.txt"),
            "log": log_file.stage("log.txt"),
        },
    )
```
Just wanted to share an update. We are preparing a new `docker` executor that will support this behavior (i.e. using a local scratch dir accessible via volume mount). This also simplifies implementing other executors, such as a k8s-based one currently in development. I can share more soon, when we're ready to post it to the public repo.
Sounds great! Thanks for the update, @mattrasmus.
@ricomnl I just pushed an update that includes `DockerExecutor`.

The docs are here: https://insitro.github.io/redun/config.html#docker-executor

There is a quick example of how it works here: https://github.com/insitro/redun/tree/main/examples/docker
Let me know if this addresses your need for this ticket.
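For quick reference, a minimal `redun.ini` for the new executor should look roughly like the sketch below. The `scratch` option name and the path are from memory and may be slightly off; the linked docs are the source of truth.

```ini
[executors.docker]
type = docker
image = redun_example
# Local directory to use as scratch space instead of S3
# (option name assumed; check the Docker executor docs linked above).
scratch = /tmp/redun_scratch
```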
Awesome, great stuff! Thanks @mattrasmus.
Being able to run workflows locally with Docker is, in my opinion, a huge advantage of Nextflow over other workflow tools. I saw there is a mode (`debug=True`) to run local Docker containers in redun as well, but it relies on the S3 scratch space. I think we can easily add a Docker executor, which would let folks run pipelines in a fully cloud-agnostic way, using local Docker containers for tasks. There are two options:

1. Add a Docker executor and use volume mounts to mount local folders (I'm assuming we can then process files without staging them).
2. Fewer changes: with the AWS Batch executor, one can already use the `debug=True` flag to run the pipeline in local Docker containers. To overcome the S3 dependency, one can use a locally hosted minio. The only change we'd need to make is to add the `endpoint_url` parameter to the boto S3 client in two places in `file.py`:
Change https://github.com/insitro/redun/blob/main/redun/file.py#L462 and https://github.com/insitro/redun/blob/main/redun/file.py#L448 so that the S3 clients accept a configurable endpoint, roughly as sketched below.
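A minimal sketch of the kind of change I mean. The `S3_ENDPOINT_URL` environment variable is purely illustrative, and the exact constructor calls in `file.py` may differ:

```python
import os

import boto3
import s3fs

# Point both clients at a custom S3 endpoint (e.g. a local minio at
# http://localhost:9000); None keeps the default AWS endpoint.
endpoint_url = os.environ.get("S3_ENDPOINT_URL")

# boto3 client with an optional custom endpoint.
s3_client = boto3.client("s3", endpoint_url=endpoint_url)

# s3fs filesystem pointed at the same endpoint.
s3_fs = s3fs.S3FileSystem(client_kwargs={"endpoint_url": endpoint_url})
```

With something like that in place, running minio locally (for example `docker run -p 9000:9000 minio/minio server /data`) and setting `S3_ENDPOINT_URL=http://localhost:9000` should be enough to keep the existing S3-based staging code working without AWS.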
Let me know what you think. I'm happy to draft a quick PR to make it happen.