insitro / redun

Yet another redundant workflow engine
https://insitro.github.io/redun/
Apache License 2.0

Can't stage out directory trees #9

Closed. dakoner closed this issue 2 years ago

dakoner commented 2 years ago

I have a batch job that runs a script which writes a directory tree to the local filesystem, and I want to stage the output tree using the outputs=[] option to script(). I don't know the directory tree contents (it could change based on arguments to the script). I basically want to do the equivalent of cp -r but it looks like outputs are staged using cp.

Is there a way to defer computing the outputs() list to after the script has run? In that case I could run something like glob.glob(output_dir + "/**", recursive=True) to get a full list of output files and have those mirrored (respecting the path structure under the directory tree).
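A minimal sketch of that glob idea, using only the standard library. The helper names list_output_files and plan_mirror and the destination prefix are hypothetical, not part of redun. It assumes the script has already finished writing output_dir. Note that glob skips dot-files by default.

```python
import glob
import os
from typing import List, Tuple


def list_output_files(output_dir: str) -> List[str]:
    """Return every regular file under output_dir, as paths relative to it."""
    paths = glob.glob(os.path.join(output_dir, "**"), recursive=True)
    # "**" also matches directories (including output_dir itself), so keep files only.
    files = [p for p in paths if os.path.isfile(p)]
    return sorted(os.path.relpath(p, output_dir) for p in files)


def plan_mirror(output_dir: str, dest_prefix: str) -> List[Tuple[str, str]]:
    """Pair each local file with a destination that keeps its relative path."""
    prefix = dest_prefix.rstrip("/")
    return [
        (os.path.join(output_dir, rel), prefix + "/" + rel.replace(os.sep, "/"))
        for rel in list_output_files(output_dir)
    ]
```

Each (local, remote) pair could then be copied one file at a time, which is the behavior cp -r would give for the whole tree.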

Otherwise, I'd end up putting this at the end of the script: aws s3 cp --recursive output_dir s3://my-output-bucket/final-data/

mattrasmus commented 2 years ago

Thanks @dakoner for the question.

Did you take a look at Dir(remote_path).stage(local_path)? Does that achieve what you want?

You should be able to use it like this:

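from redun import Dir, File, script, task

@task()
def run_prog(input: File, out_s3_path: str) -> Dir:
    return script(
        f"""
        prog local_file --output local_dir
        """,
        # Stage the remote input file to the local path "local_file" before the script runs.
        inputs=[input.stage("local_file")],
        # After the script finishes, stage the whole local_dir tree out to out_s3_path.
        outputs=Dir(out_s3_path).stage("local_dir"),
    )

dakoner commented 2 years ago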

Thanks, that solved the problem! I had only seen Dir() used directly in a task (https://github.com/insitro/redun/blob/main/examples/06_bioinfo_batch/workflow.py#L580). I tried Dir() as the outputs= value and it worked perfectly.