reanahub / reana

REANA: Reusable research data analysis platform
https://docs.reana.io
MIT License

IDEA: upload remote files #666

Open · agy-why opened this issue 2 years ago

agy-why commented 2 years ago

Dear developers,

I have a question regarding the reana-client upload feature.

Is it already possible, or is it planned, to execute something like:

reana-client upload s3://.../my-s3-data/

And if so, which services do you currently support? scp, ftp, s3, Google Cloud, WebDAV, ...

Thank you in advance,

Yori

agy-why commented 2 years ago

I realized this may not be the proper place to ask. Shall I open this issue on reana-client instead?

tiborsimko commented 2 years ago

Hi @agy-why, this repository is a perfect location for this issue; there is no need to move it.

Currently, we don't support remote storage services in the way suggested above. What is possible is for researchers to express remote file access needs as special stage-in and stage-out steps in their computational workflow graphs. That is, the first step of the workflow would download the inputs from S3, and the last step would upload the results back to S3. For a live example, please see the EOS stage-out example in the documentation: https://docs.reana.io/advanced-usage/storage-backends/eos/
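For illustration, such a stage-in/stage-out pattern might look roughly as follows in a serial workflow specification (a minimal sketch only; the bucket name, container images, and aws commands are illustrative assumptions, not something REANA ships):

inputs:
  parameters:
    s3_bucket: my-s3-data
workflow:
  type: serial
  specification:
    steps:
      - name: stage_in
        environment: 'amazon/aws-cli'    # illustrative image
        commands:
          # first step: download the inputs from S3 into the workspace
          - aws s3 cp "s3://${s3_bucket}/inputs/" ./inputs/ --recursive
      - name: analyse
        environment: 'python:3.10'
        commands:
          - python analyse.py --input ./inputs --output ./results
      - name: stage_out
        environment: 'amazon/aws-cli'
        commands:
          # last step: upload the results back to S3
          - aws s3 cp ./results/ "s3://${s3_bucket}/results/" --recursive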

We support virtually any external storage system where we can use Kerberos authentication or VOMS proxy authentication mechanisms. Examples include EOS or WLCG sites. Note also that we are in the middle of adding support for Rucio, see https://github.com/reanahub/reana-auth-rucio/issues/1
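For instance, a Kerberos-authenticated EOS stage-out step might look roughly like this (a sketch in the spirit of the EOS documentation page linked above; the container image and paths are placeholders):

workflow:
  type: serial
  specification:
    steps:
      - name: stage_out_to_eos
        environment: 'reanahub/reana-auth-krb5'   # placeholder image with xrdcp and krb5
        kerberos: true                            # ask REANA to inject Kerberos credentials
        commands:
          - xrdcp results.root root://eosuser.cern.ch//eos/user/j/johndoe/results.root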

That said, we have been planning to support syntactic sugar for remote files in a rather similar way to what you suggested. We thought of allowing a syntax like:

inputs:
  files:
    - s3('mybucket', 'myfile.csv')

REANA would then do an automatic stage-in and stage-out for this file. One advantage is that researchers wouldn't have to write explicit data staging steps in their DAG workflows.
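To sketch the idea a bit further (purely speculative syntax from this thread, not an implemented feature), stage-out could be expressed symmetrically:

inputs:
  files:
    - s3('mybucket', 'myfile.csv')     # staged in automatically before the run
outputs:
  files:
    - s3('mybucket', 'results.csv')    # staged out automatically after the run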

This is a bit similar to Snakemake's support for remote files; see https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html and the examples therein for using AWS S3 in Snakemake rules.

We hope to start working on this remote file syntax sugar sometime this winter.

tiborsimko commented 2 years ago

P.S. Another related idea we have been thinking about is to add support for popular protocols so that the REANA workspace could be manipulated via tools such as rclone. This might simplify the initial stage-in upload and the final stage-out download, especially when using many files or very large files.
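Purely as an illustration of how that could feel (entirely hypothetical; REANA exposes no such endpoint today, and the URL is made up), assuming the workspace were reachable over WebDAV:

# configure a hypothetical WebDAV remote pointing at a REANA workspace
rclone config create reana webdav url https://reana.example.org/workspaces/myanalysis vendor other

# initial stage-in: copy many or very large input files into the workspace
rclone copy ./local-inputs/ reana:inputs/

# final stage-out: fetch the results once the workflow has finished
rclone copy reana:results/ ./local-results/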

agy-why commented 2 years ago

Dear Tibor,

thank you for your clear and detailed response.

My personal use case would be to have a single workflow that could work with various data origins: my dev data live on a server that I can access via scp, while my prod data live on a private S3 infrastructure, but they may move to another one (not necessarily S3) after publication of the results.

Therefore I would find it useful to be able to specify not only the source but also the protocol for accessing the data outside of the workflow (git repo).

Currently, I need to implement two variants of my first step (get_data) to fetch the data into the workspace, and I choose between them via input parameters (see the sketch below). It is fully acceptable that way, but I'd greatly appreciate the rclone feature you suggested.
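Concretely, the get_data step looks roughly like this (a simplified sketch; the parameter names, hosts, and images are placeholders):

inputs:
  parameters:
    data_origin: scp                    # or: s3
workflow:
  type: serial
  specification:
    steps:
      - name: get_data
        environment: 'my-tools-image'   # placeholder image shipping scp and the MinIO client
        commands:
          # two staging variants, selected via the data_origin input parameter
          - if [ "${data_origin}" = "scp" ]; then scp -r user@dev-server:/data/ ./data/; else mc cp --recursive prod-s3/mybucket/data/ ./data/; fi
      - name: analyse_data
        environment: 'python:3.10'
        commands:
          - python analyse.py ./data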

Such an rclone feature would allow me to plug input data in and out of the same workflow by populating my workspace accordingly.

agy-why commented 2 years ago

An alternative would be the ability to combine workflows together; I don't know how far this is possible.

I have several workflows: a couple of get_data variants (e.g. get_data_from_scp) and an analyse_data workflow.

It would be an acceptable solution for me to be able to propagate the workspace of one of the get_data workflows to the analyse_data workspace, or to create a new workflow from, say, get_data_from_scp + analyse_data.

Is this already possible?

Thank you in advance.