seqeralabs / tower-cli

Nextflow Tower CLI tool
Apache License 2.0
41 stars 9 forks source link

Add support for Data Explorer functionality (list, add) #381

Open robnewman opened 6 months ago

robnewman commented 6 months ago

Add a new command:

tw data-links

that interacts with the new Data Explorer data-links API endpoint in the Seqera Platform. Some suggested functionality (need to add all the auth syntactic sugar):

tw data-links list --workspace=<workspaceId>                             # list data-links in a workspace
tw data-links list --workspace=<workspaceId> --provider=<cloudProvider>  # subset to a specific cloud provider
tw data-links list --workspace=<workspaceId> --type <cloud|custom>       # subset to auto cloud or custom data links
tw data-links add --workspace=<workspaceId> --name=<dataLinkName> --credentials=<credentials> --description=<description> --provider=<cloudProvider> # add a custom data link to a workspace
tw data-links delete --datalink=<datalinkId> --workspace=<workspaceId>   # delete a datalink in a workspace
tw data-links cp --datalink=<datalinkId>:/path/to/object.txt object.txt  # copy/download a single object (defined by prefix path) from the data link to your localhost 
tw data-links cp /path/to/samplesheet.csv --datalink=<datalinkId> --workspace=<workspaceId> # upload a file from your localhost to the data link
tw data-links cp /path/to/folder --datalink =<datalinkId> --workspace=<workspaceId> --recursive # upload all files in a folder from your localhost to the data link
### Tasks
- [ ] https://github.com/seqeralabs/tower-cli/issues/405
- [ ] https://github.com/seqeralabs/tower-cli/issues/406
- [ ] https://github.com/seqeralabs/tower-cli/issues/407
- [ ] https://github.com/seqeralabs/tower-cli/issues/413
- [ ] https://github.com/seqeralabs/platform/issues/6457
ewels commented 6 months ago

Presumably tw data-explorer cp downloads to the current working directory? Could be nice to support alternative destinations too.. 🤔 Potentially as a separate tw data-explorer sync command, or a flag, or just a second positional argument..

pditommaso commented 6 months ago

For me it's a -1. Why bloating the CLI with this?

ewels commented 6 months ago

You could argue that it's not worth having a CLI at all, there's a perfectly good API!

Having it in the CLI makes it faster and easier to work with datasets from the terminal. It improves developer / user experience.

ewels commented 6 months ago

Also the technical reason for having it in the CLI for downloading files:

Presigned URLs expire after a short-ish window (@swampie thought it was 1 hour). If downloading a large dataset, the download could easily run for many hours. A generated bash script would therefore fail, however the CLI could request the presigned URLs one at a time in series, meaning that they're always fresh and continue to work.

evanfloden commented 6 months ago

Adding a usecase to download/list a dataset, with a flag to download/list the files inside the dataset csv/tsv/table. For example:

tw dataset cp <dataset_id> --files

This downloads the dataset table (csv/tsv) plus the files. In this way the user only has to be concerned with passing around the dataset object, and they can download/list the files at any time. Think a dataset can also be an output so it becomes a packaging mechanism.

Note: Today, the auth to access/download/list files in a dataset is not guaranteed as users can create whatever s3:// paths they want in a csv. This issue also exists when launching a pipeline.

pditommaso commented 6 months ago

Fair enough

swampie commented 5 months ago

for upload and download why using the seqera cli when you can use the standard cloud tooling?

ewels commented 5 months ago
mbosio85 commented 5 months ago

Considering the ongoing work to extend the Data Explorer availability to personal workspaces, these new CLI capability should be implemented for those as well.

swampie commented 5 months ago

I agree with Paolo that the complexity is not justified for the time being: open to discuss

evanfloden commented 5 months ago

Adding a very key point being lost here.

Our end users shouldn't need cloud console or cloud provider CLI access. They likely don't have cloud credentials. This is the point of having different roles with WS admins adding credentials and CEs.

End users want to upload data, run pipelines, and download results.

pditommaso commented 5 months ago

I agree 💯 that CLI should have first-class support. However, my understanding is that the feature highlighted here does not come for free, it may require some specific endpoints.

robnewman commented 3 months ago

Updated original request to match the Data Explorer data-links API endpoint name

robnewman commented 2 months ago

TBD - pagination is always returned by the API, need to account for this in the CLI commands.

jordeu commented 2 months ago

I feel a bit weird about naming this subcommand data-links. I've checked the data explorer's UI, and there, you can upload files without any mention of the "data link" concept. And you can create new "data links" also without any mention of that concept. Why should we use this name in the CLI?

The sub-title where you can list your "data links" says, "Browse remote data repositories and data for use in Seqera Cloud," with no reference to this "data link" concept. Overall, this "data link" concept is misleading.

I'd call it "data source", and then the command line can be tw data-source... with tw ds... alias. Also, the tw data-source add ... subcommand would be more meaningful.

But because naming is difficult and what sounds good to me may sound terrible to others, I suggest reviewing this naming before hardcoding it into the command line interface. Or at least, if "data link" is chosen as the best way of naming it, the web UI should be consistent and call that section "data links" instead of "data explorer" with explicit references to the "data link" concept when you add a new one.

robnewman commented 2 months ago

@jordeu Thanks for the feedback! The Data Explorer API endpoint is called data-links and we were being consistent with that. I think it would be more confusing to have the API endpoint named differently to the CLI interface (when both are publicly accessible). I agree that the term "data-link" is widely used internally but not directly surfaced externally. I would be in favor of explicitly referencing that term in our docs, but open to feedback.

weronikasosnowskaseqera commented 1 month ago

@robnewman we are missing here the method to list content

robnewman commented 1 month ago

@weronikasosnowskaseqera Please add. I wasn't necessarily comprehensive - just that the functionality needs to exist and reflect the API functionality.

canny[bot] commented 1 month ago

This issue has been unlinked from a Canny post: Add datasets directly from s3 / data explorer to the platform :cry: