nf-core / tools

Python package with helper tools for the nf-core community.
https://nf-co.re
MIT License

nf-core download should support private repos #2406

Open adamrtalbot opened 1 year ago

adamrtalbot commented 1 year ago

Description of feature

Currently, nf-core download breaks on private repos, mainly because it tries to download some files as compressed zips. We should support private GitHub repos where possible, authenticating using the existing GitHub mechanisms or generic Git (e.g. run git clone && zip). This mainly affects pulling the repository contents itself, but it may also affect configuration. This will enable users of private repos to stage them to an offline area.

ewels commented 1 year ago

huh, interesting. Does this URL not work for private repos? https://github.com/nf-core/tools/blob/e5ce6ce20304835bd40f102f038b7e1aadc888b2/nf_core/download.py#L384

I guess it's because we're doing a plain download. I wonder if we could get it via the (authenticated) API instead.. 🤔 Or at least add in some headers to the download to provide auth.

adamrtalbot commented 1 year ago

I'm not certain but I think it just does a requests.get

ewels commented 1 year ago

Yeah exactly, sorry I phrased it badly. I just meant does that URL exist with authentication on private repos, or is it just not there at all?

I think it's quite easy to extend requests.get to use auth. Maybe even as simple as requests.get(url, auth=(username, token))..? (see Stack Overflow)

We already have the username and token in code (if we were able to find them, e.g. because the user has the gh CLI installed or has set an env var). So then it would be trivial to add that in, and it would work with private repos. Maybe.
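A minimal sketch of what that could look like, assuming requests is available and the credential discovery described above has already happened (build_auth and download_repo_zip are hypothetical names, not the actual functions in nf_core/download.py):

```python
import requests


def build_auth(username=None, token=None):
    """Return an auth tuple for requests, or None for anonymous downloads."""
    if token:
        return (username or "", token)
    return None


def download_repo_zip(url, dest, username=None, token=None):
    """Fetch a repo zipball, passing HTTP basic auth only when a token is
    available, so public repos keep working anonymously."""
    response = requests.get(url, auth=build_auth(username, token), allow_redirects=True)
    response.raise_for_status()  # surfaces the 404 GitHub returns for private repos without auth
    with open(dest, "wb") as fh:
        fh.write(response.content)
    return dest
```

Public repos would take the `auth=None` path unchanged, so this should be backwards compatible.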

MatthiasZepper commented 1 year ago

Does it work on private repos with the --tower flag?

Because for Tower downloads, it runs git clone https://github.com/{self.pipeline}.git, where self.pipeline is e.g. nf-core/rnaseq. I do recall, however, that I had to use https://, because git clone git@github.com:{self.pipeline}.git always failed in the tests on GitHub Actions due to authentication issues (SSH basically does not allow anonymous downloads). If we started using SSH for every download, everyone would need to sign up at GitHub and create an SSH keypair just to download a public pipeline. So perhaps we should have a --private flag for the command to switch between the two methods?

adamrtalbot commented 1 year ago

Ahhhh, I looked at the code some more. So if the tower flag is false, you download the repo as a zip, but if it's true, you clone it. That's clever stuff.

adamrtalbot commented 1 year ago

So is there any reason not to use --tower?

MatthiasZepper commented 1 year ago

So is there any reason not to use --tower?

The main reason is the single cryptic package output that does not allow for easy config additions.

When you tw launch a file:/ pipeline, a temporary copy of the pipeline is created in the runfolder and then executed by Nextflow. This makes any config adaptations quite challenging, because the profiles and config files are part of the .git package.

So for every pipeline (and revision!), I need to get the respective configs and test files one by one, adapt the paths in them to the offline locations, and then save the output in a separate directory for Tower to access (or submit them to tw datasets):

git --git-dir="$DEPLOY_PIPELINES/rnaseq.git" show latest:conf/test.config | sed "s=https://raw.githubusercontent.com/nf-core/test-datasets=$DEPLOY_TESTDATA=g" > "$DEPLOY_PARAMS/rnaseq/test_miarka.yml" 

In contrast to that, the "classic" download is much more user-friendly. It creates a self-contained directory with the folders config, singularity-images and each pipeline revision. For each revision, the nextflow.config is already updated such that custom_config_base points to "${projectDir}/../configs/" and singularity.cacheDir to "${projectDir}/../singularity-images/":


├── 3_12_0
│   ├── assets
│   ├── bin
│   ├── conf
│   ├── docs
│   ├── lib
│   ├── modules
│   ├── subworkflows
│   └── workflows
├── 3_9
│   ├── assets
│   ├── bin
│   ├── conf
│   ├── docs
│   ├── lib
│   ├── modules
│   ├── subworkflows
│   └── workflows
├── configs
│   ├── bin
│   ├── conf
│   ├── docs
│   └── pipeline
└── singularity-images 

Since you can access everything with a file manager and inspect/edit the file contents with an editor, it is easily customizable.
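The config rewriting step of the classic download can be sketched like this (a simplified illustration only; the real edits in nf-core download are more involved, and relocate_config is a hypothetical name):

```python
import re


def relocate_config(config_text):
    """Point custom_config_base and the Singularity cacheDir into the
    self-contained download directory, as the classic download does."""
    config_text = re.sub(
        r'custom_config_base\s*=\s*"[^"]*"',
        'custom_config_base = "${projectDir}/../configs/"',
        config_text,
    )
    config_text = re.sub(
        r'cacheDir\s*=\s*"[^"]*"',
        'cacheDir = "${projectDir}/../singularity-images/"',
        config_text,
    )
    return config_text
```

Because the rewritten paths are relative to ${projectDir}, the whole download directory can be moved around on the offline system without further edits.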

In contrast, if you want to customize a released pipeline from a --tower download, you would need to commit the changes, delete the original tag and then move the tag to the newly created commit. So without Tower and its ability to e.g. manage sample sheets separately with tw datasets, a pipeline downloaded with --tower is just unwieldy.

stevekm commented 4 months ago

Why is this hard-coded to only support GitHub? There are plenty of other git hosting platforms that people may be using.

ewels commented 4 months ago

There are a few reasons why the pipeline download code specifically is quite tied to GitHub.

If there is interest in supporting other git hosting platforms for nf-core download, we could look into that. For example, trying to fetch the zip file and falling back to a git clone (and rm -rf .git?) if the download fails. That could also be a feasible solution for this issue, if we can assume that the user already has access to the private git repo configured locally.
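The fallback idea could look roughly like this sketch, where fetch_zipball is an injected, hypothetical helper: platforms without a zip endpoint (or private repos rejecting the anonymous download) simply raise, which triggers the clone path that relies on the user's locally configured credentials:

```python
import shutil
import subprocess
from pathlib import Path


def fetch_pipeline(repo_url, revision, outdir, fetch_zipball):
    """Try the hosting platform's zip download first; fall back to a shallow
    git clone and strip the .git directory afterwards."""
    outdir = Path(outdir)
    try:
        return fetch_zipball(repo_url, revision, outdir)
    except Exception:
        subprocess.run(
            ["git", "clone", "--depth", "1", "--branch", revision,
             repo_url, str(outdir)],
            check=True,
        )
        shutil.rmtree(outdir / ".git", ignore_errors=True)  # mimic rm -rf .git
        return outdir
```

The clone fallback deliberately shells out to git rather than reimplementing authentication, so whatever credential helpers the user has set up locally are honored.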

MatthiasZepper commented 4 months ago

If there is interest in supporting other git hosting platforms for nf-core download, we could look into that.

I hate to be a spoilsport, but my personal interest in looking into that is relatively low and since I was virtually the only one writing code for nf-core download in the past ~2 years, I don't see how this would materialize soon.

All nf-core pipelines are on GitHub. Ensuring that they download correctly, including the containers, is already a cat-and-mouse game, and it gets even harder if there is absolutely no standardization or uniformity in how the pipeline is built. Here be dragons is the level of liability I am willing to assume for anything that happens outside of nf-core. I shrug if somebody tries downloading a non-nf-core pipeline and then complains about inappropriate defaults (#2892), but I fear more of those issues will pop up if we open the floodgates a little wider by supporting additional git hosting platforms.

That being said, nf-core download is in dire need of a full rewrite, so if you, @stevekm, feel like getting invested in that, I am happy to provide guidance, review PRs and help out.

The coarse roadmap is that downloading pipelines and associated containers should become separate functions (#2408). The pipeline download would then be harmonized to use only the WorkflowRepo class, which increases the flexibility to support other Git hosting platforms as well as private and local repos (#2406/#2407). Tidying up the superclass SyncedRepo (#2940) is a paramount prerequisite to handle this reliably and cleanly with as little code duplication as possible. Once this is done, both the classic and the --platform downloads will be channeled through this class behind the scenes. Instead of downloading the zip files of the releases, the repo will be checked out for each revision into a temporary folder and, after the config edits are performed, copied to the output directory.
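The per-revision staging described above might be sketched as follows; revision_dirname matches the folder naming visible in the directory listing earlier (3.12.0 becomes 3_12_0), while stage_revisions is a hypothetical name and glosses over worktree cleanup:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def revision_dirname(revision):
    """Downloads store each revision in a folder with dots replaced by
    underscores, e.g. 3.12.0 -> 3_12_0."""
    return revision.replace(".", "_")


def stage_revisions(repo_dir, revisions, outdir):
    """Check out each revision of an already-cloned repo into a temporary
    worktree, then copy it (minus .git) into the output directory."""
    for revision in revisions:
        with tempfile.TemporaryDirectory() as tmp:
            workdir = Path(tmp) / "checkout"
            subprocess.run(
                ["git", "-C", str(repo_dir), "worktree", "add",
                 "--detach", str(workdir), revision],
                check=True,
            )
            shutil.copytree(
                workdir,
                Path(outdir) / revision_dirname(revision),
                ignore=shutil.ignore_patterns(".git"),
            )
```

Since the checkout happens in a temporary folder, the config edits can be applied there before the copy, keeping the cloned repo itself pristine for the next revision.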

Getting the separate container download to work will be more difficult, because a WorkflowRepo obviously contains all revisions and commits, but the information about which of them are relevant is no longer readily available. At the moment, the script downloads all container images that it can find in the specified revisions, so regardless of the chosen parameters, the image will be available. nextflow inspect, in contrast, requires all the configs and parameters to run. If you download n revisions of a pipeline, you might need to provide n revision-specific param files because of changing parameters. Because of this complexity, I am inclined to preserve the current functionality as a fallback option (but move the code over to the new CLI subcommand). Either way, we need to look into supporting Docker images (#2309), improving Singularity/Apptainer support (#1320, #2020, #2357), and making it play nicely with the new Wave community registry.

All of this would be nice to have, but I basically only get round to developing features that we as a facility need for our own egoistic purposes (#2938). Anything that we will never use ourselves unfortunately has a relatively low priority. Writing a pipeline might at least result in a paper and a few citations ✌😏.