Handle module / process imports

ewels commented 4 years ago

Lots of people use nf-core pipelines offline. We want to make the process of using modules from a different repository as simple as possible.

One solution would be to use git submodule to add nf-core/modules as a git submodule to every pipeline. By default, doing git clone will not pull the submodules. Doing git clone --recursive or git submodule update --init --recursive will pull the module repository.

Loading logic could then be:

Try to load the files locally - works if submodule is initialised. Fails otherwise.
If fails, try to load from the web
If fails, exit with an error

Then by default most people running online will pull the online files dynamically. But pulling a pipeline to use offline is super easy and does not require any changes to files or config.

Currently nf-core download manually pulls institutional config files and edits nextflow.config so that the pipeline loads these files. This could also be done with submodules as above, without any need to edit any files.

Limitations would be that we have to manage the git hash of the modules repository in two places - the git submodule file and the nextflow.config file. We can lint to check that these two are the same. Also, this forces pipelines to use a single hash for all modules in the pipeline. I think this is probably ok for reasons of maintaining sanity though.

Thoughts?

drpatelh commented 4 years ago

A naive implementation is mentioned here https://github.com/nf-core/modules/issues/3, and follows the current procedure we use for nf-core/configs.

Excuse my simple brain but would the git submodule approach allow us to also dynamically obtain the latest hash for the modules repo so we dont have to update it manually in nextflow.config (or point out that it needs to be updated as part of the linting)?. For example, when we release a pipeline the hash will need to remain static to use those particular versions of the module files.

We will definitely need to have a more fine grained control over versioning for modules compared to the configs repo.

ewels commented 4 years ago

No, we will still have to manually update it in nextflow.config for the remote loading to be done (step 2 in my workflow above, if not cloned recursively). This is as your example in #3.

Prepping an example now, will make more sense when you see it hopefully 👍

ewels commented 4 years ago

Example pipeline repo: https://github.com/nf-core/dsl2pipelinetest

To be combined with code in #9

ewels commented 4 years ago

Ok, so discussion with @pditommaso on gitter refers to this [link]

it's not a good idea put the modules as a subtree in the project but in a nutshell, it's like having the complete copy in your working tree but still you have the ability to sync with the remote one

So the suggestion is that we never load remote files here - we just always include the entire nf-core/modules repo in every pipeline.

Pros:

No change to current behaviour - git clone and nextflow pull is super simple
Will work offline without any issues

Downsides:

Repos could get pretty big, if nf-core/modules gets big
..?

I personally like this 😉

pditommaso commented 4 years ago

A possible problem including nf-core/modules is that you will need to update all modules altogether, that still could be a strategy.

if you want to control the version of each module independently you should include each of the as a separate substree.

ewels commented 4 years ago

So one major downside - with git submodules you have a nice .gitmodules file that explicitly shows what commit hash you currently have:

[submodule "modules"]
    path = modules
    url = https://github.com/nf-core/modules.git

Subproject commit a88b867498d783a84ec659017ed94ee2acaaa22b

With git subtree everything is in one repo, so it's much more difficult to inspect at what commit the modules repo is at. I think the only way to do it is by inspecting the commit messages in git log...

ewels commented 4 years ago

A possible problem including nf-core/modules is that you will need to update all modules altogether, that still could be a strategy.

Yes - I think that having everything at one commit is a sacrifice worth making for simplicity though. In fact I'd prefer to enforce this as otherwise stuff could get very confusing quickly.. 😝

ewels commented 4 years ago

Downsides of submodules is that people using git clone will hit problems as they have to use --recursive. Nextflow can handle this with nextflow pull, so that's fine. The GitHub zip file download probably also won't work (used by nf-core download as well as being a visible button on the repo web pages).

apeltzer commented 4 years ago

Can we find a way to resolve the latter issue somehow else with nf-core download? E.g. by locally doing a git pull recursive / using submodules, packaging that up in nf-core download and providing it to the user for copying it over?

ewels commented 4 years ago

Yes, it's pretty easy to fix with nf-core download by refactoring that code 👍 - but the download button on GitHub won't work still. Fairly minor problem though. I think I'm most keen on the submodules now with the more explicit and traceable lock on the modules remote repo. I think it will just be too easy to mess stuff up with the subtree 😟

apeltzer commented 4 years ago

I agree - don't care too much about the Github download button either as we provide a proper alternative and can document that as well :+1:

ewels commented 4 years ago

@aunderwo - I'd be curious to hear your thoughts on this one! Just reading your blog post where you mention git subrepo..

junjun-zhang commented 4 years ago

A possible problem including nf-core/modules is that you will need to update all modules altogether, that still could be a strategy.

if you want to control the version of each module independently you should include each of the as a separate substree.

We are taking a different approach to import remote modules that addresses the above concern and does allow us to version control each module independently. Here are the modules, and here is how these modules are imported into the pipeline repository, basically they are materialized locally (needs to run something similar npm install to import/sync module files).

Since the module files are local, same as other normal files in the pipeline's git repo, once sync'd and committed to git, there is nothing additional needed to make the pipeline work.

ewels commented 4 years ago

Hi @junjun-zhang,

This is definitely an interesting idea.. So this would involve creating a new subcommand for the nf-core helper tool I guess, which would manage the copying / updating of process files. I guess we could also add lint tests to ensure that the code for the processes is not modified away from the upstream repository at all. It would certainly make the pipelines easier to read / understand too..

Phil

apeltzer commented 4 years ago

Yes, a very interesting approach! The only drawback I see with it is, that we need to use a separate tool/method to handle this. A user that only uses nextflow on a cluster that e.g. cannot be configured by the user might have some issues as such an installation involves talking to IT / HPC admins first... which the plain usage of Nextflow does not require. Other than that I can only follow the comment from Phil - would make a lot of standard things easier ...

ewels commented 4 years ago

No, I'm not sure that this is correct - here the nf-core helper command would only be needed by pipeline developers when editing the pipeline. The process code would be flat files within the pipeline repository, so nothing special for the end user (in fact, even less than using git submodules).

apeltzer commented 4 years ago

Ok I guess I didn't understand it then yet entirely - after reading again, I think I understood it now as well. I think nf-core tools extension is the way to go then. Fully flexible & we can expect developers to be able to do this when doing the dev work - for users it doesn't interfere at all 👍

ewels commented 4 years ago

Following the logic of the npm way of doing things, I guess we could then have a meta information files with the details of where each process comes from.. eg. processes.json that has the name, version, repo, hash & path for each package that has been pulled in (maybe we don’t need all of these fields? Maybe I’m missing some?).

@junjun-zhang how do you handle the management of these imported files?

junjun-zhang commented 4 years ago

@ewels what you mentioned is possible, one way or the other those information is useful to keep for dependency management. I am fairly new to Nextflow, still trying to learn more, so just to share my own thoughts here. What we are experimenting is something quick and simple but well-supports one of the most important features - explicit declaration of module dependencies down to specific versions. This is to fulfill the ultimate goal of reproducible / version controlled pipeline builds. At this point, our plan is to write a simple script (likely in Python) to detect dependencies of remote modules by searching for lines starting with include "./modules/raw.githubusercontent.com/xxxxx" in Nextflow pipeline code, then fetch the contents and store them under the local modules folder. Of course, this is very preliminary and basic, locking module content down with git commit hash etc would be great future improvement.

Dependency management is an important feature for any programming language, GO Language initially did not have good support for it. There have been numerous solutions developed by the GO community until the introduction of GO Modules. Some blogs might be interesting to read: here, here and here. I am not suggesting to take the same approach as GO Modules, but it's certainly a great source of inspiration. Ultimately, I think it's up to Nextflow language to choose it's own official approach for dependency management. For that, I'd like to hear how others think, particularly @pditommaso

pditommaso commented 4 years ago

Since there are many package managers out, there's nothing that could be used to manage NF module assets? I was even thinking to use npm. Would that be such crazy?

antunderwood commented 4 years ago

@aunderwo - I'd be curious to hear your thoughts on this one! Just reading your blog post where you mention git subrepo..

I have found subrepo (wrapper around git subtree) a more transparent way of dealing with modules particularly since the files pulled in via subrepo are not links

junjun-zhang commented 4 years ago

Since there are many package managers out, there's nothing that could be used to manage NF module assets? I was even thinking to use npm. Would that be such crazy?

That seems a bold idea, don't know npm enough to make further comment. Leveraging existing solutions is definitely plausible. Might conda be another possible option? Here is how Conda describes herself:

Package, dependency and environment management for any language—Python, R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, FORTRAN, and more.

Nextflow could possibly be added to the above list?

pditommaso commented 4 years ago

Not an expert either, but my understanding is that npm allow managing any asset irrespective of the prog lang.

Conda would be even better since is very well know too for the bioinfo community, however, I'm sure it allows the copying of the module files in the project directory as we need. I think it's designed to keep them in a conda central directory which would not work for our use case. But, I repeat, I not 100% about this point.

ewels commented 4 years ago

I feel that conda might be a little bit confusing, as all of these tools are possible to install via other conda channels. I can imagine doing conda install bwa and it copying a few nextflow files somewhere random. Also as you say, I think conda always keeps stuff in its own environment directories. npm install works in the current working / project directory though, which is probably more what we want.

ewels commented 4 years ago

One down side of npm is that each package needs to be within its own git repository. This isn't necessarily all bad (we've discussed doing this before anyway). On the plus side, we can publish it within a nextflow / nf-core scope which would make the installation names pretty clear.

ewels commented 4 years ago

The more I think about this, the more I think that we should copy the approach of npm / bioconda but build this functionality in to the nf-core tools package. This is already a dependency, so doesn't add any complexity for developers, and means that we have complete control of how we want the system to work.

This is of course less good as a general nextflow (not nf-core) option, but I think that maybe that is ok for now.

pditommaso commented 4 years ago

Thought having an ad-hoc nf-core package manager tool surely streamline the experience for the final user, I would suggest resisting the temptation to create yet another package manager and related specification (metafiles? how to manage releases? version numbers, etc).

Maybe a compromise could be to implement a wrapper over an existing package manager to simplify/hide the interaction for the final user and at the same rely on a well-established package managing foundation.

The external dependency with conda/npm/etc I don't think it's so critical because the module files would be in any case included in the GH project repository. Therefore the pipeline user would not need to use any third-party package manager. It would only be required by the pipeline curator when update/sync the deps.

ewels commented 4 years ago

Yes this was my initial thought as well. But I still see two main drawbacks:

We will have to have a separate repository for every tool / process we want to import
We have a lot of pipeline curators! So the extract dependencies are still a little irritating.. (though not too bad and if commands are wrapped then I agree it's not a big deal)

maxulysse commented 4 years ago

I like the conda idea, I think that it'll fit well within the bioinfo community as well

olgabot commented 4 years ago

Conda makes a lot of sense since many people (not including me) have submitted bioconda recipes and there's already some tooling there we can use

drpatelh commented 4 years ago

I realise it has its advantages but Im not too keen on having a separate repository for each module because:

We will most likely have 100s-1000s of modules
It will make things more difficult to find, track, update and maintain

Im not entirely sure how we could make modules fit within the Conda ecosystem so if anyone has any ideas as to a more formal implementation that would be useful to hear 👍

maxulysse commented 4 years ago

I think the way bioconda handles everything in one repo is a proof that it can be done.

pditommaso commented 4 years ago

Still not sure that Conda can stage artefacts in the project directory, instead of using its own managed directory. Does anybody know about that?

grst commented 4 years ago

I have been experimenting with it. I think conda would be feasible.

Installing into the project directory is possible with conda create -p ./modules <package>.
Every module could have a bioconda recipe named nextflow-XXX

Here is an example meta.yml recipe to build the fastqc module from nf-core/modules:

package:
  name: nextflow-fastqc 
  version: "0.0.1"

build: 
  script: mkdir $PREFIX/nextflow && cp -R tools/fastqc $PREFIX/nextflow

source:
  url:
    - https://github.com/nf-core/modules/archive/master.zip

Build the package:

conda build nextflow-fastqc

Install the package in the ./modules directory:

conda create -p ./modules grst::nextflow-fastqc

The ./modules directory now looks like this:

modules
├── conda-meta
│   ├── history
│   └── nextflow-fastqc-0.0.1-0.json
└── nextflow
    └── fastqc
        ├── main.nf
        ├── meta.yml
        └── test
            ├── main.nf
            └── nextflow.config

Installing multiple nextflow-xxx packages would be no problem, and conda would take care of versions.

maxulysse commented 4 years ago

That's actually a good point.

I'm guessing it should be possible to use Conda to copy files from GitHub.

I just tink that conda would be easier than npm to use and will have the advantage of already being used by a majority of bioinformaticians.

Hypothetically, I would see a modules.yml file like that:

name: nf-core-sarek-modules-3.0
channels:
  - nf-core-modules
dependencies:
  - bwa=0.7.17
  - gatk4-spark=4.1.4.1
  - tabix=0.2.6
  - samtools=1.9
  - nf-core-header=0.2
  - nf-core-sarek=3.0

and with a command similar to conda create modules.yml I would get all my modules with the right versions in the current directory.

But I am no expert at all neither with Conda, nor with npm...

EDIT: Now that I have seen @grst reply, I'm getting more and more convinced it could be possible

pditommaso commented 4 years ago

Important: my suggestion is to use Conda (or npm, etc) to allow the pipeline developer to fetch one or more specific modules/versions and include them in the pipeline project.

The pipeline user is not expected to have any interaction with the package manager since it will get the modules along the pipeline using the usual nextflow pull/run commands.

Think it's clear, just to make sure all agree on this.

ewels commented 4 years ago

Nice! I didn't know about the -p flag for conda, thanks @grst - I was on the same page as @pditommaso and didn't think it would be possible.

Two thoughts:

I don't think that we will be able to / should host these wrapper with bioconda
- It will be very confusing for people to do conda search bwa and get a combination of tool installations and wrappers.
- It's a comparatively small community of people who will need to use conda here, only pipeline developers
  - Bioconda is for packaged software, which this is not
I think that we should prefix package names, even in a custom channel
- eg. nf-bwa will help to keep the package name as unique as possible, and obvious that this is a wrapper and not the tool itself.

I think we probably want to go down the route of making a new conda channel, see docs. Hopefully we can get this hosted on anaconda.org for free still, maybe under a nf-core namespace.

And yes - to confirm, the idea would be that devs run these conda commands and then the imported files are kept under version control with the rest of the pipeline code. Feels a little dirty, but I think it has to be this way.

Phil

pditommaso commented 4 years ago

+1 to have a dedicated nf-core channel on Conda (!)

grst commented 4 years ago

What's the rationale of not having the user download the packages? To me that would feel like the cleaner solution... And the conda command could probably easily be wrapped into nf-core downlaod or even nextflow pull.

I agree that a dedicated conda channel is probably the best solution. The advantage I see with bioconda is that we could take advantage of their already established bot system for automated builds.

ewels commented 4 years ago

What's the rationale of not having the user download the packages?

Currently the only dependencies are Java + Nextflow, plus some kind of software. Building this fetch in adds a dependency for conda for all users. It also complicates things for running offline (this could be helped by custom tools such as nf-core download as you say).

By keeping this to developers only and keeping all required nextflow source code in the repo, we add no dependencies and no extra complexity. All systems continue to work as they are, nothing changes.

In contrast, the only advantage of getting the user to pull the wrappers that I can see is that the version control history is cleaner. In my opinion that's a fairly minor thing and much less important than ease of use.

ewels commented 4 years ago

I made an nf-core organisation on anaconda cloud: https://anaconda.org/nf-core so it's there if we want it.

pditommaso commented 4 years ago

Awesome, think there are all the pieces for a pilot module!

ewels commented 4 years ago

I'm still slightly skeptical about how much work is needed to build the kind of infrastructure that bioconda has to manage the automation of packaging for conda. But I agree that this seems like a nice path forward.

To play devil's advocate, I think that there's still an argument for making our own custom system:

Bioconda has to host big assets with non-negligible file sizes for multiple platforms, we don't - we just have a handful of very short text files (curl from github is no problem)
Conda is good at handling nested dependencies, which is super tricky. I don't think that our nextflow wrappers will ever have the same kind of dependency network (?), so we probably don't need this functionality.
We will potentially need to write and maintain a lot of code to handle the maintenance of the anaconda cloud channel, in the form of CI scripts and packages. Like, seriously, the simplicity of adding a bioconda package totally does not represent the complexity of the back end that powers it.

I think the only viable alternative is to write something new in the nf-core package. I know that this isn't popular as it's making yet another packaging tool. But it could be a hell of a lot simpler and easier to write / manage:

We have a metadata file in the pipeline that tracks each imported file and the git hash it comes from in the modules repo
nf-core modules install / update etc searches GitHub modules repo, pulls the one small text file and saves it, updates the metadata file.
nf-core lint deletes all local module code and pulls again according to the metadata file. Any diffs will indicate that the module code has been tampered with and is dirty.
If we want to be able to support tool versions as well as just git commit hashes, we could easily automate a metadata file in the modules repo that lists all tools and all versions that they have ever had, with corresponding repo hashes. This would just be a case of looping through commits and watching when the tool metadata file updates the version number.

In contrast, I think I could probably write this code in a morning (or two). We have no new dependencies (local, or web-packaging wise, eg. anaconda cloud). By mimicking other tools the familiarity in functionality and usage would be comparable so devs wouldn't really have to learn anything new. It would also probably be easier to test and lint.

I'm being a little provocative here deliberately as I think this is a really important decision. I would appreciate counter-arguments, especially pointing out any concrete advantages that using conda / alternatives have over the new hand-coded option described above.

drpatelh commented 4 years ago

I agree that both Conda and a custom package manager are viable options :+1: and that it may be overkill and possibly alot of time (we dont have) to get everything set-up properly on the Conda back-end.

@ewels would the custom option work for developers that want to use our modules in the general Nextflow community. I feel this is quite an important point. If we are going through this much effort to get it right it should work for everyone.

grst commented 4 years ago

@ewels, I can see your point and it might indeed be overkill.

Here are some points in favor of conda:

Bioconda has to host big assets with non-negligible file sizes for multiple platforms, we don't - we just have a handful of very short text files (curl from github is no problem)

Arguably, the modules won't become big, but they could consist of several files (e.g. helper processes, scripts in a bin folder, ...). This can, of course, be handled by a custom script but with conda it would work with no additional effort.

Conda is good at handling nested dependencies, which is super tricky. I don't think that our nextflow wrappers will ever have the same kind of dependency network (?), so we probably don't need this functionality.

Is that really the case? I would envisage larger modules ("sub-workflows") to exist that depend on basic modules (e.g. a DNA-seq subworkflow that depends on some QC and alignment modules. The DNA-seq module could then become part of, e.g., variant-calling pipeline).

Especially, as nf-core/modules grows, this could become more and more tricky and conda is proven to handle that well.

We will potentially need to write and maintain a lot of code to handle the maintenance of the anaconda cloud channel, in the form of CI scripts and packages. Like, seriously, the simplicity of adding a bioconda package totally does not represent the complexity of the back end that powers it.

I don't think it's that bad. A minimal working solution would run conda build on each recipe that was modified. I could probably implement that also in "a morning (or two)" in Github actions. I don't say that it can't become more complicated because, as always, there will be some caveats, but this is at least as true for a homebrewed solution.

We have a metadata file in the pipeline that tracks each imported file and the git hash it comes from in the modules repo

I don't think a git repository is good for keeping track of "released versions". Yes, there are tags and releases, but we want individual releases for each module. For this to work, we would at least require some external system that links a certain version number of a module to the corresponding commit hash.

ewels commented 4 years ago

For subworkflows - I guess I envisioned workflows to always their own repository, and this to only ever be for processes. But yes, if we're generalising beyond the strict confines of nf-core then this is of course a possibility and a potentially powerful tool.

I am definitely encouraged by your worked example above @grst - do you think you could do a draft PR to start to sketch out the build and push to anaconda cloud? Let me know if you'd like me to add you as to the nf-core anaconda cloud organisation.

Even if we use conda, I think it could still be good to wrap the conda commands in nf-core tools as I could imagine people (me) missing the critical -p flag on a regular basis..

For this to work, we would at least require some external system that links a certain version number of a module to the corresponding commit hash.

We kind of do this on the nf-core website with pipelines already, but it's more for convenience only as GitHub is the real store of this information in the pipeline releases.

ewels commented 4 years ago

@pditommaso - what do you think about building some of this functionality into nextflow itself? We can already do nextflow pull to maintain a cache of workflows. We could conceivably wrap the kind of behaviour described above in to nextflow too - this may be a way to avoid committing the imported process code in to the workflow git repository (would need some thought for running offline however).

pditommaso commented 4 years ago

what do you think about building some of this functionality into nextflow itself?

Maybe in the future, but surely not in the next 12 months. I agree a wrapper could be convenient over another tool could be convenient to simplify the quick start for novice users.

At the same time, I see the benefits of adopting a well know package manager because they have already solved many of these problems (versioning, checksum verifications, dependency tree management). This is the classic problem that at the beginning looks simple but soon escalate in something much more complex. Moreover adopting an established platform allows you to benefits from the existing ecosystem. For example, GitHub could host npm packages.

Last thing, I'm not getting what big assets you are referring in this statement

Bioconda has to host big assets with non-negligible file sizes for multiple platforms, we don't - we just have a handful of very short text files (curl from github is no problem)

I think Conda package for a NF module, would only require a yaml metadata file and the tar of the module itself.

grst commented 4 years ago

I am definitely encouraged by your worked example above @grst - do you think you could do a draft PR to start to sketch out the build and push to anaconda cloud? Let me know if you'd like me to add you as to the nf-core anaconda cloud organisation.

If conda is the way to go now, I could try to put something together.

In that case a question is if we want to

have conda recipes in a different repository (e.g. nextflow-modules-recipes)
automatically generate conda recipes from the yaml documentation. (No boilerplate code, but no possibility for customizing the build process should that ever become relevant).
turn the yaml module documentation into a conda recipe (see https://github.com/nf-core/modules/issues/1#issuecomment-585075793) (a bit of conda-boilerplate, but full flexibility)

ewels commented 4 years ago

@pditommaso:

For example, GitHub could host npm packages.

Ooh, interesting thought! Especially as it looks like GitHub npm registry doesn't have the same requirement as the main npm registry of having one repository per package: GitHub npm docs. So that becomes a viable option again.

I think Conda package for a NF module, would only require a yaml metadata file and the tar of the module itself.

Yup, I think we are saying the same thing. My point was that regular bioconda software packages have assets and that we don't.

@grst:

I think it would be cool if you could put something together. I think that's the only way that this will move forward now - if we start sketching out functioning skeletons for one or more options. Conda seems like the most viable to me right now.

have conda recipes in a different repository (e.g. nextflow-modules-recipes)

I don't really understand your question here? My thinking was that each recipe would be in a directory of this repository (nf-core/modules).

conda recipes yaml + meta / docs

I started writing that we could probably build this in to the conda yaml, then saw that your final bullet point was suggesting this! Yes I think it's probably a better idea, instead of using a new custom yaml format..

nf-core / modules

Handle module / process imports #8