nf-core / tools

Python package with helper tools for the nf-core community.
https://nf-co.re
MIT License
218 stars 182 forks source link

installing missing modules from other repos #2497

Open anoronh4 opened 8 months ago

anoronh4 commented 8 months ago

Description of feature

i am attempting to install a subworkflow from a custom repository, but i am having issues installing the missing modules from the nf-core/modules repo:

$ nf-core subworkflows  --branch develop --git-remote git@github.com:mskcc-omics-workflows/modules.git install bwa_markdup_bqsr

                                          ,--./,-.
          ___     __   __   __   ___     /,-._.--~\
    |\ | |__  __ /  ` /  \ |__) |__         }  {
    | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                          `._,._,'

    nf-core/tools version 2.10 - https://nf-co.re

INFO     Installing 'bwa_markdup_bqsr'                                                                                                                                           
ERROR    Module 'bwa/mem' not found in list of available modules.                                                                                                                
INFO     Use the command 'nf-core modules list' to view available software                                                                                                       
ERROR    Module 'gatk4/markduplicates' not found in list of available modules.                                                                                                   
INFO     Use the command 'nf-core modules list' to view available software                                                                                                       
ERROR    Module 'gatk4/markduplicates/spark' not found in list of available modules.                                                                                             
INFO     Use the command 'nf-core modules list' to view available software                                                                                                       
ERROR    Module 'gatk4/applybqsr' not found in list of available modules.                                                                                                        
INFO     Use the command 'nf-core modules list' to view available software                                                                                                       
ERROR    Module 'gatk4/applybqsr/spark' not found in list of available modules.                                                                                                  
INFO     Use the command 'nf-core modules list' to view available software                                                                                                       
ERROR    Module 'gatk4/baserecalibrator' not found in list of available modules.                                                                                                 
INFO     Use the command 'nf-core modules list' to view available software                                                                                                       
ERROR    Module 'gatk4/baserecalibrator/spark' not found in list of available modules.                                                                                           
INFO     Use the command 'nf-core modules list' to view available software                                                                                                       
ERROR    Module 'samtools/index' not found in list of available modules.                                                                                                         
INFO     Use the command 'nf-core modules list' to view available software                                                                                                       
INFO     Use the following statement to include this subworkflow:                                                                                                                

 include { BWA_MARKDUP_BQSR } from '../subworkflows/msk/bwa_markdup_bqsr/main'                                                                                                                                                                                                                                                                               

the subworkflow installs but I get a few errors telling me i should install certain modules. however when i attempt to the install them from nf-core i get weird and uninformative errors:

$ nf-core --verbose modules install -d . gatk4/markduplicatesspark

                                          ,--./,-.
          ___     __   __   __   ___     /,-._.--~\
    |\ | |__  __ /  ` /  \ |__) |__         }  {
    | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                          `._,._,'

    nf-core/tools version 2.10 - https://nf-co.re

Pulling from 'nf-core/modules' (https://github.com/nf-core/modules.git) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Waiting for responseDEBUG    Popen(['git', 'version'], cwd=/home/noronhaa/.config/nfcore/nf-core/modules, universal_newlines=False, shell=None, istream=None)                              cmd.py:931
DEBUG    Popen(['git', 'fetch', '-v', '--progress', 'origin'], cwd=/home/noronhaa/.config/nfcore/nf-core/modules, universal_newlines=True, shell=None, istream=None)   cmd.py:931
DEBUG    Popen(['git', 'checkout', 'master'], cwd=/home/noronhaa/.config/nfcore/nf-core/modules, universal_newlines=False, shell=None, istream=None)                   cmd.py:931
DEBUG    Popen(['git', 'merge', 'origin/master'], cwd=/home/noronhaa/.config/nfcore/nf-core/modules, universal_newlines=False, shell=None, istream=None)               cmd.py:931
DEBUG    Using config file: /home/noronhaa/.config/nfcore/nf-core/modules/.nf-core.yml                                                                              utils.py:1010
DEBUG    Using config file: .nf-core.yml                                                                                                                            utils.py:1010
ERROR    'bwa/mem'                                                                                                                                                __main__.py:658

i may be wrong, but i think this is happening due to the recreate_dependencies code: https://github.com/nf-core/tools/blob/9ab896c140ee3fde816d250f3d8353aee5547d78/nf_core/modules/modules_json.py#L1158-L1181 -- where it keeps parsing the subworkflow file for dependencies and failing when the dependency cannot be found -- even though i'm trying to install the dependency.

I know that nf-core is currently not supporting hybrid-repo functionality. still, i was wondering if it might make sense to skip or throw a warning, rather than make the tool quit altogether. OR filter out dependencies that require modules from other repos and exclude them from any checking? not sure how this will play out in the broader sense.

awgymer commented 8 months ago

So I think that the actual errors in the second part are unrelated to the warnings from the first.

One thing I am intrigued by though is gatk4/markduplicates/spark. There shouldn't be the possibility of a module being nested more than once so I don't understand where the second slash comes from. And indeed the module in nf-core is gatk4/markduplicates

Is there any chance you can share the subworkflow main.nf file?

awgymer commented 8 months ago

Are you able to supply your modules.json file as it appears after the subworkflow install but before the attempted module install?

anoronh4 commented 8 months ago

So I think that the actual errors in the second part are unrelated to the warnings from the first.

One thing I am intrigued by though is gatk4/markduplicates/spark. There shouldn't be the possibility of a module being nested more than once so I don't understand where the second slash comes from. And indeed the module in nf-core is gatk4/markduplicates

Is there any chance you can share the subworkflow main.nf file?

The module isn't actually nested more than once, nf-core/tools is just parsing it wrong when it comes to spark tools for some reason? Here's my imports in the main.nf file:

https://github.com/anoronh4/demo_hybrid/blob/4f8161e2ab8f2c241dd84b72ba2f456d03eefdf1/subworkflows/msk/bwa_markdup_bqsr/main.nf#L1-L8

It also didn't allow me to write gatk4/markduplicatesspark in the meta.yml, it gave a linting error unless i added the extra slash.

https://github.com/anoronh4/demo_hybrid/blob/4f8161e2ab8f2c241dd84b72ba2f456d03eefdf1/subworkflows/msk/bwa_markdup_bqsr/meta.yml#L10-L18

you think that is the issue? i thought it was separate but i can look at that further.

anoronh4 commented 8 months ago

Are you able to supply your modules.json file as it appears after the subworkflow install but before the attempted module install?

Sure!

{
    "name": "nf-core/stuff",
    "homePage": "https://github.com/nf-core/stuff",
    "repos": {
        "git@github.com:anoronh4/demo_hybrid.git": {
            "subworkflows": {
                "msk": {
                    "bwa_markdup_bqsr": {
                        "branch": "main",
                        "git_sha": "71bf8700fa52a3e600390454dd75a9b3a30e4a89",
                        "installed_by": [
                            "subworkflows"
                        ]
                    }
                }
            }
        },
        "https://github.com/nf-core/modules.git": {
            "modules": {
                "nf-core": {
                    "custom/dumpsoftwareversions": {
                        "branch": "master",
                        "git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
                        "installed_by": [
                            "modules"
                        ]
                    },
                    "fastqc": {
                        "branch": "master",
                        "git_sha": "bd8092b67b5103bdd52e300f75889442275c3117",
                        "installed_by": [
                            "modules"
                        ]
                    },
                    "gatk4/applybqsr": {
                        "branch": "master",
                        "git_sha": "516189e968feb4ebdd9921806988b4c12b4ac2dc",
                        "installed_by": [
                            "modules"
                        ]
                    },
                    "multiqc": {
                        "branch": "master",
                        "git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
                        "installed_by": [
                            "modules"
                        ]
                    }
                }
            }
        }
    }
}

After this step i get:

$ nf-core modules install bwa/mem

                                          ,--./,-.
          ___     __   __   __   ___     /,-._.--~\
    |\ | |__  __ /  ` /  \ |__) |__         }  {
    | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                          `._,._,'

    nf-core/tools version 2.10 - https://nf-co.re

ERROR    'modules'                                     
awgymer commented 8 months ago

I have a feeling that GATK_MARKDUPLICATES_SPARK is an issue and if that's in the modules repo we may have a problem. The code that parses these things is "dumb" and I think it just splits the name on underscores and replaces with slashes

awgymer commented 8 months ago

Back to the actual issue at hand.

The ERROR 'modules' comes from the fact there is no modules section under your remote in the modules.json.

So I tried adding it and bwa/mem is installed but with a rather unhelpful side-effect. The entire custom remote is dropped:

{
    "name": "nf-core/testpipeline",
    "homePage": "https://github.com/nf-core/testpipeline",
    "repos": {
        "https://github.com/nf-core/modules.git": {
            "modules": {
                "nf-core": {
                    "bwa/mem": {
                        "branch": "master",
                        "git_sha": "516189e968feb4ebdd9921806988b4c12b4ac2dc",
                        "installed_by": ["modules"]
                    }
                }
            },
            "subworkflows": {
                "nf-core": {}
            }
        }
    }
}

Would you be able to copy a module used in the subworkflow into the modules directory of that remote by any chance so I can test something (I don't have a test remote of my own handy sorry)?

anoronh4 commented 8 months ago

i've opened a different branch called one_module and added bwa/mem and bwa/index (so technically ended up being two modules haha). and the subworkflow was changed to use the new copy of bwa/mem. i think i know what you're getting at but i think the result will be similar: "modules" key will no longer cause failure but there will be another keyerror for every module name that is not from my repo. so the error will look something like ERROR 'gatk4/markduplicates'. That's caused by this line: https://github.com/nf-core/tools/blob/9ab896c140ee3fde816d250f3d8353aee5547d78/nf_core/modules/modules_json.py#L1169

and here's the updated modules.json

the underlying issue here is that there's no way to tell nf-core where a given module ACTUALLY came from -- in fact nf-core/tools ignores org_path (msk vs nf-core) information in the subworkflow definition. the modules.json is actually highly specific but grabbing the dependencies of the subworkflow is not. if we had more specific dependency definition i think that might be more robust way to handle them.

anoronh4 commented 8 months ago

after adding the new subworkflow my modules.json file looks like this:

{
    "name": "nf-core/stuff",
    "homePage": "https://github.com/nf-core/stuff",
    "repos": {
        "git@github.com:anoronh4/demo_hybrid.git": {
            "modules": {
                "msk": {
                    "bwa/mem": {
                        "branch": "one_module",
                        "git_sha": "acc4f68fbfda2075974c90587a20b693d0a64ebb",
                        "installed_by": [
                            "bwa_markdup_bqsr"
                        ]
                    }
                }
            },
            "subworkflows": {
                "msk": {
                    "bwa_markdup_bqsr": {
                        "branch": "one_module",
                        "git_sha": "acc4f68fbfda2075974c90587a20b693d0a64ebb",
                        "installed_by": [
                            "subworkflows"
                        ]
                    }
                }
            }
        },
        "https://github.com/nf-core/modules.git": {
            "modules": {
                "nf-core": {
                    "custom/dumpsoftwareversions": {
                        "branch": "master",
                        "git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
                        "installed_by": [
                            "modules"
                        ]
                    },
                    "fastqc": {
                        "branch": "master",
                        "git_sha": "bd8092b67b5103bdd52e300f75889442275c3117",
                        "installed_by": [
                            "modules"
                        ]
                    },
                    "gatk4/applybqsr": {
                        "branch": "master",
                        "git_sha": "516189e968feb4ebdd9921806988b4c12b4ac2dc",
                        "installed_by": [
                            "modules"
                        ]
                    },
                    "multiqc": {
                        "branch": "master",
                        "git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
                        "installed_by": [
                            "modules"
                        ]
                    }
                }
            }
        }
    }
}

and trying to install anything else indeed gives me: ERROR 'gatk4/markduplicates'

awgymer commented 8 months ago

Ok yes. You have nailed the issue here. That get_components_to_install makes the blind assumption that any subworkflows or modules it finds are being installed from the same remote.

I would fully support a better way of defining dependent modules, but it would require buy-in from both tools devs and the wider nf-core community, because it would affect how subworkflows are written and maintained.

In theory we already track this information in the meta.yml (i.e. for bam_create_som_pon_gatk):

components:
  - gatk4/mutect2
  - gatk4/genomicsdbimport
  - gatk4/createsomaticpanelofnormals

I would be in favour of leveraging this to determine the dependencies, but also to allow us to provide remotes.

The major issue within nf-core of course would still be testing, whilst I appreciated your solution I'm not sure it is really workable for everyone.

anoronh4 commented 8 months ago

I've thought about what it would be like to use the meta.yml as well, something like this:

components:
  - name: gatk4/mutect2
  - name: gatk4/genomicsdbimport
    org_path: custom-remote
    git-remote: https://github.com/custom-remote/modules.git

where git-remote and org_path are optional and assumed to be "self" if not explicitly defined. commit sha and branch can be optionally defined as well, but otherwise can have default behavior. it would be the burden of whoever is adding or updating the subworkflow to add this information if they care about it.

i think this approach may actually be achievable. but even if it does not have buy-in from the tools devs and the wider community, i still think it would be beneficial to everyone to handle missing dependencies more gracefully. currently the error is somewhat unhelpful and doesn't even state the subworkflow that i have to remove if i want to fix the problem. nf-core lint could be used, for example, to assert the missing dependencies and show a test fail/warn -- rather than a tools error. how do we find out what the tools devs are willing to green-light? at least to get some closure on this discussion, if not move forward with a new solution.

agree that what i came up with for testing is not ideal and quite hacky, but by allowing users to start defining dependencies, i think that's when other testing solutions might start to follow. if no one can use hybrid subworkflows then no one will tackle the question of testing it either.

anoronh4 commented 8 months ago

i have outlined some steps that i think are achievable:

  1. Change "components" section of meta.yml a. Use something like above proposed structure b. Linting of meta.yml should be changed to check for something like the above structure c. update subworkflow template (used for nf-core subworkflows create)
  2. Update tools to check modules.json for dependencies using git-branch and org_path that are now specified (check_up_to_date > recreate_dependencies > get_components_to_install, for example)
  3. Change install/remove methods to use the dependencies that are described in meta.yml
  4. Create a test repo with hybrid subworkflow and write tests in nf-core/tools to make sure that a hybrid subworkflow can be installed as expected and not break subsequent nf-core functions.

As you pointed out earlier, testing would not be fully accounted for here, but until a solution is created, nf-core/modules can be configured to not accept hybrid subworkflows, while separate institutions can have the flexibility to create these subworkflows if they desire.

Please let me know your thoughts!

awgymer commented 8 months ago

So I agree mostly with the solution here, but I do see some issues.

  1. The code is not that easy to rewrite from my initial exploration. The assumptions we make and the current structure would involve overhauling several commands at once.

  2. It would be nice if nf-core tested hybrid workflows but honestly it's quite a burden and I think it's likely we settle for testing standards single-repo systems work and the hybrid-repo work is maintained as a best-effort.

I don't think nf-core would logically ever have hybrid code in itself. That just doesn't really make much sense to me and would make nf-core potentially dependent on external repositories or businesses which is... not ideal.

One possible route to making this work more easily would be for saubworkflows to package their modules with them (something I think @ewels supports in theory) but that comes with it's own headaches for tooling and (I literally just realised) would create a nightmare hellscape if we intend to support nested subworkflows still (which we currently do but I'm not sure is really wise).

ewels commented 8 months ago

What's wrong with module file inclusion with nested subworkflows?

It's just modules.json and flat files all the way down, no? 🐢 🌍

awgymer commented 8 months ago

It's more just the arbitrary depth nesting you could potentially end up with.

So

subworkflows
    Subwf1
        Modules
        Subworkflows
            Subwf2
                Modules 
                Subworkflows 
                    Subwf3
                        Modules 

And then you have to have code that can efficiently traverse this tree and keep track of versions.

Maybe it's not as bad as I'm imagining but that deeply nested structure just isn't that nice imo

anoronh4 commented 7 months ago

i'm not sure i fully understand how the nested subworkflows are supposed to work, probably because i haven't been present for previous discussions. maybe this is has already been discussed and decided against, but what about testing components using a fresh repo? meaning, instead of testing the modules repo code base directly: create a new pipeline project using nf-core create, install all dependencies in accordance with what's specified in meta.yml, and then run the test? since the nf-test folder is shipped with the module/subworkflow upon installation, i think that could be serviceable, even if the testing process takes a bit longer. just an idea. this MIGHT avoid the difficulties of nesting dependencies...i would imagine different workflows might need different versions of dependencies, but nf-core would not be the only package management ecosystem carrying that caveat.

whether or not hybrid workflows become a reality or not, should we attempt to fix the more specific issue that we've identified in this thread? i wanted to come back to what i wrote before:

it would be beneficial to everyone to handle missing dependencies more gracefully. currently the error is somewhat unhelpful and doesn't even state the subworkflow that i have to remove if i want to fix the problem. nf-core lint could be used, for example, to assert the missing dependencies and show a test fail/warn -- rather than a tools error.