Open anoronh4 opened 8 months ago
So I think that the actual errors in the second part are unrelated to the warnings from the first.
One thing I am intrigued by though is gatk4/markduplicates/spark. There shouldn't be the possibility of a module being nested more than once, so I don't understand where the second slash comes from. And indeed the module in nf-core is gatk4/markduplicates.
Is there any chance you can share the subworkflow main.nf file?
Are you able to supply your modules.json file as it appears after the subworkflow install but before the attempted module install?
> So I think that the actual errors in the second part are unrelated to the warnings from the first. One thing I am intrigued by though is gatk4/markduplicates/spark. There shouldn't be the possibility of a module being nested more than once, so I don't understand where the second slash comes from. And indeed the module in nf-core is gatk4/markduplicates. Is there any chance you can share the subworkflow main.nf file?
The module isn't actually nested more than once; nf-core/tools is just parsing it wrong when it comes to spark tools for some reason? Here are my imports in the main.nf file:
It also didn't allow me to write gatk4/markduplicatesspark in the meta.yml; it gave a linting error unless i added the extra slash.
you think that is the issue? i thought it was separate but i can look at that further.
> Are you able to supply your modules.json file as it appears after the subworkflow install but before the attempted module install?
Sure!
```json
{
  "name": "nf-core/stuff",
  "homePage": "https://github.com/nf-core/stuff",
  "repos": {
    "git@github.com:anoronh4/demo_hybrid.git": {
      "subworkflows": {
        "msk": {
          "bwa_markdup_bqsr": {
            "branch": "main",
            "git_sha": "71bf8700fa52a3e600390454dd75a9b3a30e4a89",
            "installed_by": ["subworkflows"]
          }
        }
      }
    },
    "https://github.com/nf-core/modules.git": {
      "modules": {
        "nf-core": {
          "custom/dumpsoftwareversions": {
            "branch": "master",
            "git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
            "installed_by": ["modules"]
          },
          "fastqc": {
            "branch": "master",
            "git_sha": "bd8092b67b5103bdd52e300f75889442275c3117",
            "installed_by": ["modules"]
          },
          "gatk4/applybqsr": {
            "branch": "master",
            "git_sha": "516189e968feb4ebdd9921806988b4c12b4ac2dc",
            "installed_by": ["modules"]
          },
          "multiqc": {
            "branch": "master",
            "git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
            "installed_by": ["modules"]
          }
        }
      }
    }
  }
}
```
After this step i get:

```
$ nf-core modules install bwa/mem

nf-core/tools version 2.10 - https://nf-co.re

ERROR 'modules'
```
I have a feeling that GATK_MARKDUPLICATES_SPARK is an issue, and if that's in the modules repo we may have a problem. The code that parses these things is "dumb": I think it just splits the name on underscores and replaces them with slashes.
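If the parsing really is a naive underscore-to-slash conversion, a minimal Python sketch of that behaviour (hypothetical, not the actual tools code) would be:

```python
# Hypothetical sketch of the naive parsing described above: split the
# Groovy include name on underscores and rejoin with slashes. A tool
# name like MARKDUPLICATES_SPARK then looks like a doubly nested module.
def include_name_to_path(include_name: str) -> str:
    return "/".join(include_name.lower().split("_"))

print(include_name_to_path("GATK4_APPLYBQSR"))            # gatk4/applybqsr
print(include_name_to_path("GATK4_MARKDUPLICATES_SPARK")) # gatk4/markduplicates/spark
```

That would explain both the phantom second slash and the linting error that demanded `gatk4/markduplicates/spark` in the meta.yml.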
Back to the actual issue at hand. The ERROR 'modules' comes from the fact that there is no modules section under your remote in the modules.json.
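The bare ERROR 'modules' message is consistent with an unguarded dictionary lookup; a minimal reproduction of that failure mode (illustrative only, not the actual tools code) is:

```python
# A repo entry that, like the custom remote above, has only a
# "subworkflows" section and no "modules" section.
repo_entry = {"subworkflows": {"msk": {"bwa_markdup_bqsr": {}}}}

try:
    repo_entry["modules"]  # unguarded lookup, as the error message suggests
except KeyError as err:
    message = f"ERROR {err}"

print(message)  # ERROR 'modules'
```

A `repo_entry.get("modules", {})` (or an explicit warning naming the remote) would avoid the hard crash.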
So I tried adding it, and bwa/mem is installed, but with a rather unhelpful side-effect: the entire custom remote is dropped:
```json
{
  "name": "nf-core/testpipeline",
  "homePage": "https://github.com/nf-core/testpipeline",
  "repos": {
    "https://github.com/nf-core/modules.git": {
      "modules": {
        "nf-core": {
          "bwa/mem": {
            "branch": "master",
            "git_sha": "516189e968feb4ebdd9921806988b4c12b4ac2dc",
            "installed_by": ["modules"]
          }
        }
      },
      "subworkflows": {
        "nf-core": {}
      }
    }
  }
}
```
Would you be able to copy a module used in the subworkflow into the modules directory of that remote by any chance so I can test something (I don't have a test remote of my own handy, sorry)?
i've opened a different branch called one_module and added bwa/mem and bwa/index (so technically it ended up being two modules haha), and the subworkflow was changed to use the new copy of bwa/mem. i think i know what you're getting at, but i think the result will be similar: the "modules" key will no longer cause a failure, but there will be another KeyError for every module name that is not from my repo. so the error will look something like ERROR 'gatk4/markduplicates'. That's caused by this line:
https://github.com/nf-core/tools/blob/9ab896c140ee3fde816d250f3d8353aee5547d78/nf_core/modules/modules_json.py#L1169
and here's the updated modules.json
the underlying issue here is that there's no way to tell nf-core where a given module ACTUALLY came from -- in fact nf-core/tools ignores the org_path (msk vs nf-core) information in the subworkflow definition. the modules.json is actually highly specific, but grabbing the dependencies of the subworkflow is not. if we had a more specific dependency definition, i think that would be a more robust way to handle them.
after adding the new subworkflow my modules.json file looks like this:
```json
{
  "name": "nf-core/stuff",
  "homePage": "https://github.com/nf-core/stuff",
  "repos": {
    "git@github.com:anoronh4/demo_hybrid.git": {
      "modules": {
        "msk": {
          "bwa/mem": {
            "branch": "one_module",
            "git_sha": "acc4f68fbfda2075974c90587a20b693d0a64ebb",
            "installed_by": ["bwa_markdup_bqsr"]
          }
        }
      },
      "subworkflows": {
        "msk": {
          "bwa_markdup_bqsr": {
            "branch": "one_module",
            "git_sha": "acc4f68fbfda2075974c90587a20b693d0a64ebb",
            "installed_by": ["subworkflows"]
          }
        }
      }
    },
    "https://github.com/nf-core/modules.git": {
      "modules": {
        "nf-core": {
          "custom/dumpsoftwareversions": {
            "branch": "master",
            "git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
            "installed_by": ["modules"]
          },
          "fastqc": {
            "branch": "master",
            "git_sha": "bd8092b67b5103bdd52e300f75889442275c3117",
            "installed_by": ["modules"]
          },
          "gatk4/applybqsr": {
            "branch": "master",
            "git_sha": "516189e968feb4ebdd9921806988b4c12b4ac2dc",
            "installed_by": ["modules"]
          },
          "multiqc": {
            "branch": "master",
            "git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
            "installed_by": ["modules"]
          }
        }
      }
    }
  }
}
```
and trying to install anything else indeed gives me: ERROR 'gatk4/markduplicates'
Ok yes. You have nailed the issue here. That get_components_to_install makes the blind assumption that any subworkflows or modules it finds are being installed from the same remote.
I would fully support a better way of defining dependent modules, but it would require buy-in from both tools devs and the wider nf-core community, because it would affect how subworkflows are written and maintained.
In theory we already track this information in the meta.yml (i.e. for bam_create_som_pon_gatk):

```yaml
components:
  - gatk4/mutect2
  - gatk4/genomicsdbimport
  - gatk4/createsomaticpanelofnormals
```
I would be in favour of leveraging this to determine the dependencies, but also to allow us to provide remotes.
The major issue within nf-core of course would still be testing; whilst I appreciate your solution, I'm not sure it is really workable for everyone.
I've thought about what it would be like to use the meta.yml as well, something like this:
```yaml
components:
  - name: gatk4/mutect2
  - name: gatk4/genomicsdbimport
    org_path: custom-remote
    git-remote: https://github.com/custom-remote/modules.git
```
where git-remote and org_path are optional and assumed to be "self" if not explicitly defined. the commit sha and branch could be optionally defined as well, but otherwise have default behavior. it would be the burden of whoever is adding or updating the subworkflow to add this information if they care about it.
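Resolving those defaults could be as simple as the following sketch (the field names and "self" fallback are taken from the proposal above; this is not existing tools behaviour):

```python
# Resolve one component entry from the proposed meta.yml schema:
# git-remote and org_path fall back to the declaring ("self") repo.
def resolve_component(component: dict, self_remote: str, self_org: str) -> dict:
    return {
        "name": component["name"],
        "org_path": component.get("org_path", self_org),
        "git_remote": component.get("git-remote", self_remote),
    }

self_remote = "https://github.com/nf-core/modules.git"
components = [
    {"name": "gatk4/mutect2"},
    {
        "name": "gatk4/genomicsdbimport",
        "org_path": "custom-remote",
        "git-remote": "https://github.com/custom-remote/modules.git",
    },
]
resolved = [resolve_component(c, self_remote, "nf-core") for c in components]

print(resolved[0]["git_remote"])  # https://github.com/nf-core/modules.git
print(resolved[1]["org_path"])    # custom-remote
```

With a fully resolved (name, org_path, git_remote) triple per dependency, the install code would no longer need to guess which remote a module came from.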
i think this approach may actually be achievable. but even if it does not have buy-in from the tools devs and the wider community, i still think it would be beneficial to everyone to handle missing dependencies more gracefully. currently the error is somewhat unhelpful and doesn't even state the subworkflow that i have to remove if i want to fix the problem. nf-core lint could be used, for example, to assert the missing dependencies and show a test fail/warn -- rather than a tools error. how do we find out what the tools devs are willing to green-light? at least to get some closure on this discussion, if not move forward with a new solution.
agree that what i came up with for testing is not ideal and quite hacky, but by allowing users to start defining dependencies, i think that's when other testing solutions might start to follow. if no one can use hybrid subworkflows then no one will tackle the question of testing it either.
i have outlined some steps that i think are achievable:

1. extend the subworkflow meta.yml to declare dependencies and their remotes (and scaffold this via nf-core subworkflows create)
2. update the dependency-recreation code path to use those declarations (check_up_to_date > recreate_dependencies > get_components_to_install, for example)

As you pointed out earlier, testing would not be fully accounted for here, but until a solution is created, nf-core/modules can be configured to not accept hybrid subworkflows, while separate institutions can have the flexibility to create these subworkflows if they desire.
Please let me know your thoughts!
So I agree mostly with the solution here, but I do see some issues.
The code is not that easy to rewrite, from my initial exploration. The assumptions we make and the current structure would mean overhauling several commands at once.
It would be nice if nf-core tested hybrid workflows, but honestly it's quite a burden, and I think it's likely we settle for ensuring the standard single-repo setup works, with the hybrid-repo work maintained as a best effort.
I don't think nf-core would logically ever have hybrid code in itself. That just doesn't really make much sense to me and would make nf-core potentially dependent on external repositories or businesses, which is... not ideal.
One possible route to making this work more easily would be for subworkflows to package their modules with them (something I think @ewels supports in theory), but that comes with its own headaches for tooling and (I literally just realised) would create a nightmare hellscape if we intend to keep supporting nested subworkflows (which we currently do, but I'm not sure is really wise).
What's wrong with module file inclusion with nested subworkflows? It's just modules.json and flat files all the way down, no? 🐢 🌍
It's more just the arbitrary-depth nesting you could potentially end up with. So:

```
subworkflows/
└── subwf1/
    ├── modules/
    └── subworkflows/
        └── subwf2/
            ├── modules/
            └── subworkflows/
                └── subwf3/
                    └── modules/
```

And then you have to have code that can efficiently traverse this tree and keep track of versions.
Maybe it's not as bad as I'm imagining, but that deeply nested structure just isn't that nice imo.
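For what it's worth, walking such a tree doesn't have to be complicated; a recursive sketch (assuming a modules.json-like nesting with "modules" and "subworkflows" keys, not the actual tools code) might look like:

```python
# Recursively collect (path, git_sha) pairs from an arbitrarily
# nested subworkflow tree with "modules" and "subworkflows" keys.
def collect_versions(tree: dict, prefix: str = "") -> dict:
    versions = {}
    for name, entry in tree.get("modules", {}).items():
        versions[prefix + name] = entry["git_sha"]
    for name, sub in tree.get("subworkflows", {}).items():
        versions.update(collect_versions(sub, prefix=f"{prefix}{name}/"))
    return versions

tree = {
    "subworkflows": {
        "subwf1": {
            "modules": {"bwa/mem": {"git_sha": "acc4f68"}},
            "subworkflows": {
                "subwf2": {"modules": {"fastqc": {"git_sha": "bd8092b"}}},
            },
        },
    },
}
print(collect_versions(tree))
# {'subwf1/bwa/mem': 'acc4f68', 'subwf1/subwf2/fastqc': 'bd8092b'}
```

The harder problem is probably not the traversal itself but deciding what to do when two branches of the tree pin different versions of the same module.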
i'm not sure i fully understand how the nested subworkflows are supposed to work, probably because i haven't been present for previous discussions. maybe this has already been discussed and decided against, but what about testing components using a fresh repo? meaning, instead of testing the modules repo code base directly: create a new pipeline project using nf-core create, install all dependencies in accordance with what's specified in the meta.yml, and then run the test. since the nf-test folder is shipped with the module/subworkflow upon installation, i think that could be serviceable, even if the testing process takes a bit longer. just an idea. this MIGHT avoid the difficulties of nesting dependencies... i would imagine different workflows might need different versions of dependencies, but nf-core would not be the only package management ecosystem carrying that caveat.
whether or not hybrid workflows become a reality, should we attempt to fix the more specific issue that we've identified in this thread? i wanted to come back to what i wrote before:

> it would be beneficial to everyone to handle missing dependencies more gracefully. currently the error is somewhat unhelpful and doesn't even state the subworkflow that i have to remove if i want to fix the problem. nf-core lint could be used, for example, to assert the missing dependencies and show a test fail/warn -- rather than a tools error.
Description of feature
i am attempting to install a subworkflow from a custom repository, but i am having issues installing the missing modules from the nf-core/modules repo:
the subworkflow installs, but I get a few errors telling me i should install certain modules. however, when i attempt to install them from nf-core, i get weird and uninformative errors:
i may be wrong, but i think this is happening due to the recreate_dependencies code: https://github.com/nf-core/tools/blob/9ab896c140ee3fde816d250f3d8353aee5547d78/nf_core/modules/modules_json.py#L1158-L1181 -- where it keeps parsing the subworkflow file for dependencies and failing when the dependency cannot be found -- even though i'm trying to install the dependency.
I know that nf-core currently does not support hybrid-repo functionality. still, i was wondering if it might make sense to skip or throw a warning rather than make the tool quit altogether. OR filter out dependencies that require modules from other repos and exclude them from any checking? not sure how this will play out in the broader sense.