Closed alxndrdiaz closed 1 year ago
This is the new PR for busco subworkflow. I am still working on fixing the linting errors.
Can you please resolve the conflicts listed as well? I will run the tests for you and update with results.
For the error `actions_ci: '.github/workflows/ci.yml' does not check minimum NF version` (nf-core linting / nf-core (push)): there is documentation that explains the format of this file: https://nf-co.re/tools/docs/latest/pipeline_lint_tests/actions_ci.html . It is also helpful to check an example file: https://github.com/nf-core/rnaseq/blob/master/.github/workflows/ci.yml
For the error related to the Python Black check: first install it using `pip install black`, then from the directory containing the pipeline repository run `black .`. This should fix the file formatting; finally, commit the changes.
`nextflow run . -profile test,singularity` is failing with error:
ERROR: Validation of pipeline parameters failed!
* --input: string [mMelMel3] does not match pattern ^\S+\.csv$ (mMelMel3)
This might be related to how the input is passed in `conf/test.config`:
// Input test data
input = "${projectDir}/assets/samplesheet.csv"
fasta = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/assembly/release/mMelMel3.1_paternal_haplotype/GCA_922984935.2.subset.fasta.gz"
Instead of:
input = "mMelMel3"
project = ".sandbox"
For `input`, both should work. I think you might need to edit the `nextflow_schema` (or other Groovy functions that check for a csv) to make sure `input` allows both input formats. The error you have above seems to be triggered because you are passing `input = mMelMel3`, not the csv.
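As a sketch of what that schema change could look like (the exact property layout in `nextflow_schema.json` may differ; the relaxed pattern here is an assumption, not the pipeline's actual schema):

```json
{
  "input": {
    "type": "string",
    "description": "Samplesheet csv, or a bare assembly/sample ID such as mMelMel3",
    "pattern": "^\\S+$"
  }
}
```

With the original pattern `^\S+\.csv$`, a bare ID like `mMelMel3` fails validation, which matches the error above; dropping the `\.csv` suffix requirement lets both forms through, at the cost of moving the csv check into the Groovy input-checking code.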
You will face issues with `project_basedir`. Earlier all projects were either based in `scratch123` or hyperlinked to it, but now they are split between `scratch123` and `scratch124`, so you might consider changing how your `input_check` works.
For my pipelines, I ask for the full project directory now instead of just the project name. Example: https://github.com/sanger-tol/genomenote/blob/update_dev/bin/tol_input.sh
For errors related to the formatting of `nextflow_schema.json`, use a local installation of prettier and run:
npx prettier --write nextflow_schema.json
Then, to check that the format was updated:
npx prettier --check nextflow_schema.json
Finally commit the changes.
Have you tested the pipeline with both types of input? On `gen3` you can start an interactive session and run your tests. If you want to submit jobs via `bsub`, do that via the interactive session too, since the head nodes are unavailable.
I tried testing the code `nextflow run . -profile sanger,test,singularity` but it fails:
Error executing process > 'NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:GOAT_TAXONSEARCH ([null, GCA_922984935.2.subset])'
I am pretty sure it has to do with the tag being in square brackets; LSF does not like that. The input to `goat/taxonsearch` is not being passed correctly. Can you please fix that?
I tried from an interactive session (and also submitted a job but it seems that it does not start running), but got the following error:
Error executing process > 'NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:GUNZIP (GCA_922984935.2.subset.fasta.gz)'
Caused by:
Failed to submit process to grid scheduler for execution
Command executed:
bsub
Command exit status:
255
....
It looks like the error is related to the `goat/taxonsearch` input (see code below), as you mentioned. Similar transformations are used within `busco_diamond_blastp.nf`, so I will have to check whether they need to be changed as well.
GOAT_TAXONSEARCH (
fasta.map { fa -> [fa[0], "${params.taxon}", "${params.taxa_file}" ? file("${params.taxa_file}") : []] }
)
As I understood it, the brackets that might be causing the error are the ones used in `[fa[0], "${params.taxon}", "${params.taxa_file}" ? file("${params.taxa_file}") : []]`, right?
The issue comes from the way `fasta` is passed in `workflows/blobtoolkit.nf`. The fasta channel you pass looks like this:
[[id:[null, GCA_922984935.2.subset]], [[id:GCA_922984935.2.subset], /lustre/scratch123/tol/teams/tolit/users/ps22/pipelines/blobtoolkit/work/93/d1f37d6d43a418a94f90c377d87016/GCA_922984935.2.subset.fasta]]
It should look like this:
[[id:GCA_922984935.2], /lustre/scratch123/tol/teams/tolit/users/ps22/pipelines/blobtoolkit/work/93/d1f37d6d43a418a94f90c377d87016/GCA_922984935.2.subset.fasta]
The channel was already in the correct format from `input_check`, so the modifications you added were breaking it. Use `view()` often and make sure you pass what you expect to pass.
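For example, a quick shape check on the channel (a sketch; the exact output channel name from `input_check` is an assumption):

```nextflow
// Each emission should be a flat [meta, file] pair, not nested further
INPUT_CHECK.out.fasta.view()
// expected shape: [[id:GCA_922984935.2], /path/to/GCA_922984935.2.subset.fasta]
```

Dropping a `view()` after each `map` makes it obvious at which step the extra nesting was introduced.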
Done
5ac0528 - alxndrdiaz, Tue Dec 6 18:28:48 2022 -0600 : fixed bad input in BUSCO_DIAMOND subworkflow
With this change, the output of `nextflow run . -profile sanger,test,singularity` now looks something like this:
[c1/0f6328] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:GUNZIP (GCA_922984935.2.subset.fasta... [100%] 1 of 1 ✔
[0d/545f4b] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv) [100%] 1 of 1 ✔
[a4/2b018d] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:GOAT_TAXONSEARCH (GCA_922984935.2.... [100%] 1 of 1 ✔
[fc/6d94dd] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:BUSCO (GCA_922984935.2.subset) [100%] 10 of 10 ✔
[9b/fd6b68] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:EXTRACT_BUSCO_GENES (GCA_922984935... [100%] 1 of 1 ✔
[bb/8d61c9] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:DIAMOND_BLASTP (GCA_922984935.2.su... [ 83%] 5 of 6, failed:
Error executing process > 'NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:DIAMOND_BLASTP (GCA_922984935.2.subset)'
Other issues:
1. There was no versions for BUSCO in the workflow. You need to add that.
2. Your workflow is also missing modules: `dumpsoftwareversions`, please add that back.
3. I would recommend you keep `params` to the workflow and only pass channels to the subworkflow.
4. Your formatting of the workflow and subworkflow needs some work. Take a look at the nf-core pipelines or one of ours. Example: https://github.com/sanger-tol/genomenote/tree/dev
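For item 2, the standard nf-core wiring looks roughly like this (a sketch; the include path and channel names are assumptions for this pipeline):

```nextflow
include { CUSTOM_DUMPSOFTWAREVERSIONS } from '../modules/nf-core/custom/dumpsoftwareversions/main'

workflow BLOBTOOLKIT {
    ch_versions = Channel.empty()

    // each module/subworkflow mixes its versions in, e.g.:
    // ch_versions = ch_versions.mix(BUSCO_DIAMOND.out.versions)

    // collate all versions.yml fragments into one file for MultiQC
    CUSTOM_DUMPSOFTWAREVERSIONS (
        ch_versions.unique().collectFile(name: 'collated_versions.yml')
    )
}
```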
Commits for (1) and (2):
d3ddbd2 - alxndrdiaz, Tue Dec 6 19:42:16 2022 -0600 : added ch_versions for BUSCO_DIAMOND subworkflow
fe957f0 - alxndrdiaz, Tue Dec 6 19:51:54 2022 -0600 : added CUSTOM_DUMPSOFTWAREVERSIONS to workflow
I am still working on (3) and (4).
Regarding the error with DIAMOND_BLASTP:
1. Tar the `busco_sequences`, which has the path: `${busco_dir}/*/*/busco_sequences`
2. Pass it as a `path` argument to `EXTRACT_BUSCO_GENES`; this will stage `busco_sequences` in the same folder as the `tsv` files and should get the step working.
3. While you are updating `EXTRACT_BUSCO_GENES`, also update the container to latest and add an error for conda, see here: https://github.com/nf-core/modules/blob/5e34754d42cd2d5d248ca8673c0a53cdf5624905/modules/nf-core/cellranger/mkref/main.nf
4. Also add a check for an empty `fasta` file after `extract_busco_genes` and then skip `diamond_blastp`.
Let me know if this is unclear. We are definitely getting to the finish line. 👍
@alxndrdiaz Further clarifications:
So the `tar` module should create the following structure:
archaea_odb10
-- full_table.tsv.gz
-- busco_sequences.tar.gz
Same for `bacteria_odb10` and `eukaryota_odb10`. Then these three folders need to be staged in the work directory for `extract_busco_genes`.
The input will then be:
btk pipeline extract-busco-genes \
--busco archaea_odb10/full_table.tsv.gz \
--busco bacteria_odb10/full_table.tsv.gz \
--busco eukaryota_odb10/full_table.tsv.gz \
--out GCA_922984935.2.subset_busco_genes.fasta
The folder name is important because the `extract_busco_genes.py` script then looks for `archaea_odb10/busco_sequences.tar.gz` and `archaea_odb10/full_table.tsv.gz` to create the fasta output. Does that make sense?
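A hypothetical sketch of such a `TAR` step (the process name, inputs, and commands are assumptions, not the actual module):

```nextflow
process TAR {
    tag "${meta.id}_${lineage}"

    input:
    tuple val(meta), val(lineage), path(full_table), path(busco_seqs)

    output:
    tuple val(meta), path(lineage), emit: dir

    script:
    """
    mkdir ${lineage}
    gzip -c ${full_table} > ${lineage}/full_table.tsv.gz
    tar -czf ${lineage}/busco_sequences.tar.gz ${busco_seqs}
    """
}
```

Emitting `path(lineage)` passes the whole lineage folder downstream, so `extract_busco_genes` sees `archaea_odb10/full_table.tsv.gz` and `archaea_odb10/busco_sequences.tar.gz` under the expected names.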
I already added the changes to prepare the input for the module (`arc`, `bac`, `euk` are paths from the tar module), but now there is a Nextflow error:
btk pipeline extract-busco-genes \\
--busco $arc/full_table.tsv.gz \\
--busco $bac/full_table.tsv.gz \\
--busco $euk/full_table.tsv.gz \\
--out ${prefix}_busco_genes.fasta
Error message:
Error executing process >
'NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:EXTRACT_BUSCO_GENES (GCA_922984935.2.subset)'
Caused by:
Process `NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:EXTRACT_BUSCO_GENES` input file name collision -- There are multiple input files for each of the following file names: busco_sequences.tar.gz, full_table.tsv.gz
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
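The collision happens because Nextflow stages all declared inputs flat in the work directory, so three files named `full_table.tsv.gz` (and three `busco_sequences.tar.gz`) clash. One possible workaround, assuming the three lineage outputs are passed individually, is to stage each one under its lineage name with `stageAs` (a sketch, not the actual module declaration):

```nextflow
input:
tuple val(meta), path(arc, stageAs: 'archaea_odb10'),
                 path(bac, stageAs: 'bacteria_odb10'),
                 path(euk, stageAs: 'eukaryota_odb10')
```

Alternatively, passing the lineage directories themselves (rather than the files inside them) avoids the clash, since the directory names already differ.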
Regarding the error with DIAMOND_BLASTP:
1. Tar the `busco_sequences` which has the path: `${busco_dir}/*/*/busco_sequences`
2. Pass it as a `path` argument to `EXTRACT_BUSCO_GENES`, this will stage `busco_sequences` in the same folder as the `tsv` files and should get the step working.
3. While you are updating the `EXTRACT_BUSCO_GENES` also update the container to latest and add an error for conda, see here: https://github.com/nf-core/modules/blob/5e34754d42cd2d5d248ca8673c0a53cdf5624905/modules/nf-core/cellranger/mkref/main.nf
4. Also add a check for empty `fasta` file after `extract_busco_genes` and then skip `diamond_blastp`
Let me know if this is unclear. We are definitely getting to the finish line. 👍
This is the condition used to process fasta files from extract_busco_genes:
// runs DIAMOND_BLASTP if fasta file from EXTRACT_BUSCO_GENE is not empty
empty_fasta = EXTRACT_BUSCO_GENES.out.fasta.map { meta,p -> file("$p").isEmpty() }
if ( empty_fasta == false ) {
dmd_input_ch = EXTRACT_BUSCO_GENES.out.fasta
}
else {
dmd_input_ch = Channel.empty()
}
DIAMOND_BLASTP (
dmd_input_ch,
blastp_db,
"${params.blastp_outext}",
"${params.blastp_cols}"
)
ch_versions = ch_versions.mix(DIAMOND_BLASTP.out.versions)
commits:
428cbe3 - @alxndrdiaz, Thu Jan 12 20:25:45 2023 -0600 : emit blastp_txt = DIAMOND_BLASTP.out.txt
52a3799 - @alxndrdiaz, Thu Jan 12 20:19:20 2023 -0600 : DIAMOND_BLASTP input channel is evaluated before eecution
Pipeline runs successfully with the `test` profile, but it does not execute DIAMOND_BLASTP or even list it as a process. Even if the `extract_busco_genes` output is empty, it should list the process and then skip it.
[38/fa2d50] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:GUNZIP (GCA_922984935.2.subset.fasta.gz) [100%] 1 of 1 ✔
[44/b97801] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv) [100%] 1 of 1 ✔
[37/8a65b8] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:GOAT_TAXONSEARCH (GCA_922984935.2.subset) [100%] 1 of 1 ✔
[64/8a1e9b] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:BUSCO (GCA_922984935.2.subset) [100%] 10 of 10 ✔
[55/5f7c32] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:TAR (GCA_922984935.2.subset) [100%] 1 of 1 ✔
[ca/fea9bd] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:EXTRACT_BUSCO_GENES (GCA_922984935.2.subset) [100%] 1 of 1 ✔
[3d/848c61] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:CUSTOM_DUMPSOFTWAREVERSIONS (1) [100%] 1 of 1 ✔
-[nf-core/blobtoolkit] Pipeline completed successfully-
It should look different now:
[e8/f8c663] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:GUNZIP (GCA_922984935.2.subset.fasta.gz) [100%] 1 of 1 ✔
[47/a82249] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv) [100%] 1 of 1 ✔
[cc/0ef692] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:GOAT_TAXONSEARCH (GCA_922984935.2.subset) [100%] 1 of 1 ✔
[fc/5ed408] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:BUSCO (GCA_922984935.2.subset) [100%] 10 of 10 ✔
[6d/15ee31] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:TAR (GCA_922984935.2.subset) [100%] 1 of 1 ✔
[a1/a9bc34] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:EXTRACT_BUSCO_GENES (GCA_922984935.2.subset) [100%] 1 of 1 ✔
[- ] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:DIAMOND_BLASTP -
[2e/2b480f] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:CUSTOM_DUMPSOFTWAREVERSIONS (1) [100%] 1 of 1 ✔
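The likely cause is that `empty_fasta` in the snippet above is a channel object, so `empty_fasta == false` is evaluated once at workflow-construction time and is never true; the `else` branch always wins and `dmd_input_ch` is `Channel.empty()`, which is why the process is never listed. A sketch of a per-emission check instead, reusing the channel names from the snippet above:

```nextflow
// Keep only emissions whose extracted fasta is non-empty;
// DIAMOND_BLASTP is then skipped naturally when nothing passes the filter
dmd_input_ch = EXTRACT_BUSCO_GENES.out.fasta
    .filter { meta, fasta -> !fasta.isEmpty() }

DIAMOND_BLASTP (
    dmd_input_ch,
    blastp_db,
    "${params.blastp_outext}",
    "${params.blastp_cols}"
)
```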
Can you undo the deletions you made related to the coverage subworkflow? When you merge, we want both of the subworkflows working. Preferably, reset the deletions rather than adding the files back; that way we maintain proper credits.
I could identify only one commit that was a file deletion from a commit related to the coverage subworkflow:
79f2b27 - alxndrdiaz, Mon Nov 28 20:19:31 2022 -0600 : deleted unused module create_bed.nf
To revert this commit:
git revert 79f2b27
Output:
[busco_clean 45bdd63] Revert "deleted unused module create_bed.nf"
1 file changed, 31 insertions(+)
create mode 100644 modules/local/create_bed.nf
Hi @alxndrdiaz. Priyanka's request was to remove the commit, e.g. through a rebase, rather than reverting it, so that the files retain the original creation attributes (date and author) in the Git history.
Also, this PR deletes more than just `create_bed.nf`. These are missing, and the commit that deletes them needs to be removed:
* docs/parameters.md
* modules/nf-core/fastawindows/
* modules/nf-core/mosdepth
* modules/nf-core/samtools/view
* subworkflows/local/coverage_stats.nf
* modules/local/input_tol.nf
The PR also modifies a lot of the existing files by removing existing code, documentation, configuration, etc., that Zaynab pushed to dev recently. I may have missed some, but there are:
* workflows/blobtoolkit.nf
* nextflow_schema.json
* nextflow.config
* docs/README.md
* docs/output.md
* docs/usage.md
* conf/modules.config
* conf/test_full.config
* assets/email_template.html
* assets/email_template.txt
* assets/multiqc_config.yml
* CITATIONS.md
* .github/workflows/ci.yml
And some files are unnecessarily modified:
* modules/local/create_bed.nf
I think what happened is that your commits remove those files and add them back to whatever stage they were at when you did the rebase. Because of that, you're missing all the changes that Zaynab pushed recently.
Can you please remove/rework those commits? I think there should be almost no file/line deletions in your PR at all. The diff should cleanly show additions: new files and new lines of code, typically modules, sub-workflows, the main workflow, modules.json.
Maybe that'll be very hard to do, and it'll be easier to forget about the existing commits and make new, clean ones onto a new branch that branches off the current `dev`. But as it is, we can't merge the current branch in (and add the extra stuff that Zaynab is working on, and move on to the other two subworkflows).
Ok. Then I will create a new branch from `dev` and open a new PR with the new commits. I will also remove the old busco_subworkflow branch once the PR is merged.
To speed up the process, instead of cherry picking the commits from here to new branch, just copy paste the changes and push them as a new commit. That will make sure there are no conflicts or unexpected issues. Message on Slack if you need help.
There was one subworkflow output missing (a .tsv table from BUSCO); I fixed it, then created a new branch (`busco_dev`) from `dev` and added/modified the files required for the busco subworkflow. New pull request: Busco dev #37
Superseded by #37
PR checklist
- [ ] Make sure your code lints (`nf-core lint`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [ ] Usage documentation in `docs/usage.md` is updated.
- [ ] Output documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).