sanger-tol / blobtoolkit

Nextflow pipeline for BlobToolKit, part of the Sanger ToL production suite
https://pipelines.tol.sanger.ac.uk/blobtoolkit
MIT License

Busco subworkflow clean #36

Closed alxndrdiaz closed 1 year ago

alxndrdiaz commented 1 year ago

PR checklist

alxndrdiaz commented 1 year ago

This is the new PR for the busco subworkflow. I am still working on fixing the linting errors.

priyanka-surana commented 1 year ago

Can you please resolve the conflicts listed as well? I will run the tests for you and update with results.

alxndrdiaz commented 1 year ago

For the error `actions_ci: '.github/workflows/ci.yml' does not check minimum NF version` (nf-core linting / nf-core (push)): the documentation explains the expected format of this file: https://nf-co.re/tools/docs/latest/pipeline_lint_tests/actions_ci.html. It is also helpful to check an example file: https://github.com/nf-core/rnaseq/blob/master/.github/workflows/ci.yml

alxndrdiaz commented 1 year ago

For the error related to the PythonBlack check: first install it with `pip install black`, then from the directory containing the pipeline repository run `black .`; this should fix the file formatting. Finally, commit the changes.

alxndrdiaz commented 1 year ago

nextflow run . -profile test,singularity is failing with error:

ERROR: Validation of pipeline parameters failed!
* --input: string [mMelMel3] does not match pattern ^\S+\.csv$ (mMelMel3)

Might be related to how the input is passed in conf/test.config:

// Input test data
    input   = "${projectDir}/assets/samplesheet.csv"
    fasta = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/assembly/release/mMelMel3.1_paternal_haplotype/GCA_922984935.2.subset.fasta.gz"

Instead of:

 input   = "mMelMel3"
 project = ".sandbox"
priyanka-surana commented 1 year ago

For input, both should work. I think you might need to edit nextflow_schema.json or the other Groovy functions that check for a CSV so that input allows both formats. The error you have above seems to be triggered because you are passing `input = mMelMel3`, not the CSV.

You will face issues with project_basedir. Earlier all projects were either based in scratch123 or hyperlinked to it, but now they are split between scratch123 and scratch124, so you might consider changing how your input_check works.

For my pipelines, I ask for the full project directory now instead of just the project name. Example: https://github.com/sanger-tol/genomenote/blob/update_dev/bin/tol_input.sh
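One way to make the schema accept either input form is an alternation in the `--input` pattern. A sketch of the idea follows; the exact regular expression is an assumption for illustration, not the pipeline's actual schema:

```python
import re

# Hypothetical combined pattern: either a path ending in .csv,
# or a bare ToL specimen ID such as "mMelMel3" (alphanumeric, no extension).
INPUT_PATTERN = re.compile(r"^\S+\.csv$|^[A-Za-z0-9]+$")

print(bool(INPUT_PATTERN.fullmatch("assets/samplesheet.csv")))  # True
print(bool(INPUT_PATTERN.fullmatch("mMelMel3")))                # True
print(bool(INPUT_PATTERN.fullmatch("not a csv.txt")))           # False
```

The same alternation would go into the `pattern` field of the `input` parameter in nextflow_schema.json.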

alxndrdiaz commented 1 year ago

For errors related to formatting of nextflow_schema.json, use a local installation of prettier and run:

npx prettier --write nextflow_schema.json

Then, to check that the format was updated:

npx prettier --check nextflow_schema.json

Finally commit the changes.

priyanka-surana commented 1 year ago

Have you tested the pipeline with both types of input? On gen3 you can start an interactive session and run your tests. If you want to submit jobs via bsub, do that via the interactive session too, since the head nodes are unavailable.

priyanka-surana commented 1 year ago

I tried testing the code nextflow run . -profile sanger,test,singularity but it fails.

Error executing process > 'NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:GOAT_TAXONSEARCH ([null, GCA_922984935.2.subset])'

I am pretty sure it has to do with the tag being in square brackets. LSF does not like that. The input to goat/taxonsearch is not being passed correctly. Can you please fix that?

alxndrdiaz commented 1 year ago

Have you tested the pipeline with both types of input? On gen3 you can start an interactive session and run your tests. If you want to submit jobs via bsub, do that via the interactive session too, since the head nodes are unavailable.

I tried from an interactive session (and also submitted a job, but it seems it does not start running) and got the following error:

Error executing process > 'NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:GUNZIP (GCA_922984935.2.subset.fasta.gz)'

Caused by:
  Failed to submit process to grid scheduler for execution

Command executed:
  bsub

Command exit status:
  255
....
alxndrdiaz commented 1 year ago

I tried testing the code nextflow run . -profile sanger,test,singularity but it fails.

Error executing process > 'NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:GOAT_TAXONSEARCH ([null, GCA_922984935.2.subset])'

I am pretty sure it has to do with the tag being in square brackets. LSF does not like that. The input to goat/taxonsearch is not being passed correctly. Can you please fix that?

It looks like the error is related to goat/taxonsearch input (see code below) as you mentioned. Similar transformations are used within busco_diamond_blastp.nf so I will have to check if they need to be changed as well.

GOAT_TAXONSEARCH (
    fasta.map { fa -> [fa[0], "${params.taxon}", "${params.taxa_file}" ? file("${params.taxa_file}") : []] }
    )

As I understood the brackets that might be causing the error are the ones that are used in [fa[0], "${params.taxon}", "${params.taxa_file}" ? file("${params.taxa_file}") : []], right?
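A separate caveat with that snippet: `"${params.taxa_file}"` interpolates an unset parameter into the string `"null"`, which is truthy in Groovy, so the ternary never falls through to `[]`. Testing the raw parameter avoids this; a sketch, not the merged code:

```nextflow
// test the parameter itself, not its string interpolation --
// "${params.taxa_file}" is the truthy string "null" when the param is unset
taxa_file = params.taxa_file ? file(params.taxa_file) : []
GOAT_TAXONSEARCH (
    fasta.map { fa -> [ fa[0], params.taxon, taxa_file ] }
)
```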

priyanka-surana commented 1 year ago

The issue comes from the way fasta is passed in workflows/blobtoolkit.nf. The fasta channel you pass looks like this:

[[id:[null, GCA_922984935.2.subset]], [[id:GCA_922984935.2.subset], /lustre/scratch123/tol/teams/tolit/users/ps22/pipelines/blobtoolkit/work/93/d1f37d6d43a418a94f90c377d87016/GCA_922984935.2.subset.fasta]]

It should look like this:

[[id:GCA_922984935.2], /lustre/scratch123/tol/teams/tolit/users/ps22/pipelines/blobtoolkit/work/93/d1f37d6d43a418a94f90c377d87016/GCA_922984935.2.subset.fasta]

The channel was already in the correct format from input_check, so the modifications you added were breaking it. Use view() often and make sure you pass what you expect to pass.
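That habit can be as simple as an inline view() before the subworkflow call. An illustrative sketch (the subworkflow signature here is simplified):

```nextflow
// inspect the channel shape: each item should be [ [id: ...], fasta ]
ch_fasta = INPUT_CHECK.out.fasta
ch_fasta.view { "fasta channel: ${it}" }
BUSCO_DIAMOND ( ch_fasta )
```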

priyanka-surana commented 1 year ago

Other issues:

  1. There were no versions captured for BUSCO in the workflow. You need to add that.
  2. Your workflow is also missing the `custom/dumpsoftwareversions` module; please add that back.
  3. I would recommend you keep `params` in the workflow and only pass channels to the subworkflow.
  4. Your formatting of the workflow and subworkflow needs some work. Take a look at the nf-core pipelines or one of ours. Example: https://github.com/sanger-tol/genomenote/tree/dev
alxndrdiaz commented 1 year ago

The issue comes from the way fasta is passed in workflows/blobtoolkit.nf. The fasta channel you pass looks like this:

[[id:[null, GCA_922984935.2.subset]], [[id:GCA_922984935.2.subset], /lustre/scratch123/tol/teams/tolit/users/ps22/pipelines/blobtoolkit/work/93/d1f37d6d43a418a94f90c377d87016/GCA_922984935.2.subset.fasta]]

It should look like this:

[[id:GCA_922984935.2], /lustre/scratch123/tol/teams/tolit/users/ps22/pipelines/blobtoolkit/work/93/d1f37d6d43a418a94f90c377d87016/GCA_922984935.2.subset.fasta]

The channel was already in the correct format from input_check, so the modifications you added were breaking it. Use view() often and make sure you pass what you expect to pass.

Done: 5ac0528 - alxndrdiaz, Tue Dec 6 18:28:48 2022 -0600 : fixed bad input in BUSCO_DIAMOND subworkflow. With this change the output now looks something like this:

nextflow run . -profile sanger,test,singularity

[c1/0f6328] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:GUNZIP (GCA_922984935.2.subset.fasta... [100%] 1 of 1 ✔
[0d/545f4b] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)     [100%] 1 of 1 ✔
[a4/2b018d] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:GOAT_TAXONSEARCH (GCA_922984935.2.... [100%] 1 of 1 ✔
[fc/6d94dd] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:BUSCO (GCA_922984935.2.subset)        [100%] 10 of 10 ✔
[9b/fd6b68] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:EXTRACT_BUSCO_GENES (GCA_922984935... [100%] 1 of 1 ✔
[bb/8d61c9] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:DIAMOND_BLASTP (GCA_922984935.2.su... [ 83%] 5 of 6, failed:
Error executing process > 'NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:DIAMOND_BLASTP (GCA_922984935.2.subset)'
alxndrdiaz commented 1 year ago

Other issues:

1. There were no versions captured for BUSCO in the workflow. You need to add that.

2. Your workflow is also missing the `custom/dumpsoftwareversions` module; please add that back.

3. I would recommend you keep `params` in the workflow and only pass channels to the subworkflow.

4. Your formatting of the workflow and subworkflow needs some work. Take a look at the nf-core pipelines or one of ours. Example: https://github.com/sanger-tol/genomenote/tree/dev

Commits for (1) and (2):

d3ddbd2 - alxndrdiaz, Tue Dec 6 19:42:16 2022 -0600 : added ch_versions for BUSCO_DIAMOND subworkflow
fe957f0 - alxndrdiaz, Tue Dec 6 19:51:54 2022 -0600 : added CUSTOM_DUMPSOFTWAREVERSIONS to workflow

I am still working on (3) and (4).

priyanka-surana commented 1 year ago

Regarding the error with DIAMOND_BLASTP:

  1. Tar the busco_sequences which has the path: ${busco_dir}/*/*/busco_sequences
  2. Pass it as a path argument to EXTRACT_BUSCO_GENES, this will stage busco_sequences in the same folder as the tsv files and should get the step working.
  3. While you are updating the EXTRACT_BUSCO_GENES also update the container to latest and add an error for conda, see here: https://github.com/nf-core/modules/blob/5e34754d42cd2d5d248ca8673c0a53cdf5624905/modules/nf-core/cellranger/mkref/main.nf
  4. Also add a check for empty fasta file after extract_busco_genes and then skip diamond_blastp

Let me know if this is unclear. We are definitely getting to the finish line. 👍
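For step 3, nf-core modules guard against Conda profiles with an early error in the script block. The guard has taken a couple of shapes across nf-core module versions; one sketch (wording and profile list are illustrative, not copied from the linked module):

```nextflow
script:
// abort early when running under a Conda/Mamba profile, since this
// module only ships a container image
if (workflow.profile.tokenize(',').intersect(['conda', 'mamba']).size() >= 1) {
    error "EXTRACT_BUSCO_GENES module does not support Conda. Please use Docker / Singularity / Podman instead."
}
```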

priyanka-surana commented 1 year ago

@alxndrdiaz Further clarifications: So the tar module should create the following structure:

archaea_odb10
-- full_table.tsv.gz
-- busco_sequences.tar.gz

Same for bacteria_odb10 and eukaryota_odb10. Then these three folders need to be staged in the work directory for extract_busco_genes.

The input will then be:

btk pipeline extract-busco-genes \
    --busco archaea_odb10/full_table.tsv.gz \
    --busco bacteria_odb10/full_table.tsv.gz \
    --busco eukaryota_odb10/full_table.tsv.gz \
    --out GCA_922984935.2.subset_busco_genes.fasta

The folder name is important because then the extract_busco_genes.py script looks for archaea_odb10/busco_sequences.tar.gz to create the fasta output. Does that make sense?
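The layout above can be mocked to sanity-check the staging logic before wiring it into the tar module. A throwaway sketch; file contents and sequence names are placeholders:

```shell
set -e
work=$(mktemp -d)
for lineage in archaea_odb10 bacteria_odb10 eukaryota_odb10; do
    mkdir -p "$work/$lineage/busco_sequences"
    printf '>g1\nMKV\n' > "$work/$lineage/busco_sequences/g1.faa"
    printf 'id\tstatus\n' | gzip > "$work/$lineage/full_table.tsv.gz"
    # archive busco_sequences next to the table, as extract_busco_genes.py expects
    tar -czf "$work/$lineage/busco_sequences.tar.gz" -C "$work/$lineage" busco_sequences
    rm -r "$work/$lineage/busco_sequences"
done
ls "$work/archaea_odb10"
```

Each lineage directory then contains exactly `full_table.tsv.gz` and `busco_sequences.tar.gz`, matching the structure described above.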

alxndrdiaz commented 1 year ago

I already added the changes to prepare the input for the module (arc, bac, euk are paths from tar module) but now there is a Nextflow error:

btk pipeline extract-busco-genes \\
        --busco $arc/full_table.tsv.gz \\
        --busco $bac/full_table.tsv.gz \\
        --busco $euk/full_table.tsv.gz \\
        --out ${prefix}_busco_genes.fasta

Error message:

Error executing process > 
'NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:EXTRACT_BUSCO_GENES (GCA_922984935.2.subset)'

Caused by:
  Process `NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:EXTRACT_BUSCO_GENES` input file name collision -- There are multiple input files for each of the following file names: busco_sequences.tar.gz, full_table.tsv.gz

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
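The collision arises because Nextflow stages input files flat into the task work directory, so three inputs all named `full_table.tsv.gz` (or `busco_sequences.tar.gz`) cannot coexist there. Staging the parent lineage directories, whose names are distinct, is what avoids it. A minimal illustration of the distinct-name layout (paths are placeholders):

```shell
set -e
stage=$(mktemp -d)
# distinct parent directories let identically named files coexist
for lineage in archaea_odb10 bacteria_odb10 eukaryota_odb10; do
    mkdir -p "$stage/$lineage"
    printf 'id\tstatus\n' | gzip > "$stage/$lineage/full_table.tsv.gz"
done
ls "$stage"
```

In module terms, that means declaring the inputs as the three directories rather than the individual files inside them.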
alxndrdiaz commented 1 year ago

Regarding the error with DIAMOND_BLASTP:

1. Tar the `busco_sequences` which has the path: `${busco_dir}/*/*/busco_sequences`

2. Pass it as a `path` argument to `EXTRACT_BUSCO_GENES`, this will stage `busco_sequences` in the same folder as the `tsv` files and should get the step working.

3. While you are updating the `EXTRACT_BUSCO_GENES` also update the container to latest and add an error for conda, see here: https://github.com/nf-core/modules/blob/5e34754d42cd2d5d248ca8673c0a53cdf5624905/modules/nf-core/cellranger/mkref/main.nf

4. Also add a check for empty `fasta` file after `extract_busco_genes` and then skip `diamond_blastp`

Let me know if this is unclear. We are definitely getting to the finish line. 👍

This is the condition used to process fasta files from extract_busco_genes:

    // run DIAMOND_BLASTP only if the fasta file from EXTRACT_BUSCO_GENES is not empty
    empty_fasta = EXTRACT_BUSCO_GENES.out.fasta.map { meta, p -> file("$p").isEmpty() }

    if ( empty_fasta == false ) {
        dmd_input_ch = EXTRACT_BUSCO_GENES.out.fasta
    }
    else {
        dmd_input_ch = Channel.empty()
    }

    DIAMOND_BLASTP (
        dmd_input_ch,
        blastp_db,
        "${params.blastp_outext}",
        "${params.blastp_cols}"
    )
    ch_versions = ch_versions.mix(DIAMOND_BLASTP.out.versions)
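An editorial note on the condition above: `empty_fasta` is itself a channel, not a boolean, so `empty_fasta == false` compares a channel object to a boolean and never branches as intended. Filtering at the channel level is one alternative; a sketch, not the code that was merged:

```nextflow
// keep only tuples whose fasta file is non-empty
dmd_input_ch = EXTRACT_BUSCO_GENES.out.fasta.filter { meta, p -> !file("$p").isEmpty() }
```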

commits:

428cbe3 - @alxndrdiaz, Thu Jan 12 20:25:45 2023 -0600 : emit blastp_txt = DIAMOND_BLASTP.out.txt
52a3799 - @alxndrdiaz, Thu Jan 12 20:19:20 2023 -0600 : DIAMOND_BLASTP input channel is evaluated before execution
priyanka-surana commented 1 year ago

The pipeline runs successfully with the test profile, but it does not execute DIAMOND_BLASTP or even list it as a process. Even if the extract_busco_genes output is empty, it should list the process and then skip it.

[38/fa2d50] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:GUNZIP (GCA_922984935.2.subset.fasta.gz)       [100%] 1 of 1 ✔
[44/b97801] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)            [100%] 1 of 1 ✔
[37/8a65b8] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:GOAT_TAXONSEARCH (GCA_922984935.2.subset)    [100%] 1 of 1 ✔
[64/8a1e9b] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:BUSCO (GCA_922984935.2.subset)               [100%] 10 of 10 ✔
[55/5f7c32] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:TAR (GCA_922984935.2.subset)                 [100%] 1 of 1 ✔
[ca/fea9bd] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:EXTRACT_BUSCO_GENES (GCA_922984935.2.subset) [100%] 1 of 1 ✔
[3d/848c61] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:CUSTOM_DUMPSOFTWAREVERSIONS (1)                            [100%] 1 of 1 ✔
-[nf-core/blobtoolkit] Pipeline completed successfully-
alxndrdiaz commented 1 year ago

The pipeline runs successfully with the test profile, but it does not execute DIAMOND_BLASTP or even list it as a process. Even if the extract_busco_genes output is empty, it should list the process and then skip it.

[38/fa2d50] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:GUNZIP (GCA_922984935.2.subset.fasta.gz)       [100%] 1 of 1 ✔
[44/b97801] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)            [100%] 1 of 1 ✔
[37/8a65b8] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:GOAT_TAXONSEARCH (GCA_922984935.2.subset)    [100%] 1 of 1 ✔
[64/8a1e9b] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:BUSCO (GCA_922984935.2.subset)               [100%] 10 of 10 ✔
[55/5f7c32] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:TAR (GCA_922984935.2.subset)                 [100%] 1 of 1 ✔
[ca/fea9bd] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:EXTRACT_BUSCO_GENES (GCA_922984935.2.subset) [100%] 1 of 1 ✔
[3d/848c61] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:CUSTOM_DUMPSOFTWAREVERSIONS (1)                            [100%] 1 of 1 ✔
-[nf-core/blobtoolkit] Pipeline completed successfully-

After the fix, DIAMOND_BLASTP is now listed (and skipped when its input is empty):

[e8/f8c663] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:GUNZIP (GCA_922984935.2.subset.fasta.gz)       [100%] 1 of 1 ✔
[47/a82249] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)            [100%] 1 of 1 ✔
[cc/0ef692] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:GOAT_TAXONSEARCH (GCA_922984935.2.subset)    [100%] 1 of 1 ✔
[fc/5ed408] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:BUSCO (GCA_922984935.2.subset)               [100%] 10 of 10 ✔
[6d/15ee31] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:TAR (GCA_922984935.2.subset)                 [100%] 1 of 1 ✔
[a1/a9bc34] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:EXTRACT_BUSCO_GENES (GCA_922984935.2.subset) [100%] 1 of 1 ✔
[-        ] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:DIAMOND_BLASTP                               -
[2e/2b480f] process > NFCORE_BLOBTOOLKIT:BLOBTOOLKIT:CUSTOM_DUMPSOFTWAREVERSIONS (1)                            [100%] 1 of 1 ✔
alxndrdiaz commented 1 year ago

Can you undo the deletes you made related to the coverage subworkflow? When you merge, we want both of the subworkflows working. Preferably, reset the deletions rather than adding them back; that way we maintain proper credit.

I could identify only one commit that deleted a file related to the coverage subworkflow:

79f2b27 - alxndrdiaz, Mon Nov 28 20:19:31 2022 -0600 : deleted unused module create_bed.nf

To revert this commit:

git revert 79f2b27

Output:

[busco_clean 45bdd63] Revert "deleted unused module create_bed.nf"
 1 file changed, 31 insertions(+)
 create mode 100644 modules/local/create_bed.nf
muffato commented 1 year ago

Hi @alxndrdiaz . Priyanka's request was to remove the commit, e.g. through a rebase, rather than reverting it, so that the files retain the original creation attributes (date and author) in the Git history.

Also, this PR deletes more than just create_bed.nf. These are missing, and the commit that deletes them needs to be removed:

* docs/parameters.md
* modules/nf-core/fastawindows/
* modules/nf-core/mosdepth
* modules/nf-core/samtools/view
* subworkflows/local/coverage_stats.nf
* modules/local/input_tol.nf

The PR also modifies many of the existing files by removing code, documentation, configuration, etc., that Zaynab pushed to dev recently. I may have missed some, but these are:

* workflows/blobtoolkit.nf
* nextflow_schema.json
* nextflow.config
* docs/README.md
* docs/output.md
* docs/usage.md
* conf/modules.config
* conf/test_full.config
* assets/email_template.html
* assets/email_template.txt
* assets/multiqc_config.yml
* CITATIONS.md
* .github/workflows/ci.yml

And some files were unnecessarily modified:

* modules/local/create_bed.nf

I think what happened is that your commits remove those files and add them back to whatever stage they were at when you did the rebase. Because of that, you are missing all the changes that Zaynab pushed recently. Can you please remove/rework those commits? There should be almost no file/line deletions in your PR at all. The diff should cleanly show additions: new files and new lines of code, typically modules, subworkflows, the main workflow, and modules.json. Maybe that will be very hard to do, and it will be easier to forget the existing commits and make new, clean ones on a branch off the current dev. But as it is, we can't merge the current branch (and add the extra work Zaynab is doing, and move on to the other two subworkflows).
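Dropping a commit from history, as opposed to reverting it, can be done with a non-interactive `git rebase --onto`. A demonstration in a throwaway repository (file names and commit messages are placeholders, not this PR's commits):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email tester@example.com
git config user.name tester
echo one > keep.txt
git add keep.txt && git commit -qm "add keep.txt"
git rm -q keep.txt && git commit -qm "bad: delete keep.txt"
echo two > other.txt
git add other.txt && git commit -qm "add other.txt"
# drop the deletion commit: replay HEAD~1..HEAD onto HEAD~2
git rebase -q --onto HEAD~2 HEAD~1
ls
```

After the rebase both files exist again and the "bad" commit is gone from the history, so the original creation commits keep their author and date. On a shared branch this rewrite must then be pushed with `--force-with-lease`.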

alxndrdiaz commented 1 year ago

Hi @alxndrdiaz . Priyanka's request was to remove the commit, e.g. through a rebase, rather than reverting it, so that the files retain the original creation attributes (date and author) in the Git history.

Also, this PR deletes more than just create_bed.nf. These are missing and the commit that deletes them needs to be removed:

* docs/parameters.md

* modules/nf-core/fastawindows/

* modules/nf-core/mosdepth

* modules/nf-core/samtools/view

* subworkflows/local/coverage_stats.nf

* modules/local/input_tol.nf

The PR also modifies a lot of the existing files by removing existing code, documentation, configuration, etc, that Zaynab pushed to dev recently. I may have missed some, but there are

* workflows/blobtoolkit.nf

* nextflow_schema.json

* nextflow.config

* docs/README.md

* docs/output.md

* docs/usage.md

* conf/modules.config

* conf/test_full.config

* assets/email_template.html

* assets/email_template.txt

* assets/multiqc_config.yml

* CITATIONS.md

* .github/workflows/ci.yml

And some files unnecessarily modified:

* modules/local/create_bed.nf

I think what happened is that your commits remove those files and add them back to whatever stage they were at when you did the rebase. Because of that, you are missing all the changes that Zaynab pushed recently. Can you please remove/rework those commits? There should be almost no file/line deletions in your PR at all. The diff should cleanly show additions: new files and new lines of code, typically modules, subworkflows, the main workflow, and modules.json. Maybe that will be very hard to do, and it will be easier to forget the existing commits and make new, clean ones on a branch off the current dev. But as it is, we can't merge the current branch (and add the extra work Zaynab is doing, and move on to the other two subworkflows).

Ok. Then I will create a new branch from dev and then will open a new PR with the new commits. Also I will remove the old busco_subworkflow branch until the PR is merged.

priyanka-surana commented 1 year ago

To speed up the process, instead of cherry picking the commits from here to new branch, just copy paste the changes and push them as a new commit. That will make sure there are no conflicts or unexpected issues. Message on Slack if you need help.

alxndrdiaz commented 1 year ago

To speed up the process, instead of cherry picking the commits from here to new branch, just copy paste the changes and push them as a new commit. That will make sure there are no conflicts or unexpected issues. Message on Slack if you need help.

There was one subworkflow output missing (a .tsv table from BUSCO); I fixed it, then created a new branch (busco_dev) from dev and added/modified the files required for the busco subworkflow. New pull request: Busco dev #37

muffato commented 1 year ago

Superseded by #37