sanger-tol / blobtoolkit

Nextflow pipeline for BlobToolKit for Sanger ToL production suite
https://pipelines.tol.sanger.ac.uk/blobtoolkit
MIT License

Busco dev #37

Closed alxndrdiaz closed 1 year ago

alxndrdiaz commented 1 year ago

PR checklist

alxndrdiaz commented 1 year ago

Oh yeah, this looks so much better! Great! I can confirm that the test profile works for me on the farm, with just this small change below.

In terms of functionality of the subworkflow, do you think it does everything it needs to do?

Results from the busco_diamond subworkflow can be found in the work/ directory, but they are not exported to the results directory.

Adding the following lines to conf/modules.config:

 withName: BUSCO_DIAMOND {
        publishDir = [
            path: { "${params.outdir}/blobtoolkit" },
            mode: params.publish_dir_mode,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
        ]
    }

doesn't solve the problem; the following Nextflow warnings are raised:

WARN: There's no process matching config selector: CUSTOM_DUMPSOFTWAREVERSIONS
WARN: There's no process matching config selector: BUSCO_DIAMOND
WARN: There's no process matching config selector: FASTQC

I need to take a closer look at this; I'm not sure which other files might be causing these warnings.

muffato commented 1 year ago

withName only works with process names. BUSCO_DIAMOND is a sub-workflow name. Something like this may work, I think:

withName: '.*.*:BUSCO_DIAMOND:.*'

alxndrdiaz commented 1 year ago

It worked. Only the results from the TAR module are excluded: they are just renamed and compressed copies of BUSCO files, which are required only by the EXTRACT_BUSCO_GENES module and are not used outside the subworkflow:

 withName: '.*.*:BUSCO_DIAMOND:GOAT_TAXONSEARCH|BUSCO|EXTRACT_BUSCO_GENES|DIAMOND_BLASTP' {
        publishDir = [
            path: { "${params.outdir}/blobtoolkit/busco_diamond" },
            mode: params.publish_dir_mode,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
        ]
        }
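
A note on how this selector groups, sketched in Python rather than Nextflow. In a regular expression, `|` binds more loosely than concatenation, so the posted pattern reads as four separate alternatives, and only GOAT_TAXONSEARCH carries the `BUSCO_DIAMOND` prefix. The sketch assumes that a selector must match a whole process name and that it is tried against both the simple and the fully qualified name; the `PIPELINE:` prefix, the `OTHER_SUBWORKFLOW` name and the `GROUPED` variant are hypothetical and are not taken from this pipeline.

```python
import re

# The selector exactly as posted in the comment above.
POSTED = r".*.*:BUSCO_DIAMOND:GOAT_TAXONSEARCH|BUSCO|EXTRACT_BUSCO_GENES|DIAMOND_BLASTP"

# Hypothetical variant: the alternation is grouped under the sub-workflow prefix.
GROUPED = r".*:BUSCO_DIAMOND:(GOAT_TAXONSEARCH|BUSCO|EXTRACT_BUSCO_GENES|DIAMOND_BLASTP)"

# Hypothetical fully qualified process names; the real prefix depends on the pipeline.
NAMES = [
    "PIPELINE:BUSCO_DIAMOND:GOAT_TAXONSEARCH",
    "PIPELINE:BUSCO_DIAMOND:BUSCO",
    "PIPELINE:BUSCO_DIAMOND:TAR",
    "PIPELINE:OTHER_SUBWORKFLOW:BUSCO",
]


def selected(pattern, name):
    """Return True if the pattern matches the whole simple or whole qualified name."""
    simple = name.split(":")[-1]
    return any(re.fullmatch(pattern, candidate) for candidate in (simple, name))


def report(pattern):
    """Map each example process name to whether the selector picks it up."""
    return {name: selected(pattern, name) for name in NAMES}


if __name__ == "__main__":
    for label, pattern in (("posted", POSTED), ("grouped", GROUPED)):
        print(label)
        for name, hit in report(pattern).items():
            print(f"  {name}: {hit}")
```

Under those assumptions, both patterns skip TAR as intended, but the posted one would also pick up a process called BUSCO in any other sub-workflow, while the grouped one stays scoped to BUSCO_DIAMOND.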

When running:

nextflow run main.nf -profile test,singularity

The results folder should then look something like this (output of tree -L 3 results/blobtoolkit/):

results/blobtoolkit/
├── busco_diamond
│   ├── GCA_922984935.2.subset-archaea_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-archaea_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-bacteria_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-bacteria_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset_busco_genes.fasta
│   ├── GCA_922984935.2.subset-carnivora_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-carnivora_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-eukaryota_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-eukaryota_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-eutheria_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-eutheria_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-laurasiatheria_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-laurasiatheria_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-mammalia_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-mammalia_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-metazoa_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-metazoa_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-tetrapoda_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-tetrapoda_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset.tsv
│   ├── GCA_922984935.2.subset-vertebrata_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-vertebrata_odb10-busco.batch_summary.txt
│   ├── short_summary.specific.archaea_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.archaea_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.bacteria_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.bacteria_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.carnivora_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.carnivora_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.eukaryota_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.eukaryota_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.eutheria_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.eutheria_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.laurasiatheria_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.laurasiatheria_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.mammalia_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.mammalia_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.metazoa_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.metazoa_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.tetrapoda_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.tetrapoda_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.vertebrata_odb10.GCA_922984935.2.subset.fasta.json
│   └── short_summary.specific.vertebrata_odb10.GCA_922984935.2.subset.fasta.txt
├── mMelMel1_T1.mosdepth.global.dist.txt
├── mMelMel1_T1.mosdepth.region.dist.txt
├── mMelMel1_T1.mosdepth.summary.txt
├── mMelMel1_T1.per-base.bed.gz
├── mMelMel1_T1.per-base.bed.gz.csi
├── mMelMel1_T1.regions.bed.gz
└── mMelMel1_T1.regions.bed.gz.csi

31 directories, 39 files

priyanka-surana commented 1 year ago

I would not worry much about publishing results to the results folder. Once the pipeline is completed we will update this with the final structure. For now as long as the code works and creates the correct output in the work folder we can move forward.

priyanka-surana commented 1 year ago

Are there any issues with the current code besides linting? If not, let’s merge. A lot of downstream work depends on this.

alxndrdiaz commented 1 year ago

Using nf-core lint, the following failed linting tests are reported:

[✗] 19 Pipeline Tests Failed

nextflow_config: Config variable (incorrectly) found: params.enable_conda
nextflow_config: Config manifest.name did not begin with nf-core/: sanger-tol/blobtoolkit
nextflow_config: Config variable manifest.homePage did not begin with https://github.com/nf-core/: https://github.com/sanger-tol/blobtoolkit
files_unchanged: .gitattributes does not match the template
files_unchanged: LICENSE does not match the template
files_unchanged: .github/CONTRIBUTING.md does not match the template
files_unchanged: .github/ISSUE_TEMPLATE/bug_report.yml does not match the template
files_unchanged: .github/ISSUE_TEMPLATE/feature_request.yml does not match the template
files_unchanged: .github/PULL_REQUEST_TEMPLATE.md does not match the template
files_unchanged: .github/workflows/branch.yml does not match the template
files_unchanged: .github/workflows/linting_comment.yml does not match the template
files_unchanged: .github/workflows/linting.yml does not match the template
files_unchanged: assets/email_template.txt does not match the template
files_unchanged: assets/sendmail_template.txt does not match the template
files_unchanged: docs/README.md does not match the template
files_unchanged: lib/NfcoreSchema.groovy does not match the template
files_unchanged: lib/NfcoreTemplate.groovy does not match the template
files_unchanged: .prettierignore does not match the template
multiqc_config: 'assets/multiqc_config.yml' does not contain a matching 'report_comment'.

The test using conf/test.config runs as expected and the output files are exported to the results folder.

zb32 commented 1 year ago

Hi, it looks like there's a problem with the EXTRACT_BUSCO_GENES module. I ran the pipeline with the full BUSCO lineage datasets and DIAMOND_BLASTP still isn't running, as the FASTA file from EXTRACT_BUSCO_GENES is empty. I've looked at the BUSCO results for the lineage eukaryota_odb10 and there are BUSCO hits, so this file shouldn't be empty. It looks like the original command is looking for files ending in .faa, but these are nested within another TAR archive.

/lustre/scratch123/tol/teams/tolit/users/zb3/blobtoolkit/work/49/62b3e7b8e51136b3ffa55ac66661e8/eukaryota_odb10/busco_sequences$ ls
fragmented_busco_sequences.tar.gz  multi_copy_busco_sequences.tar.gz  single_copy_busco_sequences.tar.gz

alxndrdiaz commented 1 year ago

It seems only the single_copy_busco_sequences.tar.gz file contains .faa files:

tar -tf single_copy_busco_sequences.tar.gz

Output:

single_copy_busco_sequences/
single_copy_busco_sequences/939345at2759.faa
single_copy_busco_sequences/939345at2759.fna
single_copy_busco_sequences/939345at2759.gff

Also, the Python script you mentioned looks into each .tar.gz and searches for all ".faa" files inside (though it would be a good idea to confirm this). As you mentioned, there is at least one .faa file in this case, so the output FASTA file with the extracted genes should contain this sequence. The TAR module prepares the input for the EXTRACT_BUSCO_GENES module and includes a .tar.gz compression step, so the issue may be in that module instead. I also ran busco with the --tar flag, which compresses some of these folders in the BUSCO output. I need to check how these folders are being compressed and see if I can fix the issue.
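
To make the expected behaviour concrete, here is a minimal sketch of reading .faa members out of every .tar.gz in a busco_sequences directory. It is not the pipeline's actual script (which still needs checking, as noted above); the function name `read_faa_from_archives` is hypothetical, and the directory layout is assumed from the listings in this thread.

```python
import tarfile
from pathlib import Path


def read_faa_from_archives(busco_sequences_dir):
    """Concatenate the text of every .faa member in the directory's *.tar.gz files."""
    records = []
    # Sort the archives so the output order is reproducible between runs.
    for archive in sorted(Path(busco_sequences_dir).glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            for member in tar.getmembers():
                # Skip directories and the .fna/.gff files stored next to the proteins.
                if member.isfile() and member.name.endswith(".faa"):
                    records.append(tar.extractfile(member).read().decode())
    return "".join(records)


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py path/to/busco_sequences > busco_genes.fasta
    sys.stdout.write(read_faa_from_archives(sys.argv[1]))
```

If a function like this returns an empty string for a lineage that has BUSCO hits, the archives themselves are the first thing to inspect, for example with tar -tf as above.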

muffato commented 1 year ago

The archives single_copy_busco_sequences.tar.gz & co come from the --tar option we asked you to add to Busco (in conf/modules.config). I didn't realise it would cause trouble down the line. To get it to work, feel free to remove the --tar option, though this alone may not fix the issue.

alxndrdiaz commented 1 year ago

@zb32 Hi. I fixed the issue you found. When running the test, there should now be a file containing the diamond blastp hits: results/blobtoolkit/busco_diamond/GCA_922984935.2.subset.txt. The content of this file looks like this:

OV277441.1:691847-695889=939345at2759=single    9838    979 OV277441.1:691847-695889=939345at2759=single    tr|A0A5N4CFV3|A0A5N4CFV3_CAMDR  64.6    867 124 8   1   815 1   736 0.0 979

Which is the expected output (columns: "qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore").
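
As an illustration of how the row above lines up with those column names, here is a small Python sketch. It keeps the columns as ordered pairs because qseqid and bitscore each appear twice in the list. The `region=busco_id=status` layout of the query ID is inferred from this single example row, and both function names are hypothetical.

```python
# Column names as listed in the comment above (qseqid and bitscore appear twice).
COLUMNS = (
    "qseqid staxids bitscore qseqid sseqid pident length mismatch "
    "gapopen qstart qend sstart send evalue bitscore"
).split()

# The example hit from the comment above.
ROW = (
    "OV277441.1:691847-695889=939345at2759=single    9838    979 "
    "OV277441.1:691847-695889=939345at2759=single    "
    "tr|A0A5N4CFV3|A0A5N4CFV3_CAMDR  64.6    867 124 8   1   815 1   736 0.0 979"
)


def parse_hit(line):
    """Pair each whitespace-separated value with its column name, in order."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return list(zip(COLUMNS, values))


def split_query_id(qseqid):
    """Split a query ID of the assumed form 'region=busco_id=status'."""
    region, busco_id, status = qseqid.split("=")
    return {"region": region, "busco_id": busco_id, "status": status}


if __name__ == "__main__":
    hit = parse_hit(ROW)
    for name, value in hit:
        print(f"{name}\t{value}")
    print(split_query_id(hit[0][1]))
```

Splitting on any whitespace is enough for the row as pasted here; for a real tab-separated output file, splitting on tabs would be safer because a field may be empty.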

alxndrdiaz commented 1 year ago

@muffato @priyanka-surana I was not sure about merging. If you have any comments or issues that should be fixed, please let me know.

muffato commented 1 year ago

Thank you @alxndrdiaz! I can confirm that the Busco hit makes its way to Diamond on the unit test.

I've started a full test on gfLaeSulp1.1 (and had to make a few changes, which I have added to this branch). It's a small genome, so hopefully it shouldn't take too long. I'll talk to Zaynab tomorrow morning, but I think it will be OK to merge 🤞🏼

muffato commented 1 year ago

Your subworkflow actually already completed on the full test. 326 Busco genes recovered across the three domains, and 280 Diamond hits. It looks fine by me 👍🏼