Closed alxndrdiaz closed 1 year ago
Oh yeah, this looks so much better! Great! I can confirm that the test profile works for me on the farm, with just this small change below.
In terms of functionality of the subworkflow, do you think it does everything it needs to do?
Results from the busco_diamond subworkflow can be found in the work/ directory but they are not exported to the results directory. Adding the following lines to conf/modules.config:
withName: BUSCO_DIAMOND {
publishDir = [
path: { "${params.outdir}/blobtoolkit" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}
doesn't solve the problem, the following Nextflow warnings are raised:
WARN: There's no process matching config selector: CUSTOM_DUMPSOFTWAREVERSIONS
WARN: There's no process matching config selector: BUSCO_DIAMOND
WARN: There's no process matching config selector: FASTQC
I need to take a closer look at this, not sure which other files might be causing this error.
withName only works with process names. BUSCO_DIAMOND is a sub-workflow name. Something like this may work, I think:
withName: '.*.*:BUSCO_DIAMOND:.*'
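To illustrate why the bare name raised a "no process matching" warning while a scoped pattern can work, here is a minimal Python sketch. It assumes the selector is applied as a regular expression against the whole fully qualified process name (workflow scopes joined by colons). The qualified name below is hypothetical, and selector_matches is an illustrative helper, not part of Nextflow.

```python
import re

# Hypothetical fully qualified name: <workflow>:<subworkflow>:<process>
qualified_name = "BLOBTOOLKIT:BUSCO_DIAMOND:DIAMOND_BLASTP"


def selector_matches(selector: str, name: str) -> bool:
    """Return True if the selector regex matches the whole name (assumed behaviour)."""
    return re.fullmatch(selector, name) is not None


# The bare subworkflow name is not the name of any process.
print(selector_matches("BUSCO_DIAMOND", qualified_name))

# A pattern that allows any scope before and any process after the subworkflow.
print(selector_matches(".*:BUSCO_DIAMOND:.*", qualified_name))
```

Under these assumptions the first check fails and the second succeeds, which mirrors the warning seen above.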
It worked. Only the results from the TAR module, which are just renamed and compressed files from BUSCO, are excluded (they are required only by the EXTRACT_BUSCO_GENES module and are not used outside the subworkflow):
withName: '.*.*:BUSCO_DIAMOND:GOAT_TAXONSEARCH|BUSCO|EXTRACT_BUSCO_GENES|DIAMOND_BLASTP' {
publishDir = [
path: { "${params.outdir}/blobtoolkit/busco_diamond" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}
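One detail of the selector above may be worth checking: in a regular expression, | has the lowest precedence, so the scope prefix belongs only to the first alternative. The Python sketch below shows this with plain regex matching. The process names and the grouped variant are hypothetical illustrations, and whether the difference matters in practice depends on how Nextflow applies the selector, which is not verified here.

```python
import re

# Selector as written in the config above: the alternation is not grouped.
ungrouped = r".*.*:BUSCO_DIAMOND:GOAT_TAXONSEARCH|BUSCO|EXTRACT_BUSCO_GENES|DIAMOND_BLASTP"

# Hypothetical variant where the scope prefix applies to every process name.
grouped = r".*:BUSCO_DIAMOND:(GOAT_TAXONSEARCH|BUSCO|EXTRACT_BUSCO_GENES|DIAMOND_BLASTP)"

# Hypothetical names: one bare process name and one fully qualified name.
names = ["BUSCO", "BLOBTOOLKIT:BUSCO_DIAMOND:BUSCO"]

for name in names:
    hit_ungrouped = re.fullmatch(ungrouped, name) is not None
    hit_grouped = re.fullmatch(grouped, name) is not None
    print(f"{name}: ungrouped={hit_ungrouped} grouped={hit_grouped}")
```

With whole-string matching, the ungrouped pattern accepts the bare name BUSCO from any scope, while the grouped pattern accepts only names inside BUSCO_DIAMOND.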
When running:
nextflow run main.nf -profile test,singularity
The results folder should look something like this using tree -L 3 results/blobtoolkit/:
results/blobtoolkit/
├── busco_diamond
│ ├── GCA_922984935.2.subset-archaea_odb10-busco
│ │ ├── GCA_922984935.2.subset.fasta
│ │ └── logs
│ ├── GCA_922984935.2.subset-archaea_odb10-busco.batch_summary.txt
│ ├── GCA_922984935.2.subset-bacteria_odb10-busco
│ │ ├── GCA_922984935.2.subset.fasta
│ │ └── logs
│ ├── GCA_922984935.2.subset-bacteria_odb10-busco.batch_summary.txt
│ ├── GCA_922984935.2.subset_busco_genes.fasta
│ ├── GCA_922984935.2.subset-carnivora_odb10-busco
│ │ ├── GCA_922984935.2.subset.fasta
│ │ └── logs
│ ├── GCA_922984935.2.subset-carnivora_odb10-busco.batch_summary.txt
│ ├── GCA_922984935.2.subset-eukaryota_odb10-busco
│ │ ├── GCA_922984935.2.subset.fasta
│ │ └── logs
│ ├── GCA_922984935.2.subset-eukaryota_odb10-busco.batch_summary.txt
│ ├── GCA_922984935.2.subset-eutheria_odb10-busco
│ │ ├── GCA_922984935.2.subset.fasta
│ │ └── logs
│ ├── GCA_922984935.2.subset-eutheria_odb10-busco.batch_summary.txt
│ ├── GCA_922984935.2.subset-laurasiatheria_odb10-busco
│ │ ├── GCA_922984935.2.subset.fasta
│ │ └── logs
│ ├── GCA_922984935.2.subset-laurasiatheria_odb10-busco.batch_summary.txt
│ ├── GCA_922984935.2.subset-mammalia_odb10-busco
│ │ ├── GCA_922984935.2.subset.fasta
│ │ └── logs
│ ├── GCA_922984935.2.subset-mammalia_odb10-busco.batch_summary.txt
│ ├── GCA_922984935.2.subset-metazoa_odb10-busco
│ │ ├── GCA_922984935.2.subset.fasta
│ │ └── logs
│ ├── GCA_922984935.2.subset-metazoa_odb10-busco.batch_summary.txt
│ ├── GCA_922984935.2.subset-tetrapoda_odb10-busco
│ │ ├── GCA_922984935.2.subset.fasta
│ │ └── logs
│ ├── GCA_922984935.2.subset-tetrapoda_odb10-busco.batch_summary.txt
│ ├── GCA_922984935.2.subset.tsv
│ ├── GCA_922984935.2.subset-vertebrata_odb10-busco
│ │ ├── GCA_922984935.2.subset.fasta
│ │ └── logs
│ ├── GCA_922984935.2.subset-vertebrata_odb10-busco.batch_summary.txt
│ ├── short_summary.specific.archaea_odb10.GCA_922984935.2.subset.fasta.json
│ ├── short_summary.specific.archaea_odb10.GCA_922984935.2.subset.fasta.txt
│ ├── short_summary.specific.bacteria_odb10.GCA_922984935.2.subset.fasta.json
│ ├── short_summary.specific.bacteria_odb10.GCA_922984935.2.subset.fasta.txt
│ ├── short_summary.specific.carnivora_odb10.GCA_922984935.2.subset.fasta.json
│ ├── short_summary.specific.carnivora_odb10.GCA_922984935.2.subset.fasta.txt
│ ├── short_summary.specific.eukaryota_odb10.GCA_922984935.2.subset.fasta.json
│ ├── short_summary.specific.eukaryota_odb10.GCA_922984935.2.subset.fasta.txt
│ ├── short_summary.specific.eutheria_odb10.GCA_922984935.2.subset.fasta.json
│ ├── short_summary.specific.eutheria_odb10.GCA_922984935.2.subset.fasta.txt
│ ├── short_summary.specific.laurasiatheria_odb10.GCA_922984935.2.subset.fasta.json
│ ├── short_summary.specific.laurasiatheria_odb10.GCA_922984935.2.subset.fasta.txt
│ ├── short_summary.specific.mammalia_odb10.GCA_922984935.2.subset.fasta.json
│ ├── short_summary.specific.mammalia_odb10.GCA_922984935.2.subset.fasta.txt
│ ├── short_summary.specific.metazoa_odb10.GCA_922984935.2.subset.fasta.json
│ ├── short_summary.specific.metazoa_odb10.GCA_922984935.2.subset.fasta.txt
│ ├── short_summary.specific.tetrapoda_odb10.GCA_922984935.2.subset.fasta.json
│ ├── short_summary.specific.tetrapoda_odb10.GCA_922984935.2.subset.fasta.txt
│ ├── short_summary.specific.vertebrata_odb10.GCA_922984935.2.subset.fasta.json
│ └── short_summary.specific.vertebrata_odb10.GCA_922984935.2.subset.fasta.txt
├── mMelMel1_T1.mosdepth.global.dist.txt
├── mMelMel1_T1.mosdepth.region.dist.txt
├── mMelMel1_T1.mosdepth.summary.txt
├── mMelMel1_T1.per-base.bed.gz
├── mMelMel1_T1.per-base.bed.gz.csi
├── mMelMel1_T1.regions.bed.gz
└── mMelMel1_T1.regions.bed.gz.csi
31 directories, 39 files
I would not worry much about publishing results to the results folder. Once the pipeline is completed we will update this with the final structure. For now as long as the code works and creates the correct output in the work folder we can move forward.
Are there any issues with the current code besides linting? If not, let’s merge. A lot of downstream work depends on this.
Using nf-core lint, the following failed linting tests are reported:
╭─ [✗] 19 Pipeline Tests Failed ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ nextflow_config: Config variable (incorrectly) found: params.enable_conda │
│ nextflow_config: Config manifest.name did not begin with nf-core/: sanger-tol/blobtoolkit │
│ nextflow_config: Config variable manifest.homePage did not begin with https://github.com/nf-core/: https://github.com/sanger-tol/blobtoolkit │
│ files_unchanged: .gitattributes does not match the template │
│ files_unchanged: LICENSE does not match the template │
│ files_unchanged: .github/CONTRIBUTING.md does not match the template │
│ files_unchanged: .github/ISSUE_TEMPLATE/bug_report.yml does not match the template │
│ files_unchanged: .github/ISSUE_TEMPLATE/feature_request.yml does not match the template │
│ files_unchanged: .github/PULL_REQUEST_TEMPLATE.md does not match the template │
│ files_unchanged: .github/workflows/branch.yml does not match the template │
│ files_unchanged: .github/workflows/linting_comment.yml does not match the template │
│ files_unchanged: .github/workflows/linting.yml does not match the template │
│ files_unchanged: assets/email_template.txt does not match the template │
│ files_unchanged: assets/sendmail_template.txt does not match the template │
│ files_unchanged: docs/README.md does not match the template │
│ files_unchanged: lib/NfcoreSchema.groovy does not match the template │
│ files_unchanged: lib/NfcoreTemplate.groovy does not match the template │
│ files_unchanged: .prettierignore does not match the template │
│ multiqc_config: 'assets/multiqc_config.yml' does not contain a matching 'report_comment'.
The test using conf/test.config runs as expected and the output files are exported to the results folder.
Hi, it looks like there's a problem with the EXTRACT_BUSCO_GENES module. I ran the pipeline with the full BUSCO lineage datasets and DIAMOND_BLASTP still isn't running, as the FASTA file from EXTRACT_BUSCO_GENES is empty. I've looked at the BUSCO results for the lineage eukaryota_odb10 and there are BUSCO hits, so this file shouldn't be empty. It looks like the original command is looking for files ending in .faa, but these are nested within another TAR archive.
/lustre/scratch123/tol/teams/tolit/users/zb3/blobtoolkit/work/49/62b3e7b8e51136b3ffa55ac66661e8/eukaryota_odb10/busco_sequences$ ls
fragmented_busco_sequences.tar.gz multi_copy_busco_sequences.tar.gz single_copy_busco_sequences.tar.gz
It seems only the single_copy_busco_sequences.tar.gz file contains .faa files:
tar -tf single_copy_busco_sequences.tar.gz
Output:
single_copy_busco_sequences/
single_copy_busco_sequences/939345at2759.faa
single_copy_busco_sequences/939345at2759.fna
single_copy_busco_sequences/939345at2759.gff
Also, the Python script you mentioned looks into each .tar.gz and searches for all ".faa" files inside (but it would be a good idea to confirm this). However, as you mentioned, there is at least one .faa in this case, so the output FASTA file with extracted genes should contain this sequence. The TAR module prepares the input for the EXTRACT_BUSCO_GENES module and includes a .tar.gz compression step, so it is possible that the issue is in that module instead. I also used the --tar flag when running busco, which compresses some of these folders in the busco output. I need to check how these folders are being compressed and see if I can fix the issue.
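As a way to check the behaviour described above, here is a minimal Python sketch that looks inside each .tar.gz and concatenates every .faa member into one FASTA file. It is not the script used by EXTRACT_BUSCO_GENES; the function name collect_faa and its arguments are hypothetical, and it only uses the standard tarfile module.

```python
import tarfile


def collect_faa(archive_paths, output_fasta):
    """Concatenate every .faa member of the given .tar.gz archives.

    Returns the number of .faa files written, so an empty result is easy to spot.
    """
    written = 0
    with open(output_fasta, "wb") as out:
        for path in archive_paths:
            with tarfile.open(path, "r:gz") as archive:
                for member in archive:
                    # Skip directories and the .fna / .gff files.
                    if not (member.isfile() and member.name.endswith(".faa")):
                        continue
                    data = archive.extractfile(member).read()
                    out.write(data)
                    # Keep records separated if a file lacks a final newline.
                    if not data.endswith(b"\n"):
                        out.write(b"\n")
                    written += 1
    return written
```

Calling it on the three archives listed in busco_sequences and getting 0 back would point at the archives themselves rather than at the extraction step.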
The archives single_copy_busco_sequences.tar.gz & co come from the --tar option we asked you to add to Busco (in conf/modules.config). Didn't realise it would cause some trouble down the line. In order to get it to work, feel free to remove the --tar option, though this alone may not fix the issue.
@zb32 Hi. I fixed the issue you found. When running the test there should be the following file containing the diamond blastp hits: results/blobtoolkit/busco_diamond/GCA_922984935.2.subset.txt. The content of this file looks like this:
OV277441.1:691847-695889=939345at2759=single 9838 979 OV277441.1:691847-695889=939345at2759=single tr|A0A5N4CFV3|A0A5N4CFV3_CAMDR 64.6 867 124 8 1 815 1 736 0.0 979
Which is the expected output (columns: "qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore").
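To read the hit above against its column list, a small Python sketch can pair each value with its column name. The helper label_hit is hypothetical. Because qseqid and bitscore each appear twice in the column list, the pairs are kept in an ordered list rather than a dictionary, and splitting on whitespace is an assumption about how the file is delimited.

```python
# Column order as given in the comment above.
COLUMNS = (
    "qseqid staxids bitscore qseqid sseqid pident length "
    "mismatch gapopen qstart qend sstart send evalue bitscore"
).split()


def label_hit(line):
    """Pair each whitespace-separated field of a hit line with its column name."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return list(zip(COLUMNS, values))


hit = (
    "OV277441.1:691847-695889=939345at2759=single 9838 979 "
    "OV277441.1:691847-695889=939345at2759=single "
    "tr|A0A5N4CFV3|A0A5N4CFV3_CAMDR 64.6 867 124 8 1 815 1 736 0.0 979"
)

for name, value in label_hit(hit):
    print(f"{name}\t{value}")
```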
@muffato @priyanka-surana I was not sure about merging. If you have any comments or issues that should be fixed, please let me know.
Thank you @alxndrdiaz! I can confirm that the Busco hit makes its way to Diamond on the unit test.
I've started a full test on gfLaeSulp1.1 (and had to make a few changes, which I have added to this branch). It's a small genome, so hopefully it shouldn't take too long. I'll talk to Zaynab tomorrow morning, but I think it will be OK to merge 🤞🏼
Your subworkflow actually already completed on the full test. 326 Busco genes recovered across the three domains, and 280 Diamond hits. It looks fine to me 👍🏼
PR checklist
nf-core lint
nextflow run . -profile test,docker --outdir <OUTDIR>
docs/usage.md is updated.
docs/output.md is updated.
CHANGELOG.md is updated.
README.md is updated (including new tool citations and authors/contributors).