sanger-tol / blobtoolkit

Nextflow DSL2 pipeline to generate data for a BlobToolKit analysis. This workflow is part of the Tree of Life production suite.
https://pipelines.tol.sanger.ac.uk/blobtoolkit
MIT License

Problems recognizing the BUSCO databases with nextflow BTK pipeline v0.6.0 #122

Open gitcruz opened 3 days ago

gitcruz commented 3 days ago

Description of the bug

Dear developers,

I downloaded and installed the pipeline v0.6.0.

As described in the usage documentation, I downloaded the complete BUSCO v5 databases and untarred them. Because I kept getting an error from BUSCO, I then also decompressed the refseq_db.faa.gz files for all the databases. However, the error persists and looks like this:

2024-11-14 12:32:41 ERROR: Unable to run BUSCO in offline mode. Dataset /scratch_tmp/32318106/nxf.LMCX5N46Un/lineages/lineages/viridiplantae_odb10 does not exist.
mv: cannot stat 'tnRamLact8_Nhpy_mq10-viridiplantae_odb10-busco//short_summary..json': No such file or directory
mv: cannot stat 'tnRamLact8_Nhpy_mq10-viridiplantae_odb10-busco//short_summary..txt': No such file or directory

Work dir: /scratch_isilon/groups/assembly/data/projects/BGE/tnRamLact/assembly/curation/nextdenovo.hypo1.purged.yahs_mq10/1_blobtoolkit/blobtoolkitnextflow/work/e3/910c9ec4ebbab08e742510c3a50ee8

I don't understand why it is not finding the BUSCO databases. All of them are stored here: /scratch_isilon/groups/assembly/data/databases/BUSCO_2024_11/v5/data/lineages/

This is my nextflow command:

nextflow run /software/assembly/pipelines/nf-core-pipelines/blobtoolkit_sanger-tol/blobtoolkit-0.6.0/main.nf \
    -c /software/assembly/pipelines/nf-core-pipelines/cluster_config/cnag_nextflow_queue.config \
    -profile singularity \
    --input tnRamLact8_samplesheet_s3.csv \
    --outdir out \
    --fasta tnRamLact8_Nhpy_mq10.fasta \
    --taxon 947578 \
    --align true \
    --taxdump /scratch_isilon/groups/assembly/data/databases/taxdump_2024_10_01 \
    --blastp /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd \
    --blastx /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd \
    --blastn /scratch_isilon/groups/assembly/data/databases/nt_2024_10_03 \
    --busco /scratch_isilon/groups/assembly/data/databases/BUSCO_2024_11/v5/data/lineages/ \
    --busco_lineages metazoa_odb10,viridiplantae_odb10,fungi_odb10,apicomplexa_odb10,euglenozoa_odb10,diptera_odb10,alphaproteobacteria_odb10,mycoplasmatales_odb10,proteobacteria_odb10,nematoda_odb10,rickettsiales_odb10

I am attaching the full log and the sbatch command so you can check everything. I would really appreciate it if you could help me get past this error and get the pipeline running.

Thanks.

Attachments: btk_v0.6.0_nextflow.log, run_blobtoolkit_v060_on_tnRamLact8.sbatch.txt

Command used and terminal output

No response

Relevant files

No response

System information

No response

muffato commented 2 days ago

Hi @gitcruz. Can you try with /scratch_isilon/groups/assembly/data/databases/BUSCO_2024_11/v5/data/, i.e. without the trailing lineages?

The way we run the pipeline, --busco gets the path to a directory that contains a lineages subdirectory, like this:

├── information
├── lineages
│   ├── acidobacteria_odb10
│   │   ├── hmms
│   │   └── info
│   ├── aconoidasida_odb10
│   │   ├── hmms
│   │   ├── info
│   │   └── prfl
│   (...)
│   ├── viridiplantae_odb10
│   │   ├── hmms
│   │   ├── info
│   │   └── prfl
│   └── xanthomonadales_odb10
│       ├── hmms
│       └── info
└── placement_files

In our copy, all the refseq_db.faa.gz files have already been decompressed (as you did). I should mention that in the documentation.
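The layout above can be checked before launching a run. The following is only a minimal sketch, assuming that --busco must point at the directory that contains lineages/ (as described above); the function name check_busco_dir is hypothetical and is not part of the pipeline.

```python
from pathlib import Path


def check_busco_dir(busco_dir, lineages):
    """Return a list of problems with a --busco directory (empty if none).

    Assumes the layout <busco_dir>/lineages/<lineage_name>/ shown above.
    This helper is illustrative only; it is not part of the pipeline.
    """
    root = Path(busco_dir)
    lineage_root = root / "lineages"
    problems = []

    if not lineage_root.is_dir():
        problems.append(f"{root} has no 'lineages' subdirectory")
        # Likely mistake: --busco points at the lineages directory itself.
        if root.name == "lineages" and any(root.glob("*_odb10")):
            problems.append(f"try the parent directory instead: {root.parent}")
        return problems

    # Every requested lineage should be a directory under lineages/.
    for name in lineages:
        if not (lineage_root / name).is_dir():
            problems.append(f"missing lineage: {lineage_root / name}")
    return problems
```

Called with the path from the failing command (the one ending in lineages/) and the lineage names passed to --busco_lineages, this sketch would report the missing subdirectory and suggest the parent directory.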

Matthieu

gitcruz commented 2 days ago

Thanks for the quick response Matthieu,

I'm trying it that way. So far the Nextflow job has been running for more than 2 hours.

Regarding the database paths, is it necessary to add the final slash or not (e.g. --blastn /scratch_isilon/groups/assembly/data/databases/nt_2024_10_03/)?

Also, I don't understand the guide's examples for the Diamond databases. They show two: --blastp /path/to/buscogenes.dmnd --blastx /path/to/buscoregions.dmnd

I only have one, which I use for both: --blastp /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd --blastx /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd

Is this correct? I followed the guide and built just one Diamond database.

Regards, Fernando

muffato commented 1 day ago

> Regarding the database paths, is it necessary to add the final slash or not (e.g. --blastn /scratch_isilon/groups/assembly/data/databases/nt_2024_10_03/)?

I think it should work the same with and without.
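As a small illustration of why the trailing slash is usually harmless: at the level of path handling, the two spellings name the same directory. The sketch below uses Python's pathlib with a shortened, hypothetical path. It only shows path-level equivalence; it does not confirm how the pipeline itself treats the value.

```python
from pathlib import PurePosixPath

# Hypothetical, shortened database path, written with and without the slash.
with_slash = PurePosixPath("/data/databases/nt_2024_10_03/")
without_slash = PurePosixPath("/data/databases/nt_2024_10_03")

# pathlib drops the trailing slash when parsing, so both compare equal.
same = with_slash == without_slash
print(same)
```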

> Also, I don't understand the guide's examples for the Diamond databases. They show two: --blastp /path/to/buscogenes.dmnd --blastx /path/to/buscoregions.dmnd
>
> I only have one, which I use for both: --blastp /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd --blastx /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd
>
> Is this correct? I followed the guide and built just one Diamond database.

Yes, it's correct. These are separate parameters in case people want to use different databases. I could imagine someone optimising the pipeline by using a more restricted database for the blastp search (which runs first) so that the blastp jobs finish sooner, while using the complete database for the blastx search (which runs afterwards). In practice, for all our assembled genomes, we use the same complete database for both.
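To make the two options concrete, here is a minimal sketch of a launcher-side helper that reuses one database for both searches unless a second one is given. The function name diamond_db_args and the example paths are hypothetical; only the --blastp and --blastx parameter names come from the pipeline.

```python
def diamond_db_args(blastp_db, blastx_db=None):
    """Build the --blastp/--blastx arguments for the nextflow command line.

    If blastx_db is not given, the blastp database is reused for both
    searches, which matches the single-database setup discussed above.
    """
    if blastx_db is None:
        blastx_db = blastp_db
    return ["--blastp", str(blastp_db), "--blastx", str(blastx_db)]


# Single complete database for both searches (hypothetical path).
shared = diamond_db_args("/path/to/reference_proteomes.dmnd")

# A smaller database for blastp, the complete one for blastx (hypothetical paths).
split = diamond_db_args("/path/to/restricted.dmnd", "/path/to/reference_proteomes.dmnd")
```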

Best, Matthieu