jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Ways to speed up runtime #850

Open Vincev454 opened 3 weeks ago

Vincev454 commented 3 weeks ago

Dear authors,

First of all, thank you for creating SqueezeMeta.

In your opinion, what would be the best ways/options to speed up SqueezeMeta execution? I tested it on two paired fastq.gz files of around 5 GB each, and it took 24 h to complete the first 13 steps (it stopped before step 14, binning) on a server with 250 GB of memory and 64 cores. Is this the expected runtime in your experience?

Many thanks, Vince

fpusan commented 3 weeks ago

That feels slightly (but not crazily) slow, though it is hard to tell, since execution time depends heavily on metagenome complexity. I suspect the step taking the most time is the annotation against NCBI's nr database, which we perform in step 4. Could you confirm this by looking at the syslog output? Depending on what you want to achieve, there are ways to speed things up by sacrificing some accuracy in taxonomic assignment and/or restricting the analysis to MAGs instead of all the contigs.

Vincev454 commented 3 weeks ago

Thanks for your reply Fernando!

I guess restricting the analysis to MAGs would be appropriate, since I'm mainly interested in estimating MAG abundances in my samples.

Here is a simplified version of the syslog output:

```
Run started Sat Jun 8 16:04:28 2024 in sequential mode
1 metagenomes found: 091_V2
[21 seconds]: STEP1 -> RUNNING ASSEMBLY: 01.run_all_assemblies.pl (megahit)
  Number of contigs: 335240
[1 hours, 42 minutes, 38 seconds]: STEP2 -> RNA PREDICTION: 02.rnas.pl
[1 hours, 47 minutes, 26 seconds]: STEP3 -> ORF PREDICTION: 03.run_prodigal.pl
  ORFs predicted: 734247
[2 hours, 39 minutes, 52 seconds]: STEP4 -> HOMOLOGY SEARCHES: 04.rundiamond.pl
  Setting block size for Diamond
  AVAILABLE (free) RAM memory: 239.60 Gb
  We will set Diamond block size to 16 (Gb RAM/8, Max 16).
  You can override this setting using the -b option when starting the project,
  or changing the $blocksize variable in SqueezeMeta_conf.pl
[6 hours, 57 minutes, 16 seconds]: STEP5 -> HMMER/PFAM: 05.run_hmmer.pl
[15 hours, 44 minutes, 6 seconds]: STEP6 -> TAXONOMIC ASSIGNMENT: 06.lca.pl
[19 hours, 45 minutes, 34 seconds]: STEP7 -> FUNCTIONAL ASSIGNMENT: 07.fun3assign.pl
[19 hours, 46 minutes, 38 seconds]: STEP9 -> CONTIG TAX ASSIGNMENT: 09.summarycontigs3.pl
[19 hours, 49 minutes, 52 seconds]: STEP10 -> MAPPING READS: 10.mapsamples.pl
[22 hours, 6 minutes, 32 seconds]: STEP11 -> COUNTING TAX ABUNDANCES: 11.mcount.pl
[22 hours, 6 minutes, 48 seconds]: STEP12 -> COUNTING FUNCTION ABUNDANCES: 12.funcover.pl
[22 hours, 7 minutes, 26 seconds]: STEP13 -> CREATING GENE TABLE: 13.mergeannot2.pl
[22 hours, 10 minutes, 3 seconds]: STEP14 -> BINNING: 14.runbinning.pl
slurmstepd: error: JOB 30211795 ON nc20146 CANCELLED AT 2024-06-09T20:04:23 DUE TO TIME LIMIT
```

Best, Vince
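One way to confirm which step dominates is to diff the cumulative timestamps in the syslog, since each stamp marks when a step starts. A minimal sketch in shell, assuming the `[N hours, N minutes, N seconds]` stamp format shown in the excerpt above:

```shell
# Convert a cumulative "H hours, M minutes, S seconds" stamp to seconds.
to_seconds() {
  echo "$1" | awk -F', *' '{
    t = 0
    for (i = 1; i <= NF; i++) {
      split($i, a, " ")
      if      (a[2] ~ /^hour/)   t += a[1] * 3600
      else if (a[2] ~ /^minute/) t += a[1] * 60
      else                       t += a[1]
    }
    print t
  }'
}

# Per-step wall time = difference between consecutive start stamps.
step4=$(to_seconds "2 hours, 39 minutes, 52 seconds")   # STEP4 starts
step5=$(to_seconds "6 hours, 57 minutes, 16 seconds")   # STEP5 starts
step6=$(to_seconds "15 hours, 44 minutes, 6 seconds")   # STEP6 starts
echo "STEP4 (diamond): $(( (step5 - step4) / 60 )) min"
echo "STEP5 (pfam):    $(( (step6 - step5) / 60 )) min"
```

Applied to the stamps above, this puts step 4 (DIAMOND) at roughly 4.3 h and step 5 (Pfam/HMMER) at roughly 8.8 h, so in this particular run the Pfam search, not the nr annotation, is the single largest cost.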


fpusan commented 3 weeks ago

Where are your databases located? I suspect this may be a filesystem issue, with the databases stored somewhere with relatively high latency.

Vincev454 commented 3 weeks ago

Here is my config file (SqueezeMeta_conf.pl):

```perl
$version = "1.6.3, September 2023";
$mode = "sequential";
$date = "Sat Jun 8 16:04:29 2024";

$installpath = "/lustre07/scratch/thvar/magneto/miniforge3/envs/SqueezeMeta/SqueezeMeta";
$userdir = "/lustre07/scratch/thvar/squeeze/seqs";

#-- Project dir (calculated dynamically on execution, DO NOT MODIFY)
use File::Basename;
use Cwd 'abs_path';
$projectdir = abs_path(dirname(__FILE__));

#-- Generic paths
$databasepath = "/media/disk7/fer/SqueezeMeta/db";
$databasepath = "/lustre07/scratch/thvar/test/DB/db";  #-- Overrides the line above
$extdatapath = "$installpath/data";
$scriptdir = "$installpath/scripts";                   #-- Scripts directory

#-- Paths relative to the project
$projectname = "091_V2";
$datapath = "$projectdir/data";                        #-- Directory containing all datafiles
$resultpath = "$projectdir/results";                   #-- Directory for storing results
$extpath = "$projectdir/ext_tables";                   #-- Directory for storing tables for further analysis
$tempdir = "$projectdir/temp";                         #-- Temp directory
$interdir = "$projectdir/intermediate";                #-- Intermediate files directory
$binresultsdir = "$resultpath/bins";                   #-- Directory for bins
%dasdir = ("DASTool","$resultpath/DAS/$projectname_DASTool_bins");  #-- Directory for DASTool results

#-- Customizable binning and assembly
%assemblers = ("megahit","assembly_megahit.pl",
               "spades","assembly_spades.pl",
               "rnaspades","assembly_rnaspades.pl",
               "spades-base","assembly_spades-base.pl",
               "canu","assembly_canu.pl",
               "flye","assembly_flye.pl");
%binscripts = ("maxbin","$installpath/lib/SqueezeMeta/bin_maxbin.pl",
               "metabat2","$installpath/lib/SqueezeMeta/bin_metabat2.pl",
               "concoct","$installpath/lib/SqueezeMeta/bin_concoct.pl");

#-- Result files
$mappingfile = "$datapath/00.$projectname.samples";    #-- Mapping file (samples -> fastq)
$methodsfile = "$projectdir/methods.txt";              #-- File listing the methods used and their citation info
$syslogfile = "$projectdir/syslog";                    #-- Logging file
$contigsfna = "$resultpath/01.$projectname.fasta";     #-- Contig file from assembly
$contigslen = "$interdir/01.$projectname.lon";         #-- Length of each contig
$rnafile = "$resultpath/02.$projectname.rnas";         #-- RNAs from barrnap
$trnafile = "$resultpath/02.$projectname.trnas";       #-- tRNAs from aragorn
$gff_file = "$resultpath/03.$projectname.gff";         #-- gff file from prodigal
$aafile = "$resultpath/03.$projectname.faa";           #-- Aminoacid sequences for genes
$ntfile = "$resultpath/03.$projectname.fna";           #-- Nucleotide sequences for genes
$taxdiamond = "$interdir/04.$projectname.nr.diamond";       #-- Diamond result
$cogdiamond = "$interdir/04.$projectname.eggnog.diamond";   #-- Diamond result, COGs
$keggdiamond = "$interdir/04.$projectname.kegg.diamond";    #-- Diamond result, KEGG
$pfamhmmer = "$interdir/05.$projectname.pfam.hmm";     #-- Hmmer result for Pfam
$fun3tax = "$resultpath/06.$projectname.fun3.tax";     #-- Fun3 annotations, taxonomy
$fun3kegg = "$resultpath/07.$projectname.fun3.kegg";   #-- Fun3 annotations, KEGG
$fun3cog = "$resultpath/07.$projectname.fun3.cog";     #-- Fun3 annotations, COGs
$fun3pfam = "$resultpath/07.$projectname.fun3.pfam";   #-- Fun3 annotations, Pfams
$gff_file_blastx = "$resultpath/08.$projectname.gff";       #-- gff file from prodigal & blastx
$fun3tax_blastx = "$resultpath/08.$projectname.fun3.tax";   #-- Fun3 annotations prodigal & blastx, taxonomy
$fun3kegg_blastx = "$resultpath/08.$projectname.fun3.kegg"; #-- Fun3 annotations prodigal & blastx, KEGG
```


jtamames commented 2 weeks ago

Hello! The running time is not very high in my opinion; it falls within expectations. I assume the problem is that the cluster kills the process after 24 hours. I would first ask the system admin whether it is possible to extend the time limit. Otherwise, you can skip step 5 (Pfam searches), which is taking very long: just add the --nopfam flag when running SqueezeMeta. Best, J
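Both suggestions can be combined in the Slurm submission script. A sketch of a hypothetical job script, assuming the admins grant a 72 h limit; `-t` is SqueezeMeta's thread-count option, and the project/fastq paths are placeholders:

```shell
#!/bin/bash
#SBATCH --time=72:00:00     # request more than 24 h so step 14 (binning) can finish
#SBATCH --cpus-per-task=64
#SBATCH --mem=250G

# --nopfam skips step 5 (Pfam/HMMER searches), the longest step in the log above
SqueezeMeta.pl -m sequential -p 091_V2 -s samples.file -f /path/to/fastqs \
  -t 64 --nopfam
```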

fpusan commented 2 weeks ago

The databases are in /lustre07/scratch/thvar/test/DB/db which seems to be a scratch partition. You can ask your cluster administrators whether a different location with less latency is available.
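If moving the databases permanently is not an option, a common workaround is to stage them onto node-local disk at the start of each job. A sketch, assuming the cluster exposes node-local scratch as `$SLURM_TMPDIR` (the variable name is scheduler-specific; check with your admins), and reusing the database path from the config above:

```shell
# Copy the reference databases from the slow Lustre scratch to node-local disk,
# then re-point the SqueezeMeta installation at the local copy.
LOCAL_DB="${SLURM_TMPDIR:-/tmp}/squeezemeta_db"
mkdir -p "$LOCAL_DB"
rsync -a /lustre07/scratch/thvar/test/DB/db/ "$LOCAL_DB/"
configure_nodb.pl "$LOCAL_DB"
```

The copy costs time up front, but random-access-heavy steps such as the DIAMOND and HMMER searches usually benefit far more from low-latency local disk.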

If you wanted to restrict your analysis to bins, you would need to use our development version since that feature is not in an official release yet (ETA for this is sometime in Summer).

You could try the following:

```shell
# Install the dev version
conda create -n SqueezeMeta_dev -c conda-forge -c bioconda -c fpusan squeezemeta-dev==1.7.0.alpha8
conda activate SqueezeMeta_dev
configure_nodb.pl /path/to/your/databases

# Make a first run that includes only assembly and binning
SqueezeMeta.pl -m coassembly -p proj_bins_pre -s samples.file -f /path/to/raw/fastqs --onlybins
# Make a second run in which you annotate only the bins instead of all the contigs
SqueezeMeta.pl -m coassembly -p proj_bins -s samples.file -f /path/to/raw/fastqs -extbins proj_bins_pre/results/bins --nopfam
```
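For reference, `samples.file` (the argument to `-s`) is a tab-separated table mapping sample names to their fastq files, with `pair1`/`pair2` in the third column; a minimal example with hypothetical file names:

```
091_V2	091_V2_R1.fastq.gz	pair1
091_V2	091_V2_R2.fastq.gz	pair2
```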

We are also exploring ways of making DIAMOND run faster, but we still don't know how this would impact the quality of the annotations.

Vincev454 commented 2 weeks ago

Hi Javier and Fernando,

Thank you for your replies. I tested the pipeline with the --nopfam flag and it's already much better: my last run completed in about 10 hours, versus more than 24 h for the previous one.

Thanks for your help!

Best, Vince