shandley / hecatomb

hecatomb is a virome analysis pipeline for analysis of Illumina sequence data
MIT License

out-of-memory error during population_assembly.flye, STAGE: repeat #99

Closed mihinduk closed 5 months ago

mihinduk commented 8 months ago

Hi Mike, I have a dataset of 288 paired fastqs (of which 4 are mock controls). For the samples that are not mock controls, the number of input reads in the fastq files ranges from 265,771 to 197,238,830. I am having trouble getting this through the population assembly step due to memory issues. I am not sure whether this is due to the number of input contigs (637,144) or the large size of some contigs (I have 3 contigs larger than 1,000,000 bp: 1,029,230; 1,147,109; and 1,150,404).

I have tried increasing both memory and runtime and wonder if you can give me further advice. I have attached my current contig.yaml files (saved as text files).

Here is the flye log: more /scratch/sahlab/kathie/AHandley/2023_10_05_hecatomb_out/stderr/population_assembly.flye.log

[2023-10-23 10:05:04] INFO: Starting Flye 2.7.1-b1590
[2023-10-23 10:05:04] INFO: >>>STAGE: configure
[2023-10-23 10:05:04] INFO: Configuring run
[2023-10-23 10:05:18] INFO: Total read length: 3322349930
[2023-10-23 10:05:18] INFO: Input genome size: 1000000000
[2023-10-23 10:05:18] INFO: Estimated coverage: 3
[2023-10-23 10:05:18] WARNING: Expected read coverage is 3, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly?
[2023-10-23 10:05:18] INFO: Reads N50/N90: 21048 / 1531
[2023-10-23 10:05:18] INFO: Minimum overlap set to 1000
[2023-10-23 10:05:18] INFO: Selected k-mer size: 31
[2023-10-23 10:05:18] INFO: >>>STAGE: assembly
[2023-10-23 10:05:18] INFO: Assembling disjointigs
[2023-10-23 10:05:18] INFO: Reading sequences
[2023-10-23 10:05:41] INFO: Generating solid k-mer index
[2023-10-23 10:10:16] INFO: Counting k-mers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2023-10-23 10:10:38] INFO: Counting k-mers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2023-10-23 10:11:36] INFO: Filling index table
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2023-10-23 10:13:25] INFO: Extending reads
[2023-10-23 10:13:37] INFO: Overlap-based coverage: 4
[2023-10-23 10:13:37] INFO: Median overlap divergence: 0.00802146
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2023-10-23 10:56:41] INFO: Added 352634 singleton reads
[2023-10-23 10:56:42] INFO: Assembled 370904 disjointigs
[2023-10-23 10:56:44] INFO: Generating sequence
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2023-10-23 11:00:11] INFO: >>>STAGE: repeat
[2023-10-23 11:00:11] INFO: Building and resolving repeat graph
[2023-10-23 11:00:11] INFO: Parsing disjointigs
[2023-10-23 11:00:34] INFO: Building repeat graph
0% [2023-10-23 11:17:20] ERROR: Looks like the system ran out of memory
[2023-10-23 11:17:20] ERROR: Command '['flye-modules', 'repeat', '--disjointigs', '/scratch/sahlab/kathie/AHandley/2023_10_05_hecatomb_out/processing/assembly/CONTIG_DICTIONARY/FLYE/00-assembly/draft_assembly.fasta', '--reads', '/scratch/sahlab/kathie/AHandley/2023_10_05_hecatomb_out/processing/assembly/CONTIG_DICTIONARY/all_sample_contigs.fasta', '--out-dir', '/scratch/sahlab/kathie/AHandley/2023_10_05_hecatomb_out/processing/assembly/CONTIG_DICTIONARY/FLYE/20-repeat', '--config', '/ref/sahlab/software/anaconda3/envs/hecatomb_v1.1.0/snakemake/conda/78b51fcc2d6f30f86bbeb3faca815897_/lib/python3.7/site-packages/flye/config/bin_cfg/asm_subasm.cfg', '--log', '/scratch/sahlab/kathie/AHandley/2023_10_05_hecatomb_out/processing/assembly/CONTIG_DICTIONARY/FLYE/flye.log', '--threads', '16', '--min-ovlp', '1000', '--kmer', '31']' died with <Signals.SIGKILL: 9>.
[2023-10-23 11:17:20] ERROR: Pipeline aborted

hecatomb.config.yaml.txt config.yaml.txt

Thank you for your help, Kathie
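
For anyone triaging a similar log, here is a minimal Python sketch that pulls out the last stage Flye started, whether an out-of-memory kill was reported, and the coverage implied by the logged totals. The function name `summarize_flye_log` and the regular expressions are illustrative assumptions based only on the log lines quoted above; they are not part of Flye or hecatomb. The coverage figure is simply total read length divided by the input genome size, which is consistent with the "Estimated coverage: 3" line in this log.

```python
import re

# Patterns are assumptions derived from the log lines quoted in this thread.
STAGE_RE = re.compile(r"INFO: >>>STAGE: (\S+)")
TOTAL_RE = re.compile(r"INFO: Total read length: (\d+)")
GENOME_RE = re.compile(r"INFO: Input genome size: (\d+)")


def summarize_flye_log(text):
    """Summarize a Flye log: last stage started, OOM flag, implied coverage."""
    summary = {"last_stage": None, "out_of_memory": False, "coverage": None}
    total = None
    genome = None
    for line in text.splitlines():
        stage = STAGE_RE.search(line)
        if stage:
            summary["last_stage"] = stage.group(1)
        total_match = TOTAL_RE.search(line)
        if total_match:
            total = int(total_match.group(1))
        genome_match = GENOME_RE.search(line)
        if genome_match:
            genome = int(genome_match.group(1))
        # Either message suggests the kernel or scheduler killed the process.
        if "ran out of memory" in line or "SIGKILL" in line:
            summary["out_of_memory"] = True
    if total is not None and genome:
        summary["coverage"] = total / genome
    return summary


# Excerpt of the log posted above.
SAMPLE = """\
[2023-10-23 10:05:04] INFO: >>>STAGE: configure
[2023-10-23 10:05:18] INFO: Total read length: 3322349930
[2023-10-23 10:05:18] INFO: Input genome size: 1000000000
[2023-10-23 10:05:18] INFO: >>>STAGE: assembly
[2023-10-23 11:00:11] INFO: >>>STAGE: repeat
0% [2023-10-23 11:17:20] ERROR: Looks like the system ran out of memory
[2023-10-23 11:17:20] ERROR: Pipeline aborted
"""

result = summarize_flye_log(SAMPLE)
print(result)
```

On this excerpt the summary points at the repeat stage, which matches the issue title: the job was killed while building the repeat graph, not during disjointig assembly.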

beardymcjohnface commented 8 months ago

Hi Kathie, unfortunately flye --subassemblies is a bit slow and memory-hungry, but I haven't found a good alternative yet :/

What is the max memory available on your nodes? I remember you have 24 CPUs per node, but your config is capping the memory at 100 GB for the job. We should also look into adding custom scheduler commands in the config so you can pass --exclusive for those jobs.
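
One rough way to answer the node-memory question is to run a short Python check on a compute node. This is a sketch under stated assumptions: it relies on `os.sysconf` names that are available on Linux, the helper names `node_memory_gb` and `cap_headroom_gb` are hypothetical, and the 100 GB cap is the figure mentioned above rather than a value read from any config file.

```python
import os


def node_memory_gb():
    """Return physical memory in GB, or None if the platform lacks these sysconf names."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        phys_pages = os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        return None
    return page_size * phys_pages / 1024**3


def cap_headroom_gb(total_gb, cap_gb):
    """Memory on the node that the job cap leaves unused (never negative)."""
    return max(total_gb - cap_gb, 0.0)


JOB_CAP_GB = 100  # cap mentioned in this thread; adjust to your own config

total = node_memory_gb()
if total is None:
    print("Could not determine physical memory on this platform")
else:
    print(f"Node memory: {total:.1f} GB")
    print(f"Unused above a {JOB_CAP_GB} GB cap: {cap_headroom_gb(total, JOB_CAP_GB):.1f} GB")
```

If the node reports much more memory than the cap, raising the cap for the assembly job is the first thing to try. Note that a scheduler may still enforce its own per-job limit regardless of what the node physically has.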

beardymcjohnface commented 5 months ago

I would also suggest trying --assembly cross, as megahit is pretty good with memory management.