jts / sga

de novo sequence assembler using string graphs
http://genome.cshlp.org/content/22/3/549

sga assemble reaches edge limit #132

Closed: johnomics closed this issue 7 years ago

johnomics commented 7 years ago

sga assemble crashed with this message:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
WARNING: [Edge limit reached for vertex when loading graph]

This is after the following steps, just trying to get a vanilla assembly before optimizing:

sga preprocess -p1 R1.trimmed.fastq.gz R2.trimmed.fastq.gz > reads.fastq
sga correct -t 64 -v --learn --discard reads.fastq
sga index -a ropebwt -t 64 reads.ec.fq
sga filter -t 64 -v reads.ec.fq
sga overlap -v -t 64 -m 75 reads.ec.filter.pass.fq

The asqg.gz file is 25 GB, and the assemble job is running on a machine with 256 GB of RAM.

Input reads are 150 bp paired-end, but from a strange library prep: pooled reads from 95 separate MDA and Nextera preps. Attached is the preqc report for the raw library and for the reads after sga filter: preqc_report.pdf

As the data set is odd and highly variable, maybe it won't assemble, but I'd appreciate tips on how to get the assembly step to complete. It may just be a case of increasing --max-edges, but if so, I'm a bit surprised that sga crashed rather than failing gracefully, so I thought it was worth reporting. Thanks.

jts commented 7 years ago

Hi John,

The warning is normal (for large genomes) and not something you have to worry about. The problem is the bad_alloc message, which suggests the process ran out of memory. I think the culprit here is the MDA library prep, which SGA is not prepared to deal with. Have you tried a single-cell assembler like SPAdes?

Jared

johnomics commented 7 years ago

Thanks for the reassurance. It looks like we have a memory configuration problem on our servers, as other programs are also producing memory errors (including SPAdes, but not just assemblers). I'll try sga again once we've figured that out. Best wishes, John
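
When std::bad_alloc appears on a machine that should have enough RAM, one thing worth ruling out is a per-process memory cap inherited from the shell that launches the job. The following is a minimal sketch under stated assumptions, not a diagnosis of the servers discussed here: it assumes a POSIX-style shell on Linux, and the function name report_vmem_limit is hypothetical. It only reads the shell's own `ulimit -v` value and does not inspect scheduler or cgroup limits.

```shell
#!/bin/sh
# Hypothetical helper: report the virtual-memory cap that child processes
# (such as an assembler run) would inherit from this shell.
report_vmem_limit() {
    limit_kb=$(ulimit -v)   # value is in kilobytes, or the word "unlimited"
    if [ "$limit_kb" = "unlimited" ]; then
        echo "virtual memory limit: unlimited"
    else
        # Integer division, so the figure is rounded down to whole GB.
        echo "virtual memory limit: $((limit_kb / 1024 / 1024)) GB"
    fi
}

report_vmem_limit
```

If this prints a figure well below the machine's physical RAM, the cap rather than the program itself may be what triggers the allocation failure. Limits imposed by a job scheduler or by cgroups would need to be checked separately.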