jts / sga

de novo sequence assembler using string graphs
http://genome.cshlp.org/content/22/3/549

sga index segfault with large values of -d #131

Open sjackman opened 8 years ago

sjackman commented 8 years ago

The command sga index -d 20000000 -t 64 hsapiens.preprocess.filter.pass.merged.fa segfaults. Reducing to -d 1000000 works. Is each BWT batch limited in size, perhaps to 2 or 4 billion nucleotides? With a mean sequence size of ~300 bp, -d 20000000 should correspond to a batch size of about 6 Gbp.
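
The arithmetic behind that question can be sketched as follows. This is only a rough estimate from the numbers in the report (the -d value and the approximate 300 bp mean); the 32-bit limits are the ones the question alludes to, not limits confirmed to exist in SGA.

```python
# Rough batch-size estimate from the numbers in the report.
reads_per_batch = 20_000_000   # value passed to -d
mean_read_length = 300         # bp, approximate mean sequence size

batch_bases = reads_per_batch * mean_read_length

# The "2 or 4 billion" limits in the question correspond to 32-bit integers.
# Whether SGA has such a limit is an assumption being asked about, not a fact.
INT32_MAX = 2**31 - 1
UINT32_MAX = 2**32 - 1

print(f"batch size: {batch_bases / 1e9:.1f} Gbp")
print(f"exceeds signed 32-bit range:   {batch_bases > INT32_MAX}")
print(f"exceeds unsigned 32-bit range: {batch_bases > UINT32_MAX}")
```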

sjackman commented 8 years ago

Can sga index -a ropebwt work with the output of sga fm-merge? The mean sequence size is 300 bp, and the largest sequence is 30,889 bp.

jts commented 8 years ago

Did you run out of memory with -d 20000000? Without -a ropebwt, a memory-inefficient algorithm is used. There is no 2 (or 4) billion nucleotide batch limit.

jts commented 8 years ago

Whether it is worth using -a ropebwt depends on the read length distribution. I suggest sticking with the recommended parameters (not ropebwt, -d X). It shouldn't take very long.

sjackman commented 8 years ago

The fm-merge FASTA file is 20 GB, so it should be possible to construct the BWT in a single pass using SAIS in roughly 200 GB RAM. I reported this issue because of the segfault, which is 😢. I'm happy with the -d 1000000 workaround though.
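
A back-of-the-envelope version of that memory estimate might look like the sketch below. It assumes one byte per base for the text, one 8-byte suffix array entry per base, and that the FASTA file size approximates the number of bases (headers and newlines are ignored). These are illustrative assumptions, not a description of how SGA allocates memory.

```python
def estimate_sais_memory_gb(num_symbols, index_bytes=8, text_bytes=1):
    """Estimate memory for a suffix array built in one pass.

    Assumes (hypothetically) one index of `index_bytes` per symbol plus the
    text itself at `text_bytes` per symbol, and nothing else.
    """
    return num_symbols * (index_bytes + text_bytes) / 1e9

fasta_bytes = 20 * 10**9  # 20 GB file, treated as ~20 billion bases
estimate_gb = estimate_sais_memory_gb(fasta_bytes)
print(f"estimated memory: {estimate_gb:.0f} GB")
```

Under these assumptions the estimate is of the same order as the roughly 200 GB quoted above; working buffers would add to it.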

> Did you run out of memory with -d 20000000?

I don't believe so. It was using 76 GB of RAM when it crashed, and the machine has 2.5 TB available.

> It shouldn't take very long.

I'm using sga index -d 1000000 now. It has finished 41 of 69 batches in four hours, so it's trucking along nicely. 🏎

sjackman commented 8 years ago

Have you read Optimal In-Place Suffix Sorting? https://arxiv.org/abs/1610.08305 It seems worth checking out. @rob-p brought it to my attention.

sjackman commented 8 years ago

sga index -d 1000000 completed in 25 hours.

sga index -d 1000000 -t 64 hsapiens.preprocess.filter.pass.merged.fa
205964.05s user 3080.39s system 232% cpu 24:56:18.90 total 9111 MB
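
As a quick sanity check on the timing line, the CPU percentage can be recomputed from the user, system, and wall-clock fields. The sketch below only uses the numbers printed above; reading the result as "average cores busy" is an interpretation, not something the tool reports.

```python
# Values copied from the timing output above.
user_s = 205964.05
system_s = 3080.39
wall_h, wall_m, wall_s = 24, 56, 18.90

wall_total_s = wall_h * 3600 + wall_m * 60 + wall_s
cpu_ratio = (user_s + system_s) / wall_total_s  # average cores kept busy

print(f"wall clock: {wall_total_s / 3600:.1f} h")
print(f"average CPU use: {cpu_ratio * 100:.0f}% of one core")
```

With -t 64 requested, an average of a little over two busy cores suggests that much of the run was not using all threads.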

jts commented 8 years ago

Thanks for the update. I did see that paper on @rob-p's Twitter. It's on my to-read list :)

sjackman commented 8 years ago

Here are the wall-clock and memory results for SGA on human HG004 data with and without fm-merge (a memo to self and for future curious readers).

fm-merge  Wallclock (h)  Peak Memory (GB)
FALSE     65.4           270.35938
TRUE      65.0           82.24316

jts commented 8 years ago

Interesting, thanks! I wouldn't have expected the runtimes to be (nearly) the same, but it is good to see.

sjackman commented 8 years ago

It was surprising to me too. Running fm-merge first speeds up overlap and assemble quite a bit. I found that rmdup after fm-merge didn't remove any sequences. Is it necessary, or did I just get lucky?

sjackman commented 8 years ago

sga index -d 1000000 succeeded. sga index -d 10000000 succeeded. sga index -d 20000000 segfaulted.
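
To narrow the failing range between -d 10000000 and -d 20000000, one option is a simple bisection. The sketch below is generic: `works` is a hypothetical callback that would run sga index with a given -d and report success, and `pretend_index_works` is a stand-in with an arbitrary made-up threshold so the example runs. Each real probe is a full indexing run, so the tolerance should be coarse.

```python
def bisect_threshold(works, lo, hi, tolerance):
    """Narrow [lo, hi], given that works(lo) is True and works(hi) is False."""
    while hi - lo > tolerance:
        mid = (lo + hi) // 2
        if works(mid):
            lo = mid   # mid succeeded: the failure point is above mid
        else:
            hi = mid   # mid failed: the failure point is at or below mid
    return lo, hi

# Hypothetical stand-in for "run sga index -d <d> and check the exit status".
# The threshold is arbitrary and NOT derived from SGA's behaviour.
PRETEND_LIMIT = 14_000_000

def pretend_index_works(d):
    return d <= PRETEND_LIMIT

lo, hi = bisect_threshold(pretend_index_works, 10_000_000, 20_000_000, 1_000_000)
print(f"largest known-good -d: {lo}, smallest known-bad -d: {hi}")
```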