Open sjackman opened 8 years ago
Can `sga index -a ropebwt` work with the output of `sga fm-merge`? The mean sequence size is 300 bp, and the largest sequence is 30,889 bp.
Did you run out of memory with `-d 20000000`? Without `-a ropebwt`, a memory-inefficient algorithm is used. There is no 2 (or 4) billion nucleotide batch limit.
Whether it is worth using `-a ropebwt` depends on the read length distribution. I suggest sticking with the recommended parameters (not ropebwt, `-d X`). It shouldn't take very long.
The fm-merge FASTA file is 20 GB, so it should be possible to construct the BWT in a single pass using SAIS in roughly 200 GB RAM. I reported this issue because of the segfault, which is 😢. I'm happy with the `-d 1000000` workaround though.
> Did you run out of memory with `-d 20000000`?

I don't believe so. It was using 76 GB of RAM when it crashed, and the machine has 2.5 TB available.
> It shouldn't take very long.

I'm using `sga index -d 1000000` now. It has finished 41 of 69 batches in four hours, so it's trucking along nicely. 🏎
Have you read Optimal In-Place Suffix Sorting? https://arxiv.org/abs/1610.08305 It seems worth checking out. @rob-p brought it to my attention.
`sga index -d 1000000` completed in 25 hours.
sga index -d 1000000 -t 64 hsapiens.preprocess.filter.pass.merged.fa
205964.05s user 3080.39s system 232% cpu 24:56:18.90 total 9111 MB
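As a quick sanity check on the timing line above, the reported CPU utilisation and the wall-clock time in hours can be recomputed from the user, system, and elapsed fields. This is only a sketch: it assumes the line is shell `time` output in `user / system / cpu / total` order, and the helper name `parse_elapsed` is made up for illustration.

```python
def parse_elapsed(elapsed):
    """Convert an 'H:MM:SS.ss' elapsed-time string to seconds."""
    hours, minutes, seconds = elapsed.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

# Values copied from the timing line above.
user_s = 205964.05
system_s = 3080.39
wall_s = parse_elapsed("24:56:18.90")

# CPU utilisation = total CPU time / wall-clock time.
cpu_percent = 100 * (user_s + system_s) / wall_s
wall_hours = wall_s / 3600

if __name__ == "__main__":
    print(f"wall clock: {wall_hours:.2f} h")
    print(f"CPU utilisation: {cpu_percent:.1f}%")
```

Averaged over the whole run, this works out to a little over two cores busy, even though `-t 64` was requested.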
Thanks for the update. I did see that paper from @rob-p's Twitter - it's on my to-read list :)
Here are the wall-clock and peak-memory results for SGA on human HG004 data with and without `fm-merge` (a memo to self and for future curious readers).
fm-merge | Wallclock (h) | Peak Memory (GB) |
---|---|---|
FALSE | 65.4 | 270.35938 |
TRUE | 65.0 | 82.24316 |
Interesting, thanks! I wouldn't have expected the runtimes to be (nearly) the same, but it is good to see.
It was surprising to me too. Running `fm-merge` first speeds up `overlap` and `assemble` quite a bit. I found that `rmdup` after `fm-merge` didn't remove any sequences. Is it necessary, or did I just get lucky?
`sga index -d 1000000` succeeded.
`sga index -d 10000000` succeeded.
`sga index -d 20000000` segfaulted.
The command `sga index -d 20000000 -t 64 hsapiens.preprocess.filter.pass.merged.fa` segfaults with `-d 20000000`. Reducing to `-d 1000000` works. Is each BWT batch size limited in size, perhaps to 2 or 4 billion nucleotides? `-d 20000000` with a mean sequence size of ~300 bp should correspond to a batch size of about 6 Gbp.
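The arithmetic behind that question can be sketched as follows. It assumes every sequence has the mean length of about 300 bp, which is only an approximation, and it compares the estimated batch sizes with the signed and unsigned 32-bit integer maxima. The function names are illustrative, and nothing here reflects how SGA actually stores batch sizes.

```python
# Rough batch sizes for the -d values tried in this thread.
# Assumption: every sequence has the mean length of ~300 bp.
MEAN_SEQ_LEN_BP = 300
INT32_MAX = 2**31 - 1    # ~2.1 billion
UINT32_MAX = 2**32 - 1   # ~4.3 billion

def batch_size_bp(num_seqs, mean_len=MEAN_SEQ_LEN_BP):
    """Estimated nucleotides per batch for a given -d value."""
    return num_seqs * mean_len

def report(d_values=(1_000_000, 10_000_000, 20_000_000)):
    """Return (d, estimated bp, exceeds int32 max, exceeds uint32 max) per -d value."""
    rows = []
    for d in d_values:
        bp = batch_size_bp(d)
        rows.append((d, bp, bp > INT32_MAX, bp > UINT32_MAX))
    return rows

if __name__ == "__main__":
    for d, bp, over_int32, over_uint32 in report():
        print(f"-d {d:>10,}: ~{bp / 1e9:.1f} Gbp, "
              f"over int32 max: {over_int32}, over uint32 max: {over_uint32}")
```

Under this estimate, `-d 20000000` gives about 6 Gbp per batch, which is above both maxima, while `-d 10000000` gives about 3 Gbp, which is above the signed maximum only.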