jts / sga

de novo sequence assembler using string graphs
http://genome.cshlp.org/content/22/3/549

"Assertion `numStrings < MAX_ELEMS' failed" #53

Open kimrutherford opened 11 years ago

kimrutherford commented 11 years ago

While running sga index I get this:

Building index for preprocessed.fastq in memory using ropebwt
         done bwt construction, generating .sai file
sga: SampledSuffixArray.cpp:163: void SampledSuffixArray::buildLexicoIndex(const BWT*, int): Assertion `numStrings < MAX_ELEMS' failed.
Aborted (core dumped)

The input file "preprocessed.fastq" was produced by sga preprocess and looks OK as far as I can tell.

The input is rather large: preprocessed.fastq is 1.1 TB. The genome is tuatara which is ~5GB and repetitive. We have very high coverage, hence the 1.1 TB of sequence.

I tried sga index/preqc on a slightly smaller tuatara dataset a month ago and everything worked well. (preqc is very nice!) We added ~100 GB of new data this week so I'm trying again.

I'm using the head of the master branch from GitHub: 6a5dc43b7e1495040e which has worked fine for several other (smaller) datasets.

The core file was truncated by a ulimit so it probably isn't much use. gdb says:

Reading symbols from /usr/local/sga-v0.9.4-908-g6a5dc43/bin/sga...(no debugging symbols found)...done.
BFD: Warning: /var/scratch/tuatara/v6/sga/core is truncated: expected core file size >= 171233517568, found: 102400000000.
[New LWP 26492]
Cannot access memory at address 0x7f781683b1a8
Cannot access memory at address 0x7f781683b1a0
(gdb) bt
#0  0x00007f7815063475 in ?? ()
Cannot access memory at address 0x7fffda223448

I'm happy to run it again with no ulimit if that would be useful. It takes a day or two to run though.

Thanks for your help.

jts commented 11 years ago

Hi Kim,

SGA represents read IDs using a 32-bit integer to save space. This limits the maximum data set size to just over 4 billion reads. It looks like your data set has now exceeded this limit. Luckily it should be a one-line fix:

https://github.com/jts/sga/blob/master/src/SuffixTools/SampledSuffixArray.h#L19

Change this line to typedef uint64_t SSA_INT_TYPE; then recompile, and your index should build successfully. I'd rather not make this change directly in the main codebase since all users would have to pay the extra memory cost. I'd consider making it a compile-time flag though.
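
As a rough illustration of the trade-off described above, here is a minimal sketch, not SGA's actual source. Only SSA_INT_TYPE and the `numStrings < MAX_ELEMS` condition come from this thread. The SSA_USE_64BIT macro and the helpers fitsInIndex, assertIndexable and idBytes are hypothetical names, and treating the limit as the largest value the ID type can hold is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical compile-time flag (not an existing SGA build option):
// compile with -DSSA_USE_64BIT to widen the ID type.
#ifdef SSA_USE_64BIT
typedef uint64_t SSA_INT_TYPE;
#else
typedef uint32_t SSA_INT_TYPE;
#endif

// Assumption: the limit is the largest value the ID type can represent.
template <typename IdType>
bool fitsInIndex(uint64_t numStrings)
{
    return numStrings < std::numeric_limits<IdType>::max();
}

// Same shape as the failing assertion, for a chosen ID type.
template <typename IdType>
void assertIndexable(uint64_t numStrings)
{
    assert(fitsInIndex<IdType>(numStrings));
}

// Bytes needed to store one ID per read: this is the memory cost
// that doubles when moving from 32-bit to 64-bit IDs.
template <typename IdType>
uint64_t idBytes(uint64_t numStrings)
{
    return numStrings * sizeof(IdType);
}
```

Under these assumptions, fitsInIndex<uint32_t> accepts 4 billion reads but rejects 5 billion, while fitsInIndex<uint64_t> accepts both, at the price of 8 bytes per stored ID instead of 4.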

Let me know if this fixes the problem.

Jared

kimrutherford commented 11 years ago

Hi Jared.

Thanks very much for the speedy reply.

I've made that change. I needed to change SAReader too to make it compile: https://github.com/kimrutherford/sga/commit/2572d3528e626c6d8f7ff487ada37eed9d1d8567

I hope that's an appropriate fix.

I'm running sga index again now with the change. I'll update this issue to let you know how it goes.

Thanks for your help.

jts commented 11 years ago

Good catch. I just noticed you'll need to change this one as well:

https://github.com/kimrutherford/sga/blob/2572d3528e626c6d8f7ff487ada37eed9d1d8567/src/SuffixTools/SAReader.cpp#L79

It should really be refactored so that all these read from SSA_INT_TYPE.
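
One way to picture the refactoring suggested above is a single pair of helpers whose width follows the typedef, so that no call site names uint32_t or uint64_t directly. This is only a sketch: readElem, writeElem and roundTrip are hypothetical names, and the raw binary layout used here is an assumption for illustration, not a description of SGA's actual file format.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// The one place the width is chosen (64-bit here, as in the fix above).
typedef uint64_t SSA_INT_TYPE;

// Hypothetical helper: read one value of type T as raw bytes.
// Returns false if the stream did not hold sizeof(T) bytes.
template <typename T>
bool readElem(std::istream& in, T& value)
{
    in.read(reinterpret_cast<char*>(&value), sizeof(T));
    return in.gcount() == static_cast<std::streamsize>(sizeof(T));
}

// Hypothetical helper: write one value of type T as raw bytes.
template <typename T>
void writeElem(std::ostream& out, const T& value)
{
    out.write(reinterpret_cast<const char*>(&value), sizeof(T));
}

// Round-trip one ID through an in-memory stream. Because both helpers
// take SSA_INT_TYPE, changing the typedef changes every read and write.
SSA_INT_TYPE roundTrip(SSA_INT_TYPE value)
{
    std::stringstream buffer;
    writeElem(buffer, value);
    SSA_INT_TYPE result = 0;
    bool ok = readElem(buffer, result);
    assert(ok);
    (void)ok; // avoid an unused-variable warning when NDEBUG is defined
    return result;
}
```

With this shape, a value above the 32-bit range survives the round trip only because the typedef is 64-bit, and a mismatch between writer and reader widths cannot be introduced at an individual call site.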

Jared

kimrutherford commented 11 years ago

Thanks Jared. I've made that change and I've restarted sga index.

jts commented 10 years ago

Hi Kim,

Can I close this issue?

kimrutherford commented 10 years ago

Sorry for not following up. I'm still getting the assertion failure:

 sga: SampledSuffixArray.cpp:167: void SampledSuffixArray::buildLexicoIndex(const BWT*, int): Assertion `numStrings < MAX_ELEMS' failed.

I made another change that I thought would help, but I'm still seeing the same problem. Here's the other change I made: https://github.com/kimrutherford/sga/commit/fe427bd14969ea76ec2b7877a332d8c83b926c02

I didn't get any further.