kimrutherford opened this issue 11 years ago
Hi Kim,
SGA represents read IDs using a 32-bit integer to save space. This limits the maximum data set size to just over 4 billion reads. It looks like your data set has now exceeded this limit. Luckily it should be a one-line fix:
https://github.com/jts/sga/blob/master/src/SuffixTools/SampledSuffixArray.h#L19
Change this line to:

typedef uint64_t SSA_INT_TYPE;

then recompile, and your index should build successfully. I'd rather not make this change directly in the main codebase, since all users would have to pay the extra memory cost. I'd consider making it a compile-time flag, though.
Let me know if this fixes the problem.
Jared
Hi Jared.
Thanks very much for the speedy reply.
I've made that change. I needed to change SAReader too to make it compile: https://github.com/kimrutherford/sga/commit/2572d3528e626c6d8f7ff487ada37eed9d1d8567
I hope that's an appropriate fix.
I'm running sga index again now with the change. I'll update this issue to let you know how it goes.
Thanks for your help.
Good catch. I just noticed you'll need to change this one as well:
It should really be refactored so that all of these use SSA_INT_TYPE.
Jared
Thanks Jared. I've made that change and I've restarted sga index.
Hi Kim,
Can I close this issue?
Sorry for not following up. I'm still getting the assertion failure:
sga: SampledSuffixArray.cpp:167: void SampledSuffixArray::buildLexicoIndex(const BWT*, int): Assertion `numStrings < MAX_ELEMS' failed.
I made another change that I thought would help, but I'm still seeing the same problem. Here's the other change I made: https://github.com/kimrutherford/sga/commit/fe427bd14969ea76ec2b7877a332d8c83b926c02
I didn't get any further.
While running sga index I get this:
The input file "preprocessed.fastq" was produced by sga preprocess and looks OK as far as I can tell.
The input is rather large: preprocessed.fastq is 1.1 TB. The genome is tuatara, which is ~5 Gbp and repetitive. We have very high coverage, hence the 1.1 TB of sequence.
I tried sga index/preqc on a slightly smaller tuatara dataset a month ago and everything worked well. (preqc is very nice!) We added ~100 GB of new data this week so I'm trying again.
I'm using the head of the master branch from GitHub: 6a5dc43b7e1495040e which has worked fine for several other (smaller) datasets.
The core file was truncated by a ulimit, so it probably isn't much use. gdb says:
I'm happy to run it again with no ulimit if that would be useful. It takes a day or two to run though.
Thanks for your help.