Open jgans opened 4 years ago
The big difference between these two runs is ERR191522 contains alignments and DRR001375 contains only unaligned reads.
In the unaligned case, the reads are stored together in one record. The access pattern in your code snippet will be in the same order as the data is laid out in the file.
In the aligned case, the reads are stored in the order they align on the reference. The two mate pairs are not stored together and might be far apart in the file. Your code snippet will reconstruct the whole read which requires random access I/O to the file. Once upon a time, this wasn't a big problem, but these days big disks are structured as bucket stores with random I/O being very expensive. To try to ameliorate this, we have tuned our underlying database technology to cache aggressively. If your usage does NOT require both mate pairs, it could be sped up.
We are creating an internal ticket to look into this behavior. Thanks for the report and the graphs!
For the particular problem we're trying to solve, we do not need to access mate pairs together (or in any particular order). Is there a code example that illustrates how to access read data without the need for aggressive caching or random access I/O?
When a run that is aligned has been detected, the reads will be available first through their primary alignments. If these are retrieved as individual reads in absence of quality scores, they will generate the best performance. The remaining reads will be those that did not align, and they can be fetched separately.
You start with a ReadCollection
.
You can then ask how many primary alignments it has with ReadCollection::getAlignmentCount(Alignment::primaryAlignment)
.
If there are alignments, obtain them via ReadCollection::getAlignments(Alignment::primaryAlignment)
and iterate across them.
The aligned read is obtained via Alignment::getAlignedFragmentBases()
. You may need to compensate for orientation.
Once finished, look for unaligned reads via ReadCollection::getReadCount(Read::unaligned)
, and similarly obtain an iterator on them via ReadCollection::getReads(Read::unaligned)
.
The ReadIterator is an extension of FragmentIterator, and should be treated as such. Otherwise, you will tend to assemble paired reads which will cause issues.
Walk the reads using FragmentIterator::nextFragment()
and obtain bases with Fragment::getFragmentBases()
.
It's also useful to avoid converting a StringRef
into a std::string
for everything - do so only when necessary. Try to use StringRef
to access data and size rather than always NUL-terminating.
Okay, I think that's it - off the top of my head. If I left anything out, @durbrow please correct me!
When counting the number of sequences in the primary alignment and unaligned reads, I get a slightly different sequence count from the value obtained by iterating over getReadCount(ngs::Read::all)
"raw" reads. Is there another source of read data in an SRA record?
For example, in SRR10742149:
getAlignments(ngs::Alignment::primaryAlignment)
ReadIterator( run.getReads(ngs::Read::unaligned) )
Total number of sequences = 83490707
However, there are 83491358 sequences in 41745679 reads, using ngs::ReadIterator( run.getReadRange ( 1, getReadCount(ngs::Read::all), ngs::Read::all ) )
.
Yes, I probably should have mentioned that.
As you see, ngs::Read::all
gives aligned, partially aligned, and unaligned reads. Looking at the primary alignments gets you the ngs::Read::aligned
category, and ngs::Read::unaligned
gets you the cases where neither mate is aligned.
For the cases where one mate is aligned but the other not, these are partially aligned and we don't have a nice, clean category for them. If you ask for all but iterate as fragments, this may be what you need.
Thank you for suggesting this strategy! It does indeed provide a significant reduction in memory consumed and a significant speedup as well. As shown below, the memory consumption required to read through ERR191522 is an order-of-magnitude less than the previous, naive strategy, and the run-time is a factor of 14 faster (16 min vs 225 min):
To facilitate comparison with the previous read pairs/sec scale, this graph shows the number of aligned sequences/2 per second. The memory consumption is now the "change in memory consumption" (starting from the beginning of program execution). For my particular application, this performance improvement is large enough to justify ignoring the partially aligned read pairs.
Great to hear!
One of the things that is problematic is assembling the mates. If you can take the mates in isolation, you'll have a better day.
While the above strategy of first loading primary alignments and then loading unaligned reads works with many (most?) SRA records (like the above mentioned ERR191522), there appear to be some SRA records for which this strategy generates an error.
For example, all of the 2007231 primary alignments in ERR634825 can be read, but the following C++ exception is thrown when attempting iterate through the 846553 unaligned reads: VCursorAddColumn failed: '(INSDC:dna:text)CMP_READ' rc = RC(rcVDB,rcCursor,rcUpdating,rcColumn,rcNotFound)
. An incomplete list of other examples of SRA records that yield this same error include: ERR634636, ERR634688, ERR634695, and ERR634696. All of these records appear to be single read, AB Solid4 experiments from the same study.
Ah yes, aligned colorspace! They are broken, and unfortunately they can't (*) be fixed. Is it crucial for you to have these reads? Although I don't have a count, I do know there aren't very many these in the SRA.
* the people who decide such things decided it wasn't worth the expense
These reads are not crucial. However, is it possible to interrogate an ngs::ReadCollection
object to detect aligned colorspace data prior to loading reads? I could then use ngs::ReadIterator(ngs::Read::all)
to iterate over all of the reads in these SRA records (instead of reading primary alignments and unaligned reads).
As it stands, I can identify records that can successfully load 100% of the ngs::Alignment::primaryAlignment
data but fail on the first ngs::Read::unaligned
read. The main downside of this approach would be the wast of time reading all of the primary alignments.
I am reading sequence data from SRA records by first downloading the SRA record with the
prefetch
command and then iterating through the file using the C++ interface (version 2.10.8), i.e.:In general, this approach seems to work well. However, I have noticed that for some SRA records (like ERR191522), there is (a) significant memory consumption and (b) a dramatic slow-down when iterating through the file. The following plot shows the speed (in reads per second) and memory consumption (from
/proc/meminfo
, reported as a fraction of total system memory):Other SRA records seem to be fine. For DRR001375, the following graph shows fairly constant speed and memory usage:
Is there a way to read SRA records, like ERR191522, without the large memory consumption? If not, is there a way to identify SRA records (in advance) that will exhibit this behavior (as the available RAM on on cluster instances can easily be exhausted while processing a single SRA record).