The first draft implementation of ProperStatWindowEngine (#456) could have some hotspots for performance improvement as described bellow:
Use IntervalsSkipList/IntervalTreeMap/OverlapDetector or a similar interval-based collection instead of the sorted map for contig-intervals. First implementation uses a sorted-map to get all the intervals for the read's contig, but this isn't efficient if there are lots of intervals - engine would call add to each window even if it is not overlapping in the first place. Using the IntervalsSkipList might help to retrieve all the windows overlapping with the read and the mate and add only to that ones (the query method should be more efficient than adding to every window to have some performance gain).
Whatever the collection is for the windows, another performance improvement is to flush the List to disk for windows where the left-most coordinate (read or mate) is before the current read. This is based on the contract assumption that the reads are added in coordinate sorted way: if the read/mate already passed one window, then no other read will be added. This have several performance improvements: 1) memory-usage is reduced, as less windows are kept in memory; 2) avoid to add the read to previous windows (related to the previous point if we keep the current implementation) or faster queries due to less intervals (if the previous point is addressed). This is a simpler fix, but it can have even worse effects if the added-reads are not coordinate sorted.
For the add method, the for loop can be break for downstream reads. This is similar to the previous point, but for downstream windows (they are not flushed).
The first draft implementation of
ProperStatWindowEngine
(#456) could have some hotspots for performance improvement as described bellow:IntervalsSkipList
/IntervalTreeMap
/OverlapDetector
or a similar interval-based collection instead of the sorted map for contig-intervals. First implementation uses a sorted-map to get all the intervals for the read's contig, but this isn't efficient if there are lots of intervals - engine would calladd
to each window even if it is not overlapping in the first place. Using theIntervalsSkipList
might help to retrieve all the windows overlapping with the read and the mate and add only to that ones (the query method should be more efficient than adding to every window to have some performance gain).List
to disk for windows where the left-most coordinate (read or mate) is before the current read. This is based on the contract assumption that the reads are added in coordinate sorted way: if the read/mate already passed one window, then no other read will be added. This have several performance improvements: 1) memory-usage is reduced, as less windows are kept in memory; 2) avoid to add the read to previous windows (related to the previous point if we keep the current implementation) or faster queries due to less intervals (if the previous point is addressed). This is a simpler fix, but it can have even worse effects if the added-reads are not coordinate sorted.add
method, the for loop can be break for downstream reads. This is similar to the previous point, but for downstream windows (they are not flushed).