Open · standage opened 7 years ago
Can you do your preprocessing on a per-read basis, or is it per k-mer? Wondering if you could speed things along by passing a whole read to `consume_string` instead of calling `add` for each k-mer. It might reduce the (notoriously large) function call overhead of Python, and maybe we make some gains by having the loop in C++?
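To illustrate the trade-off being proposed, here is a minimal sketch. The `count_per_kmer` and `count_per_read` functions, and the plain `Counter` standing in for khmer's Count-Min sketch, are all hypothetical stand-ins: the real `add` and `consume_string` live in khmer's C++ layer, which is exactly why moving the inner loop behind one call per read would cut Python call overhead.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer (length-k substring) of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def count_per_kmer(counts, seq, k):
    """One Python-level call per k-mer, analogous to calling add()
    in a Python loop: the call overhead is paid for every k-mer."""
    for kmer in kmers(seq, k):
        counts[kmer] += 1

def count_per_read(counts, seq, k):
    """One call per read, analogous to consume_string(): in khmer the
    k-mer loop would run in C++, amortizing the call overhead."""
    counts.update(kmers(seq, k))
```

Both approaches produce identical counts; the difference is purely where the inner loop runs and how many Python-level calls are made.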
Perhaps. I'd need to create a `consume_string_banding` function, but we probably want to do that anyway.
That said, based on my most recent discussion with Fereydoun, we're probably going to settle for discarding reads containing non-ACGT characters. These represent a small portion of the data, and if any of our variant calls come only from these potentially problematic reads, that's cause for skepticism anyway. And if I understand correctly, the Cython-based read pre-processing is moving in this direction too, so this should make our lives much easier!
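The discard-on-non-ACGT policy can be sketched in a few lines. This is only an illustration of the filtering rule, not kevlar's actual implementation; the `keep_read` name is made up here.

```python
import re

# Any character outside the unambiguous DNA alphabet disqualifies a read.
NON_ACGT = re.compile('[^ACGT]')

def keep_read(seq):
    """Return True if the read contains only A, C, G, and T
    (case-insensitive); reads with N or other IUPAC codes are dropped."""
    return NON_ACGT.search(seq.upper()) is None
```

A single precompiled regex search per read is cheap compared with the per-k-mer splitting that non-ACGT handling would otherwise require.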
At the moment kevlar is painfully slow. We've been philosophizing for a while now whether this was more likely due to the poor cache locality of khmer's Count-Min Sketch implementation or to slow Fastq parsing/handling, or some other factor or combination of factors. It's time to evaluate this empirically.
I profiled both `kevlar count` and `kevlar novel` on the new simulated data set I created in #83. It's small enough for kevlar to process in several minutes, but large enough to observe meaningful numbers in the profile. For `kevlar novel`, I computed k-mer counts from scratch rather than using precomputed counts. Here are the top 25 functions, sorted by time spent, for each.
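For reference, a "top N functions by time spent" report like the one described can be produced with the standard library's `cProfile` and `pstats`. This helper is a generic sketch (the `profile_top` name is mine), not the exact invocation used here.

```python
import cProfile
import io
import pstats

def profile_top(func, *args, n=25):
    """Run func(*args) under cProfile and return a report of the top-n
    functions sorted by internal time (tottime), as a string."""
    prof = cProfile.Profile()
    prof.enable()
    func(*args)
    prof.disable()
    buf = io.StringIO()
    # 'tottime' sorts by time spent inside each function itself,
    # excluding time in its callees.
    pstats.Stats(prof, stream=buf).sort_stats('tottime').print_stats(n)
    return buf.getvalue()
```

Sorting by `tottime` rather than `cumtime` is what surfaces hot leaf functions like `add` and `get` rather than the top-level drivers that merely call them.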
In both cases, the `add` and `get` functions for incrementing and querying k-mer counts in the Count-Min sketch dominate the runtime, with the `get_kmers` and `re.split` functions as heavy hitters as well. The latter two reflect the fact that khmer's bulk loading functions don't support the kind of preprocessing we need for kevlar, so I'm doing it in Python. This incurs overhead in sending the data from C++ to Python objects and then doing the processing in Python.

As far as priorities go, I'm not sure there's much we can do about the Count-Min sketch implementation. We could try buffering (collect N≈1e4 `add` operations before actually incrementing the tables), but honestly, once the CQF-based counter is integrated we may already get much better performance from that. For sequence loading, I'll continue to lobby for better support for multiple pre-processing strategies in khmer's bulk Fastq loading code.
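The buffering idea mentioned above can be sketched as a thin wrapper that batches `add` operations before touching the backing structure. Everything here is hypothetical: `BufferedCounter` is not a khmer class, and a plain `Counter` stands in for the sketch; the real win would come from batching updates on the C++ side, where batched keys could also be sorted to improve cache locality across the sketch's tables.

```python
from collections import Counter

class BufferedCounter:
    """Accumulate k-mer add operations in a Python list and flush them
    to the backing counter in batches of `bufsize`, amortizing
    per-operation overhead (illustrative sketch only)."""

    def __init__(self, backing, bufsize=10000):
        self.backing = backing    # stands in for the Count-Min sketch
        self.bufsize = bufsize    # N ≈ 1e4 per the discussion above
        self.buffer = []

    def add(self, kmer):
        self.buffer.append(kmer)
        if len(self.buffer) >= self.bufsize:
            self.flush()

    def flush(self):
        """Apply all buffered increments in one bulk update."""
        self.backing.update(self.buffer)
        self.buffer.clear()
```

Callers must remember the final `flush()` after the last read, or trailing k-mers stay uncounted.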