kevlar-dev / kevlar

Reference-free variant discovery in large eukaryotic genomes
https://kevlar.readthedocs.io
MIT License

Write k-mers to banded counttables in a single pass #31

Open standage opened 7 years ago

standage commented 7 years ago

Suggestion from @drtamermansour: write count tables to N files (one for each band) in a single pass over the data. Then running kevlar find in N bands would not require N passes over the entire data set, just loading the count tables from disk N times.
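To make the idea concrete, here is a minimal sketch of single-pass banded counting. It is not kevlar's or khmer's implementation: it assumes a band is chosen as hash(k-mer) mod N, uses `zlib.crc32` as a stand-in hash, and uses plain `Counter` objects in place of real counttables. The names `count_banded`, `band_of`, and `save_tables` are hypothetical.

```python
import json
import zlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer in a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def band_of(kmer, num_bands):
    """Assign a k-mer to a band (assumption: stand-in hash, modulo N)."""
    return zlib.crc32(kmer.encode()) % num_bands


def count_banded(reads, k, num_bands):
    """Fill one table per band while reading the input only once."""
    tables = [Counter() for _ in range(num_bands)]
    for read in reads:
        for kmer in kmers(read, k):
            tables[band_of(kmer, num_bands)][kmer] += 1
    return tables


def save_tables(tables, prefix):
    """Write each band's table to its own file (hypothetical JSON layout)."""
    for band, table in enumerate(tables):
        with open(f"{prefix}.band{band}.json", "w") as fh:
            json.dump(table, fh)


reads = ["ACGTACGT", "CGTACG"]
tables = count_banded(reads, k=3, num_bands=4)
```

Each k-mer lands in exactly one table, so a later per-band step would only need to load its own file. One caveat with this sketch: all N tables are held in memory at once during the pass, which may be part of the trade-off to weigh.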

I just wanted to capture this suggestion. I have some concerns, and I'm not sure it would yield much benefit.

And in any case, this is all optimization: there's still work to do to get reliable results first!


[1] There are ways we could investigate to do this in a streaming fashion, but for now I'm happy with saying we have to do a second pass over the reads. :-)

standage commented 7 years ago

After having run the pipeline a few more times, I'm less confident in the first bullet point now. If there were a way to write N banded counttables to disk with a single pass over the data, it could potentially make a big difference.