kevlar-dev / kevlar

Reference-free variant discovery in large eukaryotic genomes
https://kevlar.readthedocs.io
MIT License

Reimplement varfilter module #354

Closed: standage closed 5 years ago

standage commented 5 years ago

Testing out a complete rewrite of the varfilter module. Should drastically reduce memory requirements. Run time comparison in progress.

codecov[bot] commented 5 years ago

Codecov Report

Merging #354 into master will decrease coverage by 0.08%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #354      +/-   ##
==========================================
- Coverage   97.13%   97.05%   -0.08%     
==========================================
  Files          48       48              
  Lines        2894     2886       -8     
  Branches      532      533       +1     
==========================================
- Hits         2811     2801      -10     
- Misses         51       52       +1     
- Partials       32       33       +1
Impacted Files             Coverage Δ
kevlar/cli/varfilter.py    100% <ø> (ø)
kevlar/intervalforest.py   77.42% <100%> (-4.06%)
kevlar/vcf.py              96% <100%> (+0.03%)
kevlar/varfilter.py        100% <100%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update af7ae4e...ff64a40.

standage commented 5 years ago

The difference in runtime is tremendous: 3 minutes vs 30+ minutes before.

I don't know if the biggest factor is the difference in building vs querying an interval tree, or Python's object overhead, or what. In any case, storing a much smaller amount of data in memory and streaming the "big" data is always the better idea and should have been my first thought.
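
For readers who want to see the general pattern, here is a minimal sketch of "index the small data, stream the big data". It is not the code from this PR: the function names (`build_index`, `overlaps`, `stream_filter`), the `(chrom, start, end)` half-open region tuples, and the `(chrom, pos)` call tuples are all assumptions made up for illustration. It also swaps the interval tree for merged, sorted intervals queried with `bisect`, which is enough for a yes/no overlap test.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Index the SMALL data set: merge half-open (chrom, start, end)
    intervals into sorted, disjoint lists per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivals in by_chrom.items():
        ivals.sort()
        merged = [list(ivals[0])]
        for start, end in ivals[1:]:
            if start <= merged[-1][1]:
                # Overlapping or adjacent: extend the current interval.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        starts = [s for s, _ in merged]
        ends = [e for _, e in merged]
        index[chrom] = (starts, ends)
    return index


def overlaps(index, chrom, pos):
    """Return True if pos falls inside any indexed interval on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Rightmost interval whose start is <= pos, if any.
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]


def stream_filter(index, calls):
    """Stream the BIG data set: yield only calls outside indexed regions.
    `calls` can be any iterable, so it is never held in memory at once."""
    for chrom, pos in calls:
        if not overlaps(index, chrom, pos):
            yield chrom, pos
```

Because `stream_filter` is a generator, memory use is bounded by the size of the index rather than by the number of records streamed through it. If the query needs to report which intervals overlap (not just whether any does), merging loses that information and a real interval tree would be the better fit.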