kevlar-dev / kevlar

Reference-free variant discovery in large eukaryotic genomes
https://kevlar.readthedocs.io
MIT License

Implement max target length filter #366

Closed by standage 5 years ago

standage commented 5 years ago

The --max-diff setting in the localize module has been in place for a while. It sets the distance between adjacent seed matches beyond which a reference target is split into multiple targets. This is effective when a contig maps to multiple locations and kevlar needs to distinguish the optimal alignment(s) for variant calling.
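For illustration, the splitting behavior described above can be sketched roughly as follows. This is not kevlar's actual code: the function name `split_by_max_diff`, the plain list of integer seed positions, and the choice to treat a gap exactly equal to the threshold as "not split" are all assumptions made for the example.

```python
def split_by_max_diff(positions, max_diff):
    """Group seed match positions into reference targets.

    Hypothetical sketch: a new target starts whenever the gap between
    two adjacent seed matches exceeds max_diff (inclusive threshold is
    an assumption, not taken from kevlar).
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_diff:
            # Close enough to the previous seed match: same target.
            clusters[-1].append(pos)
        else:
            # Gap too large (or first position): start a new target.
            clusters.append([pos])
    # Report each target as a (start, end) interval.
    return [(cluster[0], cluster[-1]) for cluster in clusters]


targets = split_by_max_diff([100, 150, 400, 20000, 20050], max_diff=1000)
print(targets)
```

With these made-up positions, the 19.6 kb gap between 400 and 20000 exceeds the threshold, so the seed matches fall into two separate targets.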

However, in some exceptional cases the reference target includes an array of tandem repeats that spans a huge genomic interval. Adjacent repeat units are separated by several kilobases, a gap small enough that the --max-diff setting does not split them into separate targets. Reducing the --max-diff setting to filter out these loci would compromise kevlar's ability to capture large indels.

This update introduces a complementary filter that discards reference targets so long that they cause severe performance problems at the alignment/calling step. Any reference target longer than L bp (default: 10 kb) will be discarded, and the associated contig will be reported in the VCF with no genomic position.
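A minimal sketch of the length filter, under stated assumptions: the function name `filter_long_targets`, the `(start, end)` tuple representation, and the `max_length` parameter name are hypothetical and do not reflect kevlar's internals. Only the 10 kb default and the "longer than L" rule come from the description above.

```python
def filter_long_targets(targets, max_length=10000):
    """Separate reference targets into kept and discarded lists.

    Hypothetical sketch: a target is discarded only if its span is
    strictly longer than max_length bp (default 10 kb).
    """
    kept, discarded = [], []
    for start, end in targets:
        if end - start > max_length:
            # Too long to align efficiently; the contig would be
            # reported without a genomic position.
            discarded.append((start, end))
        else:
            kept.append((start, end))
    return kept, discarded


# Made-up intervals: a compact 2.5 kb target and a 60 kb target of the
# kind produced by a tandem repeat array that --max-diff fails to split.
kept, discarded = filter_long_targets([(1000, 3500), (50000, 110000)])
print(kept)
print(discarded)
```

Because the filter acts on the total span of a target rather than on the gaps between seed matches, it can remove the repeat-array case without lowering --max-diff.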

This update also includes the latest results from simulation experiments.

codecov-io commented 5 years ago

Codecov Report

Merging #366 into master will increase coverage by 0.02%. The diff coverage is 90.91%.


@@            Coverage Diff             @@
##           master     #366      +/-   ##
==========================================
+ Coverage   96.83%   96.85%   +0.02%     
==========================================
  Files          48       48              
  Lines        2906     2923      +17     
  Branches      538      543       +5     
==========================================
+ Hits         2814     2831      +17     
  Misses         58       58              
  Partials       34       34

Impacted Files        Coverage Δ
kevlar/alac.py        98.18% <ø> (ø)
kevlar/call.py        97.17% <100%> (+2.02%)
kevlar/cli/alac.py    100% <100%> (ø)
kevlar/cli/call.py    100% <100%> (ø)
kevlar/varmap.py      99.05% <100%> (+0.03%)
kevlar/vcf.py         92.76% <71.43%> (-0.57%)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 1f80fd0...e1012ea.