openvax / varcode

Library for manipulating genomic variants and predicting their effects
Apache License 2.0

support filtering a VariantCollection according to an intervals list file #59

Open timodonnell opened 9 years ago

timodonnell commented 9 years ago

For some of our analyses it would be helpful to be able to filter a VariantCollection to only those variants that fall within the intended capture targets.

Example intervals list:

[odonnt02@minerva4 ~]$ head /sc/orga/projects/ngs/resources/captures/2.3/Human_All_Exon_V5.hg19.interval_list
@HD     VN:1.4  SO:unsorted
@SQ     SN:chrM LN:16571        UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ     SN:chr1 LN:249250621    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ     SN:chr2 LN:243199373    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:a0d9851da00400dec1098a9255ac712e
@SQ     SN:chr3 LN:198022430    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:641e4338fa8d52a5b781bd2a2c08d3c3
@SQ     SN:chr4 LN:191154276    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:23dccd106897542ad87d2765d28a19a1
@SQ     SN:chr5 LN:180915260    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:0740173db9ffd264d728f32784845cd7
@SQ     SN:chr6 LN:171115067    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:1d3a93a248d92a729ee764823acbbc6b
@SQ     SN:chr7 LN:159138663    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:618366e953d6aaad97dbe4777c29375e
@SQ     SN:chr8 LN:146364022    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:96f514a9929e410c6651697bded59aec

GATK also supports a few other formats (probably not needed here though): https://www.broadinstitute.org/gatk/guide/article?id=1319
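For reference, a minimal sketch of reading this kind of file, assuming the Picard-style layout: `@`-prefixed header lines followed by tab-separated records of contig, start, end, strand, and name, with 1-based inclusive coordinates. The function names `parse_interval_list` and `position_in_intervals` are hypothetical and not part of varcode, and the example records are made up (the `head` output above only shows header lines).

```python
from collections import defaultdict


def parse_interval_list(lines):
    """Return {contig: [(start, end), ...]} using 1-based inclusive coordinates."""
    intervals = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and @HD / @SQ header lines
        fields = line.split("\t")
        contig, start, end = fields[0], int(fields[1]), int(fields[2])
        intervals[contig].append((start, end))
    return dict(intervals)


def position_in_intervals(intervals, contig, position):
    """True if the 1-based position falls inside any interval on that contig."""
    return any(start <= position <= end
               for start, end in intervals.get(contig, []))


# Made-up records for illustration only.
example_lines = [
    "@HD\tVN:1.4\tSO:unsorted\n",
    "@SQ\tSN:chr1\tLN:249250621\n",
    "chr1\t100\t200\t+\ttarget_1\n",
    "chr2\t50\t60\t+\ttarget_2\n",
]
example_intervals = parse_interval_list(example_lines)
```

A linear scan per lookup is fine for a sketch; for a full exome capture an interval tree or a sorted list with bisection would likely be a better fit.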

iskandr commented 9 years ago

I'm also thinking about the best way to filter variants by expression level (of either genes, transcripts, or allele-specific read count). Do you think these two use cases have enough in common to suggest a filtering API?

timodonnell commented 9 years ago

Maybe just a VariantCollection.filter function that takes a variant -> bool callable and returns a new VariantCollection?

We could then add a new module with filter implementations, including both your example and mine.
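A rough sketch of what that could look like. The `Variant` and `VariantCollection` classes here are simplified stand-ins rather than varcode's real classes, and `within_intervals` and `min_expression` are hypothetical names for the two filters discussed in this thread. The expression filter assumes the caller supplies a function mapping a variant to a number.

```python
from collections import namedtuple

# Simplified stand-in for illustration; not varcode's actual Variant class.
Variant = namedtuple("Variant", ["contig", "start", "ref", "alt"])


class VariantCollection:
    """Simplified stand-in holding a list of variants."""

    def __init__(self, variants):
        self.variants = list(variants)

    def filter(self, predicate):
        """Return a new collection with the variants where predicate(variant) is true."""
        return VariantCollection(v for v in self.variants if predicate(v))

    def __len__(self):
        return len(self.variants)


def within_intervals(intervals):
    """Predicate factory; intervals is {contig: [(start, end), ...]}, 1-based inclusive."""
    def predicate(variant):
        return any(start <= variant.start <= end
                   for start, end in intervals.get(variant.contig, []))
    return predicate


def min_expression(expression_of, threshold):
    """Predicate factory; expression_of is a caller-supplied variant -> number function."""
    def predicate(variant):
        return expression_of(variant) >= threshold
    return predicate


# Made-up data for illustration only.
variants = VariantCollection([
    Variant("chr1", 150, "A", "T"),
    Variant("chr1", 500, "G", "C"),
    Variant("chr2", 55, "C", "CA"),
])
targets = {"chr1": [(100, 200)], "chr2": [(50, 60)]}
on_target = variants.filter(within_intervals(targets))
```

Because each filter is just a callable, the two use cases compose by chaining `filter` calls, and the original collection is left unchanged.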