lh3 / bgt

Flexible genotype query among 30,000+ samples whole-genome
MIT License

Handling large annotation sets #2

Open ekg opened 9 years ago

ekg commented 9 years ago

Could you provide efficient queries across the annotations using an FM-index over the concatenated annotation strings from the VCF file? A second, compressed bitvector could mark where each variant's annotation starts in the concatenated text (basically storing a variant-to-annotation mapping).

Then you could subset to the records carrying a particular annotation by locating the occurrences of a given pattern and taking their ranks in the auxiliary bitvector.
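
A minimal sketch of the idea, under loud assumptions: a plain substring scan stands in for the FM-index locate step, and a sorted list of start offsets with binary search stands in for rank on a compressed bitvector. The function names, the NUL separator, and the sample annotations are all hypothetical and are not part of BGT or any VCF tool.

```python
from bisect import bisect_right

SEP = "\x00"  # assumed separator; must not occur inside any annotation string


def build_index(annotations):
    """Concatenate per-record annotation strings and record each start offset.

    The sorted `starts` list plays the role of the auxiliary bitvector:
    a 1 bit at every offset where a record's annotation begins.
    """
    starts = []
    pos = 0
    for ann in annotations:
        starts.append(pos)
        pos += len(ann) + len(SEP)
    text = SEP.join(annotations) + SEP
    return text, starts


def find_occurrences(text, pattern):
    """Stand-in for FM-index locate: return every offset where pattern occurs."""
    hits = []
    i = text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def records_matching(text, starts, pattern):
    """Map each hit offset to its record via rank: (number of starts <= offset) - 1."""
    hits = find_occurrences(text, pattern)
    return sorted({bisect_right(starts, p) - 1 for p in hits})


# Hypothetical annotation strings, one per variant record.
annotations = [
    "missense_variant;GENE=BRCA2",
    "synonymous_variant;GENE=TP53",
    "missense_variant;GENE=TP53",
]
text, starts = build_index(annotations)
print(records_matching(text, starts, "missense_variant"))
```

With a real FM-index the locate step would cost time proportional to the pattern length plus the number of hits rather than a scan of the whole text, which is where the benefit over reading every record would come from.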

I guess this wouldn't help much when the query has to compare floats in the annotations and the annotation is present in all records. Then you end up comparing lots of values to execute the query. There might be a way around this, though.

ekg commented 9 years ago

Not the most precise definition of what I mean so let me know if it needs clarification.

lh3 commented 9 years ago

BGT has a different design from VCF. I see annotating each VCF as a waste of resources, so I encourage using a single variant annotation file for all BGT databases. You locate a particular row in BGT by an allele string like "11:10000:1:C". Currently, BGT reads through the variant annotation file to collect allele strings and then finds the corresponding rows in BGT. It is reasonably fast. The preferred way is really to have a proper disk-based database backend for annotations. SQLite could be an option. Cassandra would be better if performance becomes an issue.
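
As a rough illustration of the SQLite option, the sketch below keeps annotations in one table keyed by allele string and returns the allele strings that match a filter, which could then be used to look up rows in BGT. The table name, the `gene` and `impact` columns, and the sample rows are assumptions made for the example; BGT does not define this schema, and the allele string is treated as an opaque key.

```python
import sqlite3


def build_annotation_db(rows, path=":memory:"):
    """Create a hypothetical annotation table keyed by allele string."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE annotation ("
        "allele TEXT PRIMARY KEY, "  # e.g. '11:10000:1:C'
        "gene TEXT, "
        "impact TEXT)"
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", rows)
    # A secondary index lets the filter avoid scanning the whole table.
    conn.execute("CREATE INDEX idx_annotation_impact ON annotation (impact)")
    conn.commit()
    return conn


def alleles_with_impact(conn, impact):
    """Return allele strings whose (assumed) impact column equals `impact`."""
    cur = conn.execute(
        "SELECT allele FROM annotation WHERE impact = ? ORDER BY allele",
        (impact,),
    )
    return [row[0] for row in cur]


# Hypothetical sample data for illustration only.
rows = [
    ("11:10000:1:C", "GENE_A", "HIGH"),
    ("11:10500:1:T", "GENE_A", "LOW"),
    ("12:500:2:A", "GENE_B", "HIGH"),
]
conn = build_annotation_db(rows)
print(alleles_with_impact(conn, "HIGH"))
```

Passing a file path instead of `":memory:"` would give the disk-based backend described above, shared across BGT databases instead of being duplicated per VCF.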