Open gaow opened 7 years ago
I think we first need to find a statistical method that uses phase information. Then we can add phase support while we add support for the method.
At least some methods for association with family data and for haplotype association will need phase info (though a lot of them do not!). But I was not even thinking that far yet. There is something more elementary that would use phase info. For example, advanced methods for calculating LD with shrinkage require the data to be phased, and this is a QC step. The input would just be a phased haplotype matrix, computed one way or another by a different software. To justify this QC: a good estimate of LD is crucial to fine-mapping with WGS data, compared to using just common variants.
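To make the LD point concrete, here is a minimal sketch of computing pairwise r^2 directly from a phased haplotype matrix. This is the plain r^2 estimator, not a shrinkage method, and the function name `ld_r2` and the toy data are made up for illustration; nothing here is existing vtools/VAT code.

```python
def ld_r2(haplotypes, i, j):
    """Plain r^2 between variants i and j.

    `haplotypes` is a list of rows, one per phased haplotype,
    with 0 = reference allele and 1 = alternate allele.
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n          # alt allele frequency at i
    p_j = sum(h[j] for h in haplotypes) / n          # alt allele frequency at j
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ij - p_i * p_j                             # LD coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return float("nan")  # undefined when either variant is monomorphic
    return d * d / denom


# Toy data (hypothetical): 4 haplotypes x 3 variants.
haplotypes = [
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]

r2_linked = ld_r2(haplotypes, 0, 1)    # variants 0 and 1 always travel together
r2_unlinked = ld_r2(haplotypes, 0, 2)  # variant 2 is independent of variant 0
```

The haplotype frequency `p_ij` is only directly observable when phase is known, which is why dropping phase on import would rule out this kind of QC.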
I guess the general concern is that it does not look good when the user's input is phased data yet we drop that information. I envisage that sometimes we'd use VAT just for data storage, not for any analysis; not storing phase would be a deal breaker.
Also, I'm about to suggest storing separate matrices for other genotype information anyway, so we may eventually want to use two matrices, one for each haplotype.
The DP design could be borrowed by VAT: namely, we can store the most used and most regular data, and retrieve other information only if needed. This could save disk space and potentially improve performance in the majority of cases. The same holds for QC, because what VAT gets (and the case Dr. Leal is interested in right now) can be data that has already been QCed.
I acquired some phased data today for my project:
This reminds me that we have not considered this possibility in our discussion, although we did talk about it years ago. How should this information be stored so that it can be used easily? I can imagine each sample would have 2 haplotype matrices instead of one genotype matrix?
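A rough sketch of what I have in mind, assuming a samples-by-variants layout with 0/1 allele codes. The names `hap_a`, `hap_b` and `to_genotypes` are hypothetical, not anything that exists in vtools today.

```python
# Hypothetical layout: rows are samples, columns are variants.
# 0 = reference allele, 1 = alternate allele on that haplotype.
hap_a = [
    [0, 1, 0],
    [1, 1, 0],
]
hap_b = [
    [0, 0, 1],
    [1, 0, 0],
]


def to_genotypes(hap_a, hap_b):
    """Collapse two haplotype matrices into unphased alt-allele counts (0, 1, 2)."""
    return [
        [a + b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(hap_a, hap_b)
    ]


# The usual genotype matrix can always be rebuilt from the two haplotype
# matrices, but not the other way around.
genotypes = to_genotypes(hap_a, hap_b)
```

With this layout, methods that ignore phase could keep working on the summed matrix, while phase-aware methods read the two haplotype matrices directly.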
I'm going to use a separate issue to summarize limitations of vtools storage in genotypes and see if we can address them all, one way or another.