vatlab / VarStore

High Efficiency genotype data storage library
http://vatlab.github.io/VarStore/

How should we store phase information #10

Open gaow opened 7 years ago

gaow commented 7 years ago

I acquired some phased data today for my project:

1       10565   1_10565_C_T_b37 C       T       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     .|.     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0  

This reminds me that we have not considered this possibility in our current discussion, although we did talk about it years ago. How should this information be stored so that it is easy to use? I can imagine each dataset having 2 haplotype matrices (one per phased allele of each sample) instead of one genotype matrix.
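To make the "two haplotype matrices" idea concrete, here is a minimal sketch in plain Python. It is not VarStore or vtools code: the function names `split_phased_genotypes` and `parse_vcf_record`, the `-1` missing code, and the shortened example record are all assumptions made for illustration. It only relies on the VCF convention that `|` separates phased alleles and `/` separates unphased ones.

```python
def split_phased_genotypes(gt_fields, missing=-1):
    """Split diploid GT strings such as '0|1' into two haplotype rows.

    Returns (hap1, hap2, phased). hap1/hap2 hold allele indices, with
    missing alleles ('.') encoded as `missing`. phased[i] is True when
    call i used the '|' separator.
    """
    hap1, hap2, phased = [], [], []
    for gt in gt_fields:
        is_phased = "|" in gt
        alleles = gt.split("|") if is_phased else gt.split("/")
        if len(alleles) != 2:
            raise ValueError(f"expected a diploid call, got {gt!r}")
        a, b = (missing if x == "." else int(x) for x in alleles)
        hap1.append(a)
        hap2.append(b)
        phased.append(is_phased)
    return hap1, hap2, phased


def parse_vcf_record(line):
    """Extract the two haplotype rows from one VCF data line."""
    fields = line.split()
    # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT,
    # followed by one column per sample.
    gt_index = fields[8].split(":").index("GT")
    gts = [sample.split(":")[gt_index] for sample in fields[9:]]
    return split_phased_genotypes(gts)


# Hypothetical record with five samples (shortened from the real data).
record = "1 10565 1_10565_C_T_b37 C T . PASS . GT 0|0 0|1 1|0 .|. 1|1"
hap1, hap2, phased = parse_vcf_record(record)

# For a biallelic site the usual genotype (alt allele count) is
# recoverable from the two haplotypes, so no information is lost.
genotype = [a + b if a >= 0 and b >= 0 else -1 for a, b in zip(hap1, hap2)]
```

Stacking `hap1` and `hap2` across variants gives the two matrices; the unphased genotype matrix can always be derived from them, but not the other way around.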

I'm going to use a separate issue to summarize limitations of vtools storage in genotypes and see if we can address them all, one way or another.

BoPeng commented 7 years ago

I think we first need to find a statistical method that uses phase information. Then we can add phase support while we add support for the method.

gaow commented 7 years ago

At least some methods for association with family data and for haplotype association will need phase info (though a lot of them do not!). But I was not even thinking that far yet. There is something more elementary that would use phase info. For example, advanced methods for calculating LD with shrinkage require the data to be phased, and this is a QC step. The input would just be a phased haplotype matrix, computed one way or another by a different program. To justify this QC: a good estimate of LD is crucial to fine-mapping with WGS data, compared to using just common variants.
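As a rough illustration of why phase matters here, the sketch below computes the plain (unshrunk) pairwise LD statistic r² directly from haplotypes. This is only the textbook estimator, not the shrinkage method I mentioned, and the function name `ld_r2` and the `-1` missing code are assumptions for this example.

```python
def ld_r2(hap_a, hap_b, missing=-1):
    """Plain r^2 between two biallelic variants from phased haplotypes.

    hap_a[i] and hap_b[i] are the alleles (0 or 1) carried by the same
    haplotype i at variants A and B. Haplotypes with a missing allele
    at either variant are skipped. Returns NaN if r^2 is undefined.
    """
    pairs = [(a, b) for a, b in zip(hap_a, hap_b)
             if a != missing and b != missing]
    n = len(pairs)
    if n == 0:
        return float("nan")
    p_a = sum(a for a, _ in pairs) / n           # alt frequency at A
    p_b = sum(b for _, b in pairs) / n           # alt frequency at B
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                         # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:                               # a monomorphic variant
        return float("nan")
    return d * d / denom


# Alleles always travel together on the same haplotype: complete LD.
r2_linked = ld_r2([1, 1, 0, 0], [1, 1, 0, 0])
# Alleles are independent across haplotypes: no LD.
r2_unlinked = ld_r2([1, 1, 0, 0], [1, 0, 1, 0])
```

The joint frequency `p_ab` is counted per haplotype, which is exactly what is lost if only unphased genotypes are stored: for a double heterozygote we could no longer tell which alleles sit on the same chromosome.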

I guess the general concern is that it does not look good when the user's input is phased data yet we drop that information. I envisage that sometimes we'd use VAT just for data storage, without any analysis; in that case, not storing phase would be a deal breaker.

Also, I'm about to suggest storing separate matrices for other genotype information anyway, so it is possible that we will eventually want to use 2 matrices, one for each haplotype.

BoPeng commented 7 years ago

The DP design could be borrowed by VAT: namely, we can store the most used and most regular data, and retrieve other information only if needed. This could save disk space and potentially improve performance in the majority of cases. The same holds for QC, because what VAT gets (and the case Dr. Leal is interested in right now) can be data that has already been QCed.