Hey Connor, this is related to solving the n+1 problem in genomics using Delta Lake.
Best practice is to explode pVCFs to one sample, one genotype per row: split multiallelics, left-normalize, and append everything into a Bronze Delta Lake table.
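A minimal PySpark sketch of that Bronze ingest, assuming Glow and Delta Lake are on the cluster (the paths and the reference FASTA here are placeholders, not from this thread):

```python
import glow
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = glow.register(SparkSession.builder.getOrCreate())

# Read a pVCF shard with Glow's VCF reader
df = spark.read.format("vcf").load("/data/cohort/*.vcf.gz")

# Split multiallelic sites, then left-normalize against the reference
split = glow.transform("split_multiallelics", df)
normalized = glow.transform(
    "normalize_variants",
    split,
    reference_genome_path="/data/ref/GRCh38.fa",
)

# Explode to one sample / one genotype per row
exploded = (
    normalized
    .withColumn("genotype", explode(col("genotypes")))
    .drop("genotypes")
)

# Append into the Bronze Delta table
exploded.write.format("delta").mode("append").save("/delta/bronze/genotypes")
```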
Do a groupBy to get a table of unique variants, and use this to convert the individual VCFs into "gVCFs", so that every position has a genotype call. Append the "gVCFs" into a Silver Delta table, partitioned on contigName and Z-ordered by start. This is probably the best way to store genotype data.
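A sketch of that Silver step, under some assumptions of mine: the variant key is (contigName, start, end, referenceAllele, alternateAlleles), missing calls get filled as [-1, -1], and `OPTIMIZE ... ZORDER` is available (Databricks or Delta Lake 2.0+):

```python
from pyspark.sql.functions import array, coalesce, col, lit

bronze = spark.read.format("delta").load("/delta/bronze/genotypes")
keys = ["contigName", "start", "end", "referenceAllele", "alternateAlleles"]

# Unique-variant table: one row per normalized, biallelic site
unique_variants = bronze.select(*keys).distinct()

# Flatten each per-sample call out of the genotype struct
calls = bronze.select(
    *keys,
    col("genotype.sampleId").alias("sampleId"),
    col("genotype.calls").alias("calls"),
)

# Densify to "gVCF"-like rows: every (variant, sample) pair gets a row,
# and pairs with no observed call become no-calls ([-1, -1] is my convention)
samples = calls.select("sampleId").distinct()
dense = (
    unique_variants.crossJoin(samples)
    .join(calls, keys + ["sampleId"], "left")
    .withColumn("calls", coalesce(col("calls"), array(lit(-1), lit(-1))))
)

# Append into the Silver table, partitioned by contig
(dense.write.format("delta")
    .mode("append")
    .partitionBy("contigName")
    .save("/delta/silver/gvcf"))

# Cluster data files by position for fast range queries
spark.sql("OPTIMIZE delta.`/delta/silver/gvcf` ZORDER BY (start)")
```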
Then, when doing analysis, groupBy variant + cohort and collect_list the genotypes!
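(Spark's name for that aggregate is `collect_list`.) A sketch against the Silver table above, where the cohort column is an assumption of mine:

```python
from pyspark.sql.functions import collect_list, struct

gvcf = spark.read.format("delta").load("/delta/silver/gvcf")

# Re-assemble one genotype array per variant per cohort
per_variant = (
    gvcf.groupBy("contigName", "start", "referenceAllele",
                 "alternateAlleles", "cohort")  # cohort column is assumed
    .agg(collect_list(struct("sampleId", "calls")).alias("genotypes"))
)
```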
I have a couple of example notebooks for simulating gVCFs that could be adapted, though they'll need some work.
What do you think?
Hi William,
Thanks for the response. That's what I figured; I wasn't sure if there was a better way to achieve this without exploding the genotypes field.
You're on the right track, though. When working interactively on continually growing large datasets, there has to be a way to do joins, update statistics, and filter down to something manageable for analytics. The old-school, hypothesis-free genome-wide approach does not work on population-scale data paired with real-world/clinical data!
It's not clear how to do this from the "Merging Variant Dataset" docs. Essentially, I have two sharded VCFs with different numbers of variants and samples, and I think the example shown assumes the same set of variants in each VCF.
What is the best way to ensure that a variant present in one VCF but not the other gets a genotype record for every sample, defaulting to NA/no-call for the samples that come from the other file?
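To make the question concrete, here is a sketch of what I mean (entirely illustrative; the paths, key columns, and [-1, -1] no-call fill are my assumptions): union the variant keys from both datasets, densify against the union of samples, and fill the gaps with no-calls.

```python
from pyspark.sql.functions import array, coalesce, col, explode, lit

keys = ["contigName", "start", "end", "referenceAllele", "alternateAlleles"]

def flatten(df):
    # One row per sample per site, keeping only the call
    return (df.select(*keys, explode("genotypes").alias("gt"))
              .select(*keys,
                      col("gt.sampleId").alias("sampleId"),
                      col("gt.calls").alias("calls")))

a = flatten(spark.read.format("vcf").load("/data/vcf_a/*.vcf.gz"))
b = flatten(spark.read.format("vcf").load("/data/vcf_b/*.vcf.gz"))

calls = a.unionByName(b)
all_variants = calls.select(*keys).distinct()
all_samples = calls.select("sampleId").distinct()

# Every sample gets a record at every variant; samples with no observed
# genotype at a site are filled with a no-call
merged = (
    all_variants.crossJoin(all_samples)
    .join(calls, keys + ["sampleId"], "left")
    .withColumn("calls", coalesce(col("calls"), array(lit(-1), lit(-1))))
)
```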