pkalbers / geva

Genealogical Estimation of Variant Age (GEVA)
MIT License
27 stars 0 forks source link

VCFs - chromosome scale or local subsets? #7

Open andreaswallberg opened 1 year ago

andreaswallberg commented 1 year ago

Dear @pkalbers,

Can you please advice on the structure of the input data?

Lets say I am interested in the age distribution of haplotypes in divergent loci observed across the genome, for example for genes that are encoded by two divergent haplotypes segregating in the population.

Should the VCF file contain data from the whole chromosome if possible (so as to possibly fully capture the decay of a haplotype), or is it enough to include only the approximate subsequence that mark the locus, for example the gene boundary coordinates?

Very happy for feedback here.

pkalbers commented 1 year ago

The input VCF should ideally capture variation along the whole chromosome, even if one is interested in estimating the age of a single derived variant, or variants located in a smaller region. This is because the HMM at the core of the method scans along the whole length of the chromosome to detect the most likely extent of the underlying pairwise shared haplotype (delimited by the nearest recombination breakpoints). If only a chunk of the genome is available to the method, it assumes that the "ends" of the data represent the ends of the chromosome, where the probability of recombination implicitly is infinite. While it is possible to simply use an input VCF that contains variation along a shorter stretch of the genome, the method may overestimate the TMRCA locally in pairs that share a more recent shared ancestor. This is particularly relevant if not the age of a variant but the timing of shared haplotype ancestry (i.e. through estimates of TMRCA per pairwise haplotype segments as identified by the HMM) is of interest.

andreaswallberg commented 1 year ago

Great answer. Many thanks!

In the case of a longer haplotype block, would you recommend to average estimates across multiple focal variants, or expect the method to return similar estimates for each linked variant in the block anyway?

Also, in terms of parameter misspecification, is the method very sensitive such that for example genome-average recombination rates should be avoided at all costs?

pkalbers commented 1 year ago

I guess you are referring to TMRCA estimates in your first question. If so, averaging should work well, as estimated for the TMRCA of the same pairwise shared haplotype segment should not vary much. In theory, estimates should be identical, but may vary here, because the HMM focuses on a given target position to then perform two independent runs to the left and right of that position, to detect the nearest recombination breakpoints; TMRCA is then estimated for the variation seen along the length of the detected segment. In that sense, variation in TMRCA estimates for a set of variants located on the same segment could also be used to identify outliers (I suspect these would follow a log-Normal but not sure).

Using a fixed (genome-average) recombination rate, instead of rates from a genetic map, should work well on average, but may be biased in specific cases where the haplotype is near known recombination hotspots.