tskit-dev / tsdate

Infer the age of ancestral nodes in a tree sequence.
MIT License

Reason for bias in young mutations #413

Closed: hyanwong closed this issue 2 months ago

hyanwong commented 3 months ago

The mutation.metadata['mn'] value, which is the mean of the gamma fit to the mutation time, seems to be an overestimate for low times (e.g. for singleton mutations, shown in orange in the LHS plot below; this is a vanilla msprime simulation). The same is true for mutation.time (the midpoint of the parent and child times, not shown in the plot below). It would be nice to explain in the docs why this happens. I assume it's to do with the fit of the gamma when the true time is very close to 0 (it doesn't happen with node times, RHS plot below).

[Figure: scatter of log10 inferred time vs log10 true time; LHS uses mutation times, RHS uses parent node times; singleton mutations in orange, others in blue]

Clearly the mutation at time 0.01 generations is a bit nuts as a real value, and presumably reflects the problem of using the mean of the gamma to represent a time distribution. Is this also the explanation for the general tendency of the orange points to sit above the y=x line on the RHS?
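As a toy illustration of that last point (made-up parameters, not tsdate output): the mean of a right-skewed gamma sits well above most of its mass, so summarising a near-zero mutation time by the posterior mean will look like an overestimate on a log scale even when the distribution itself puts plenty of mass near zero.

```python
# Toy example with assumed parameters, not taken from tsdate output.
import scipy.stats

post = scipy.stats.gamma(a=0.5, scale=2.0)  # a right-skewed gamma "posterior"
print("mean:           ", post.mean())      # 1.0
print("median:         ", post.median())    # ~0.45, well below the mean
print("P(time < 0.01): ", post.cdf(0.01))   # ~0.08, i.e. real mass near zero
```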

Code for plot:

```python
import msprime
import numpy as np
import scipy.stats
import stdpopsim
import tskit
import tsdate
from matplotlib import pyplot as plt

multiplier = 0.1  # Set to 1 to simulate the whole chromosome
species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("OutOfAfrica_3G09")
contig = species.get_contig(
    "chr20", mutation_rate=model.mutation_rate, length_multiplier=multiplier
)
samples = {"YRI": 20, "CHB": 20, "CEU": 20}
engine = stdpopsim.get_engine("msprime")
ts = engine.simulate(model, contig, samples, seed=123)
av_Ne = ts.diversity() / (4 * model.mutation_rate)
# Replace ts with a constant-Ne model
ts = msprime.sim_ancestry(
    60,
    population_size=av_Ne,
    sequence_length=contig.length,
    recombination_rate=contig.recombination_map.mean_rate,
    random_seed=123,
)
ts = msprime.sim_mutations(ts, rate=model.mutation_rate, random_seed=123)
dated_ts = tsdate.date(ts, mutation_rate=model.mutation_rate)


def get_times(ts, inferred_ts, use_parent_node=False, use_metadata_time=True):
    # Return x, y, is_singleton site times. Should also deal with e.g. relate tree seqs,
    # so is a little more complex than it could be (e.g. using trimmed_ts).
    site_id = 0
    inferred_site_id = 0
    x = np.full(inferred_ts.num_sites, np.nan)
    y = np.full(inferred_ts.num_sites, np.nan)
    is_singleton = np.zeros(inferred_ts.num_sites, dtype=bool)
    variant = tskit.Variant(ts)
    variant.decode(site_id)
    trimmed_ts = ts.keep_intervals([[0, inferred_ts.sequence_length]]).rtrim()
    use_site = np.isin(trimmed_ts.sites_position, inferred_ts.sites_position)
    for interval, orig_tree, inf_tree in trimmed_ts.coiterate(inferred_ts):
        while variant.site.position < interval.right:
            site = variant.site
            if use_site[site.id]:
                inferred_site = inferred_ts.site(inferred_site_id)
                if len(site.mutations) == 1 and len(inferred_site.mutations) == 1:
                    if use_parent_node:
                        x[inferred_site_id] = trimmed_ts.nodes_time[
                            orig_tree.parent(site.mutations[0].node)
                        ]
                        y[inferred_site_id] = inferred_ts.nodes_time[
                            inf_tree.parent(inferred_site.mutations[0].node)
                        ]
                    else:
                        x[inferred_site_id] = site.mutations[0].time
                        if use_metadata_time:
                            y[inferred_site_id] = inferred_site.mutations[0].metadata["mn"]
                        else:
                            y[inferred_site_id] = inferred_site.mutations[0].time
                    for allele, count in variant.counts().items():
                        if allele is not None and count == 1:
                            is_singleton[inferred_site_id] = True
                inferred_site_id += 1
            site_id += 1
            if site_id == ts.num_sites:
                assert inferred_site_id == inferred_ts.num_sites
                use = np.isfinite(x)
                x = x[use]
                y = y[use]
                is_singleton = is_singleton[use]
                return x, y, is_singleton
            variant.decode(site_id)


fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True, sharex=True)
axes[0].set_ylabel("log10 inferred time")
for ax, subtitle, use_parent in zip(
    axes, (" (mutation time)", " (parent node time)"), (False, True)
):
    x, y, is_singleton = get_times(ts, dated_ts, use_parent_node=use_parent)
    ax.scatter(
        np.log10(x),
        np.log10(y),
        alpha=0.1,
        c=np.where(is_singleton, "tab:orange", "tab:blue"),
    )
    ax.axline((0, 0), slope=1, c="k")
    ax.set_xlabel("log10 true time")
    ax.set_xlim(-2, 6)
    ax.text(-1, 5, f"rho: {scipy.stats.spearmanr(np.log10(x), np.log10(y)).statistic}")
    ax.text(-1, 4.75, f"r^2: {(np.corrcoef(np.log10(x), np.log10(y))[0, 1]) ** 2}")
    ax.text(-1, 4.5, f"bias: {np.mean(np.log10(y) - np.log10(x))}")
    ax.set_title("true topology w/ phased singletons" + subtitle)
```
nspope commented 3 months ago

Isn't this just a consequence of using the midpoint on a log scale? What happens if you plot the estimate vs the midpoint from the true trees?
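Something like the following could pull the branch midpoints out of the true trees for that comparison (a sketch, not part of the script above; the function name is made up):

```python
import numpy as np

def true_branch_midpoints(ts):
    # For each mutation in the (true) simulated tree sequence, return the midpoint
    # of the branch it sits on, i.e. the average of the child and parent node times.
    mid = np.full(ts.num_mutations, np.nan)
    for tree in ts.trees():
        for mut in tree.mutations():
            child_time = ts.nodes_time[mut.node]
            parent_time = ts.nodes_time[tree.parent(mut.node)]
            mid[mut.id] = (child_time + parent_time) / 2
    return mid
```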

Also, if you use hexbin, I think you'll see that the majority of mutations form a band around the expectation, but with an upwards skew (which is expected, and well-modeled by a gamma).
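For the hexbin version, something along these lines should work (a sketch assuming x and y are the arrays returned by get_times above, and reusing that script's imports):

```python
# Hexbin plot of inferred vs true mutation times on a log scale.
fig, ax = plt.subplots()
hb = ax.hexbin(np.log10(x), np.log10(y), gridsize=50, bins="log", mincnt=1)
ax.axline((0, 0), slope=1, c="k")
ax.set_xlabel("log10 true time")
ax.set_ylabel("log10 inferred time")
fig.colorbar(hb, ax=ax, label="log10(count)")
```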

nspope commented 3 months ago

Just to clarify, I think there's a point past which you can't date mutations -- there's no way to know where exactly they fall on a branch. A better way to measure accuracy in this case is via coverage -- does the mutation posterior cover the true value at an appropriate rate, given a particular quantile?

E.g. get the gamma parameters, use scipy.stats.gamma.ppf to get a confidence interval for quantile p, ask if a proportion p of true mutation times fall within their predicted quantiles.
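A minimal sketch of that coverage check, assuming arrays of the true mutation times plus the gamma means and variances from the mutation metadata (the 'mn' key is from the thread above; the variance field and the mean/variance-to-shape/rate conversion are assumptions here):

```python
import numpy as np
import scipy.stats

def interval_coverage(true_times, means, variances, p=0.9):
    # Convert each gamma's mean/variance to shape and rate (assumed parameterisation),
    # build a central interval at level p, and report the fraction of true times inside it.
    shape = means**2 / variances
    rate = means / variances
    lower = scipy.stats.gamma.ppf((1 - p) / 2, a=shape, scale=1 / rate)
    upper = scipy.stats.gamma.ppf((1 + p) / 2, a=shape, scale=1 / rate)
    return np.mean((true_times >= lower) & (true_times <= upper))
```

If the posteriors are well calibrated, interval_coverage(..., p=p) should come out close to p.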

hyanwong commented 3 months ago

> get the gamma parameters, use scipy.stats.gamma.ppf to get a confidence interval for quantile p, ask if a proportion p of true mutation times fall within their predicted quantiles.

@qianfeng2 may be interested in giving this a whirl. I also think the closely related sampling suggestion on Slack is useful:

> for singleton mutations, you can easily sample from something closer to the truth -- sample a parent node time, then sample a uniform from [0, parent node time]. A quick look at sampled uniform(0, gamma(...)) looks like it's subexponential or Pareto-like, so I'd think the gamma approximation would be OK. But it's easy enough to sample...
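A quick sketch of that sampling idea (the function name, parameter names, and the gamma fit to the parent node are assumptions, just to show the shape of the draw):

```python
import numpy as np

rng = np.random.default_rng(1234)

def sample_singleton_times(parent_shape, parent_rate, num_draws=10_000):
    # Sample a parent node time from its gamma fit, then place the singleton
    # mutation uniformly on the branch between time 0 and the parent time.
    parent_times = rng.gamma(shape=parent_shape, scale=1 / parent_rate, size=num_draws)
    return rng.uniform(0.0, parent_times)
```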