tskit-dev / tsdate

Infer the age of ancestral nodes in a tree sequence.
MIT License
19 stars 10 forks source link

Mismatch between site and branch diversity calculations #366

Closed hyanwong closed 7 months ago

hyanwong commented 7 months ago

When running tsdate on real data, we observe a major mismatch between the "normal" sitewise genetic diversity, and the branch length diversity. Normally, the branch diversity should be more-or-less equal to the (sitewise diversity divided by the mutation rate), but often it is much lower. For example in GEL it is 3x lower, and in UKB it appears, using the vanilla tsdate variational_gamma method, to be 45x lower. This manifests itself in a much shorter than expected timescale, with MRCAs in the thousands rather than tens of thousands of generations.

jeromekelleher commented 7 months ago

What do we see on published unified genealogy trees?

hyanwong commented 7 months ago

What do we see on published unified genealogy trees?

Good question. The unified genealogies are run with mismatch, so may have many mutations per site, which will make the comparison somewhat tricky, but I get a branch level diversity of ~65% of the site diversity:

import tqdm
import tskit
import tszip
import urllib.request

class DownloadProgressBar(tqdm.tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)

url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"

with DownloadProgressBar(unit='B', unit_scale=True, miniters=1, desc=url.split('/')[-1]) as t:
    temporary_filename, _ = urllib.request.urlretrieve(url, reporthook=t.update_to)
    ts_2q = tszip.decompress(temporary_filename)
    urllib.request.urlcleanup() # Remove temporary_filename

ts = ts_2q.trim()
print("Site diversity / mu:", ts.diversity() / 1e-8, ".   Branch diversity:", ts.diversity(mode="branch"))
hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz: 138MB [00:32, 4.29MB/s]                              
Site diversity / mu: 45883.391488580746 .   Branch diversity: 29829.380183333305
hyanwong commented 7 months ago

Running split_disjoint_nodes on the UKB tree sequence before dating reduces the site vs branch difference a little, (from 45x to 32x), but it's still waaay out.

nspope commented 7 months ago

It'd be worth seeing if there's a difference on true topologies, if we collapse 0-mutation edges into polytomies. E.g. where the node age and topologies are correct, but where a subset of nodes has been removed.

hyanwong commented 7 months ago

I think this can be closed in favour of #370