Closed hyanwong closed 7 months ago
What do we see on published unified genealogy trees?
What do we see on published unified genealogy trees?
Good question. The unified genealogies are run with mismatch, so may have many mutations per site, which will make the comparison somewhat tricky, but I get a branch level diversity of ~65% of the site diversity:
import tqdm
import tskit
import tszip
import urllib.request
class DownloadProgressBar(tqdm.tqdm):
def update_to(self, b=1, bsize=1, tsize=None):
if tsize is not None:
self.total = tsize
self.update(b * bsize - self.n)
url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
with DownloadProgressBar(unit='B', unit_scale=True, miniters=1, desc=url.split('/')[-1]) as t:
temporary_filename, _ = urllib.request.urlretrieve(url, reporthook=t.update_to)
ts_2q = tszip.decompress(temporary_filename)
urllib.request.urlcleanup() # Remove temporary_filename
ts = ts_2q.trim()
print("Site diversity / mu:", ts.diversity() / 1e-8, ". Branch diversity:", ts.diversity(mode="branch"))
hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz: 138MB [00:32, 4.29MB/s]
Site diversity / mu: 45883.391488580746 . Branch diversity: 29829.380183333305
Running split_disjoint_nodes
on the UKB tree sequence before dating reduces the site vs branch difference a little, (from 45x to 32x), but it's still waaay out.
It'd be worth seeing if there's a difference on true topologies, if we collapse 0-mutation edges into polytomies. E.g. where the node age and topologies are correct, but where a subset of nodes has been removed.
I think this can be closed in favour of #370
When running tsdate on real data, we observe a major mismatch between the "normal" sitewise genetic diversity, and the branch length diversity. Normally, the branch diversity should be more-or-less equal to the (sitewise diversity divided by the mutation rate), but often it is much lower. For example in GEL it is 3x lower, and in UKB it appears, using the vanilla tsdate variational_gamma method, to be 45x lower. This manifests itself in a much shorter than expected timescale, with MRCAs in the thousands rather than tens of thousands of generations.