Hi Leo, this should be good to go. Everything is done in C++ now. The strategy for constraining node times is to iterate the least squares solver until either the maximum number of iterations (`--iterations XXX`) or the convergence tolerance (`--tolerance XXX`) is reached. Each iteration requires a pass over edges. The least squares step can be skipped entirely by setting `--iterations` to 0. In practice, a hybrid approach works well (set iterations to 100 or something that won't take too long).

The unconstrained node times are stored in the node metadata and can be recovered with:

```python
import array
import numpy as np
import tskit

ts = tskit.load("my.compressed.trees")
unconstrained_node_times = np.array(array.array('d', ts.tables.nodes.metadata.tobytes()))
# compare with constrained node times, ts.tables.nodes.time
```
On the whole, this strategy seems to work pretty well. Below are constrained vs unconstrained times for a large empirical tree sequence (2940 samples, 523421 trees) inferred with Relate: [left] least squares optimized (but terminated after 100 iterations); [right] not optimized (node age biased upwards to force the constraint):
In this example, the compressed tree sequence has ~300,000,000 nodes, as opposed to ~1,500,000,000 nodes in the uncompressed tree sequence. The storage footprint is a bit less than the compressed *anc.gz (around ~30 Gb, down from ~40 Gb).
I think it might be worth using a schema for the metadata? I'm not sure how you do that in C, however?
Good point, thanks @hyanwong. Looks like there's a way to set it in C, guess I could construct in python and copy into the header.
Ben Jeffery is the metadata guru. He will have some tips, I'm sure. If you can use the struct codec, it will produce a much smaller tree sequence file.
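If it helps, here is a rough Python sketch of what a struct-codec schema for a single double might look like (the property name `unconstrained_time` is just illustrative); the resulting JSON string is the text that would be set as the node table's metadata schema on the C/C++ side:

```python
import json
import tskit

# Illustrative struct-codec schema: one double per node holding the
# unconstrained node time. The struct codec stores just the raw 8 bytes,
# so there is no overhead relative to dumping the doubles directly.
schema_dict = {
    "codec": "struct",
    "type": "object",
    "properties": {
        "unconstrained_time": {"type": "number", "binaryFormat": "d"},
    },
}
schema = tskit.MetadataSchema(schema_dict)

# JSON text that could be embedded in the C++ code and set on the node table.
print(json.dumps(schema_dict))

# Round-trip check in Python.
encoded = schema.validate_and_encode_row({"unconstrained_time": 123.4})
assert schema.decode_row(encoded)["unconstrained_time"] == 123.4
```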
By the way, this would be fantastic to get into Relate, so thanks @nspope. If you want a "quick and dirty" method, you could also just allocate times arbitrarily (e.g. in ascending order as you traverse the graph) and set the tree sequence `time_units` to "uncalibrated". That would only be good for people interested in the topology, of course (although you could later run tsdate on the tree sequence, which we have previously found to give reasonable results).
Also by the way, what are you doing with the oldest node in the trees, @nspope? By the "samples underneath" argument, all the trees will have the same grand MRCA, but this can lead to unwanted behaviour e.g. when re-dating. You may wish to do something similar to the (new) `tsinfer.post_process` routine, and change to new root nodes whenever an edge leading to the root changes in the tree sequence.
Ah yes @hyanwong, I used a heuristic to split the grand TMRCA -- set a new root whenever the node IDs of its two children change. It sounds like this is similar to what tsinfer.post_process is doing?
Great. That's equivalent to what we now do in tsinfer, I think (only we don't necessarily have bifurcations, so it's a little more complicated).
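For concreteness, here is a rough Python/tskit sketch of the root-splitting heuristic described above (just the idea, assuming a single binary root in each marginal tree; the actual implementation is in C++ and may differ in detail):

```python
import tskit

def root_split_positions(ts):
    """Return positions where a 'new' grand MRCA would be started, i.e.
    where the pair of node IDs under the root changes between trees."""
    positions = []
    previous = None
    for tree in ts.trees():
        children = tuple(sorted(tree.children(tree.root)))
        if previous is not None and children != previous:
            positions.append(tree.interval.left)
        previous = children
    return positions
```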
> you could also just allocate times arbitrarily (e.g. in ascending order as you traverse the graph) and set the tree sequence `time_units` to "uncalibrated".
Makes sense! This is how the tables are originally constructed, so that sorting will get edges in the "right order" for the node age constraint estimation. I'm not sure if there's an advantage to having this as an output option, as opposed to just "brute-forcing" node ages backwards in time until the constraint is satisfied (e.g. disabling the constrained least squares part of the algo), as doing this requires only a single pass over the edge table.
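A minimal sketch of that brute-force pass (assuming edges are visited in sorted order, so that all edges where a node is the parent come before any edge where it is the child, and adjustments propagate rootwards):

```python
import numpy as np

def force_valid_times(times, edges_parent, edges_child, eps=1e-9):
    """Push each parent strictly above its oldest child in a single pass
    over edges sorted by parent time."""
    constrained = np.array(times, dtype=float)
    for p, c in zip(edges_parent, edges_child):
        if constrained[p] <= constrained[c]:
            constrained[p] = constrained[c] + eps
    return constrained
```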
> I'm not sure if there's an advantage to having this as an output option, as opposed to just "brute-forcing" node ages backwards in time until the constraint is satisfied (e.g. disabling the constrained least squares part of the algo), as doing this requires only a single pass over the edge table.
If you can do it that way, then sure. I guess that's equivalent anyway (presumably the way you sort the edges is equivalent to traversing through the graph). So perhaps I would recommend setting `ts.time_units` to "uncalibrated" (or `tskit.TIME_UNITS_UNCALIBRATED`) if the least squares step is disabled (presumably that will mean that the times are pretty wacky).
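On the tskit side that is a one-liner on the table collection (a minimal sketch; `my.trees` is a placeholder file name):

```python
import tskit

# Sketch: mark node times as uncalibrated when the least squares step is
# skipped, so downstream tools know branch lengths are not in real units.
ts = tskit.load("my.trees")
tables = ts.dump_tables()
tables.time_units = tskit.TIME_UNITS_UNCALIBRATED
ts = tables.tree_sequence()
```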
> The storage footprint is a bit less than the compressed *anc.gz (around ~30 Gb, down from ~40 Gb).
How much of this is due to the times stored in the metadata? I suspect it might be quite a lot. Look at

```python
md = ts.tables.nodes.metadata
print(md.size * md.itemsize)
```

If that's quite a lot, it would be worth looking at the `struct` possibility.
in this case, ~300,000,000 doubles -- so approx 2.5G -- not huge but not trivial. May just be easiest to add a flag to drop the metadata.
Oh, I see. You are dumping the doubles as raw bytes. I thought you might be JSONing them. So yes, that's as good as you are going to get unless you drop the metadata entirely. I wouldn't bother doing that for 2.5GB / 40 GB - the extra storage isn't actually slowing you down here, I think.
Also (I thought I mentioned this, but can't see it), I wonder if you actually do marginally better estimating node time from a (neutrally simulated) genealogy when you do least squares constraining versus inferring independent node times in each tree? That might be a fun thing to test.
> I wonder if you actually do marginally better estimating node time from a (neutrally simulated) genealogy when you do least squares constraining versus inferring independent node times in each tree?
Yeah, I was wondering this myself, as there'd be some information sharing across non-neighboring nodes induced by the constraint.
Hm, I think something may be off with the coordinate system in `Convert`. I should be able to:

1. use `bedtools` or some other coordinate-format-aware tool to mask a VCF (1-based) with a BED file (0-based);
2. use `TreeSequence.delete_intervals` with the intervals from the original BED to mask the tree sequence, without any risk of removing sites -- e.g. so as to correctly calculate tree span when computing downstream tree statistics (see the sketch below).

However, when I do this, sites that border masked intervals are removed. I think what is going on is that the SNP positions, tree breaks, etc. in Relate's internal format use the same coordinate system as the input haplotypes (which needs to be 1-based, I think, for masking to work correctly?), whereas tskit uses a 0-based coordinate system.
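For reference, the tskit side of step 2 would be something like this (hypothetical file names; BED intervals are 0-based and half-open, which matches tskit's coordinate convention, so they can be passed through directly):

```python
import numpy as np
import tskit

ts = tskit.load("relate_output.trees")          # hypothetical file name
mask = np.loadtxt("mask.bed", usecols=(1, 2))   # start/end columns of the BED
ts_masked = ts.delete_intervals(mask, simplify=False)
```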
I think it would be good to document what this does. Specifically, that it uses exact matching of samples under a node to match between trees, and that it does the ultimate-ancestor-node-splitting thing (identical to step 2 described here: https://tskit.dev/tsinfer/docs/stable/api.html#tsinfer.post_process)
I'll add some details to the docstring in a followup, that could be copied over to the blurb in the Relate documentation if it seems necessary.
Hi @leospeidel,
In case it's useful -- here is a way to dump a compressed tree sequence with equivalent nodes collapsed across marginal trees, with node ages constrained so that branch lengths are positive, and with unconstrained node ages stored in node metadata. I used the existing `include/src/tree_sequence.hpp::DumpTreeSequencesTopoOnly` with some modifications (a few things needed to be tweaked to generate a valid tree sequence with mutations, etc). The only change to the existing API was to add the `--compress` flag to `Convert`. The workflow looks like,
The "constrained node ages" are necessary because just averaging node ages across marginal trees will generally not produce a valid tree sequence (some branch lengths will be negative). So, I find the set of constrained node ages that are closest to the unconstrained node ages, with respect to least squared error. This is done via quadratic programming (w/ an efficient fixed-point algorithm based on the method of alternating projections). On the example data this looks like:
Finally, the hope is that tree shape wouldn't get too screwed up by the constraints. A way to assess this is to dump the uncompressed tree sequences (e.g. all marginal trees with nodes stored separately), and calculate some branch/TMRCA statistics across windows/points in both uncompressed and compressed+constrained output (using tskit). That looks like,
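(As a rough sketch of that kind of check, with hypothetical file names: compute windowed branch-mode diversity with tskit in both versions and compare.)

```python
import numpy as np
import tskit

# Hypothetical file names for the uncompressed and compressed outputs.
ts_full = tskit.load("uncompressed.trees")
ts_comp = tskit.load("compressed.trees")

# Both outputs span the same genome interval, so use shared windows.
windows = np.linspace(0, ts_full.sequence_length, 101)

# Branch-mode diversity depends only on branch lengths, so it directly
# measures how much the age constraints distort the trees.
pi_full = ts_full.diversity(mode="branch", windows=windows)
pi_comp = ts_comp.diversity(mode="branch", windows=windows)
print(np.corrcoef(pi_full, pi_comp)[0, 1])
```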
This looks like a reasonable approximation (at least for this example), takes up substantially less space, and should allow quicker stats computations from tskit.