leospeidel / relate_lib

C++ library containing functions for parsing/dumping Relate output files
MIT License

Method for creating compressed tree sequence output, adding constrained node ages #4

Closed · nspope closed this 1 year ago

nspope commented 2 years ago

Hi @leospeidel,

in case it's useful -- here is a way to dump a compressed tree sequence with equivalent nodes collapsed across marginal trees, with node ages constrained so that branch lengths are positive, and with unconstrained node age stored in node metadata. I used the existing include/src/tree_sequence.hpp::DumpTreeSequencesTopoOnly with some modifications (a few things needed to be tweaked to generate a valid tree sequence with mutations, etc). The only change to the existing API was to add the --compress flag to Convert.

The workflow looks like,

# Output compressed tree sequence, with averaged node ages in metadata
$RELATE_LIB/bin/Convert --mode ConvertToTreeSequence \
  --anc $RELATE_LIB/example/example.anc.gz \
  --mut $RELATE_LIB/example/example.mut.gz \
  --compress -o example.compressed.constrained

The "constrained node ages" are necessary because just averaging node ages across marginal trees will generally not produce a valid tree sequence (some branch lengths will be negative). So, I find the set of constrained node ages that are closest to the unconstrained node ages, with respect to least squared error. This is done via quadratic programming (w/ an efficient fixed-point algorithm based on the method of alternating projections). On the example data this looks like:

Finally, the hope is that tree shape doesn't get too distorted by the constraints. A way to assess this is to dump the uncompressed tree sequence (i.e. all marginal trees with nodes stored separately) and calculate some branch/TMRCA statistics across windows/points in both the uncompressed and the compressed+constrained output (using tskit). That looks like: [figure: windowed branch statistics, uncompressed vs. compressed+constrained]
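A quick comparison along these lines could look roughly like the following; the file names are hypothetical, and branch-mode diversity stands in for whatever statistic is of interest:

import numpy as np
import tskit

# Hypothetical file names for the two dumps being compared
# (assumes both have the same sequence length).
ts_full = tskit.load("example.uncompressed.trees")
ts_comp = tskit.load("example.compressed.constrained.trees")

# Branch-mode diversity in windows along the genome, as a stand-in for
# whatever branch/TMRCA statistic is of interest.
windows = np.linspace(0, ts_full.sequence_length, 21)
div_full = ts_full.diversity(mode="branch", windows=windows)
div_comp = ts_comp.diversity(mode="branch", windows=windows)
print(np.corrcoef(div_full, div_comp)[0, 1])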

This looks like a reasonable approximation (at least for this example), takes up substantially less space, and should allow quicker stats computations from tskit.

nspope commented 1 year ago

Hi Leo, this should be good to go. Everything is done in C++ now. The strategy for constraining node times is:

  1. Use a fixed-point algorithm to get a constrained least-squares solution, iterating until --iterations XXX or the convergence tolerance --tolerance XXX is reached. Each iteration requires a pass over the edges.
  2. This gets close to the solution quite quickly, but hitting the constraint exactly can be slow for huge tree sequences. So once the algorithm terminates, the constraint is forced by biasing node ages backwards in time as necessary (the same strategy is used in tsdate). The least squares optimization can be skipped completely by setting --iterations to 0. In practice, a hybrid approach works well (set iterations to 100 or something that won't take too long).
  3. The unconstrained node ages are stored in the node metadata. These could be accessed via,
    import array
    import numpy as np
    import tskit
    ts = tskit.load("my.compressed.trees")
    # the metadata column is raw doubles, one unconstrained age per node
    unconstrained_node_times = np.array(array.array('d', ts.tables.nodes.metadata.tobytes()))
    # compare with constrained node times, ts.tables.nodes.time

    On the whole, this strategy seems to work pretty well. Below are constrained vs unconstrained times for a large empirical tree sequence (2940 samples, 523421 trees) inferred with Relate, for [left] least squares optimized (but terminated after 100 iterations); [right] not optimized (node age biased upwards to force constraint): [figures omitted]

In this example, the compressed tree sequence has ~300,000,000 nodes, as opposed to ~1,500,000,000 nodes in the uncompressed tree sequence. The storage footprint is a bit less than the compressed *anc.gz (around ~30 GB, down from ~40 GB).

hyanwong commented 1 year ago

I think it might be worth using a schema for the metadata? I'm not sure how you do that in C, however?

nspope commented 1 year ago

Good point, thanks @hyanwong. Looks like there's a way to set it in C; I guess I could construct the schema in Python and copy it into the header.

hyanwong commented 1 year ago

Ben Jeffery is the metadata guru. He will have some tips, I'm sure. If you can use the struct codec, it will produce a much smaller tree sequence file.
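For instance, a struct-codec schema for one raw double per node might look roughly like this, built and validated in Python; the property name and workflow are illustrative assumptions, not what relate_lib actually writes:

import json
import tskit

# Illustrative only: a struct-codec schema describing one raw double per node;
# the property name is an assumption, not relate_lib's.
schema_dict = {
    "codec": "struct",
    "type": "object",
    "properties": {"unconstrained_time": {"type": "number", "binaryFormat": "d"}},
    "additionalProperties": False,
}
schema = tskit.MetadataSchema(schema_dict)  # raises if the schema is invalid
# JSON text that could be embedded on the C++ side (I believe the C API has a
# per-table setter, tsk_node_table_set_metadata_schema, for this).
print(json.dumps(schema_dict))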

hyanwong commented 1 year ago

By the way, this would be fantastic to get into Relate, so thanks @nspope. If you want a "quick and dirty" method, you could also just allocate times arbitrarily (e.g. in ascending order as you traverse the graph) and set the tree sequence time_units to "uncalibrated". That would only be good for people interested in the topology, of course (although you could later run tsdate on the tree sequence, which we have previously found to give reasonable results).

hyanwong commented 1 year ago

Also by the way, what are you doing with the oldest node in the trees, @nspope ? By the "samples underneath" argument, all the trees will have the same grand MRCA, but this can lead to unwanted behaviour e.g. when re-dating. You may wish to do something similar to the (new) tsinfer.post_process routine, and change to new root nodes whenever an edge leading to the root changes in the tree sequence.

nspope commented 1 year ago

Ah yes @hyanwong, I used a heuristic to split the grand TMRCA -- set a new root whenever the node IDs of its two children change. It sounds like this is similar to what tsinfer.post_process is doing?
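As an aside, the detection part of that heuristic is easy to express with tskit after the fact; a rough sketch on a hypothetical output file (the C++ code does this while building the tables):

import tskit

# Walk the marginal trees and note where the root's pair of children changes;
# each change is where a new root node would be created.
# Assumes each tree has a single root; file name is hypothetical.
ts = tskit.load("example.compressed.constrained.trees")
breaks, prev = [], None
for tree in ts.trees():
    kids = tuple(sorted(tree.children(tree.root)))
    if prev is not None and kids != prev:
        breaks.append(tree.interval.left)
    prev = kids
print(len(breaks), "root transitions")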

hyanwong commented 1 year ago

Ah yes @hyanwong, I used a heuristic to split the grand TMRCA -- set a new root whenever the node IDs of its two children change. It sounds like this is similar to what tsinfer.post_process is doing?

Great. That's equivalent to what we now do in tsinfer, I think (only we don't necessarily have bifurcations, so it's a little more complicated)

nspope commented 1 year ago

you could also just allocate times arbitrarily (e.g. in ascending order as you traverse the graph) and set the tree sequence time_units to "uncalibrated".

Makes sense! This is how the tables are originally constructed, so that sorting will get edges in the "right order" for the node age constraint estimation. I'm not sure if there's an advantage to having this as an output option, as opposed to just "brute-forcing" node ages backwards in time until the constraint is satisfied (e.g. disabling the constrained least squares part of the algo), as doing this requires only a single pass over the edge table.

hyanwong commented 1 year ago

I'm not sure if there's an advantage to having this as an output option, as opposed to just "brute-forcing" node ages backwards in time until the constraint is satisfied (e.g. disabling the constrained least squares part of the algo), as doing this requires only a single pass over the edge table.

If you can do it that way, then sure. I guess that's equivalent anyway (presumably the way you sort the edges is equivalent to traversing through the graph). So perhaps I would recommend setting ts.time_units to "uncalibrated" (or tskit.TIME_UNITS_UNCALIBRATED) if the least squares step is disabled (presumably that will mean that the times are pretty wacky).
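A minimal sketch of that single pass plus the time_units change, assuming edges are visited in an order where every child's age is final before that node appears as a parent; names and the file handling are illustrative, not the PR's code:

import tskit

def force_positive_branches(ages, edges_child, edges_parent, eps=1e-6):
    # Push parents backwards in time until every branch length is positive;
    # assumes each node's children are finalised before it appears as a parent.
    t = ages.astype(float)
    for c, p in zip(edges_child, edges_parent):
        t[p] = max(t[p], t[c] + eps)
    return t

# Flag the times as uncalibrated on an existing tree sequence (hypothetical file).
tables = tskit.load("example.trees").dump_tables()
tables.time_units = tskit.TIME_UNITS_UNCALIBRATED
ts = tables.tree_sequence()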

hyanwong commented 1 year ago

The storage footprint is a bit less than the compressed *anc.gz (around ~30 GB, down from ~40 GB).

How much of this is due to the times stored in the metadata? I suspect it might be quite a lot. Look at

md = ts.tables.nodes.metadata  # flat array of raw metadata bytes (int8)
print(md.size * md.itemsize)   # total bytes of node metadata

If that's quite a lot, it would be worth looking at the struct possibility.

nspope commented 1 year ago

In this case, ~300,000,000 doubles -- so approx 2.5 GB -- not huge but not trivial. May just be easiest to add a flag to drop the metadata.
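For the record, the back-of-the-envelope arithmetic behind that figure:

# one 8-byte double of metadata per node
print(300_000_000 * 8 / 1e9)  # ~2.4 GB of raw node metadata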

hyanwong commented 1 year ago

In this case, ~300,000,000 doubles -- so approx 2.5 GB -- not huge but not trivial. May just be easiest to add a flag to drop the metadata.

Oh, I see. You are dumping the doubles as raw bytes. I thought you might be JSONing them. So yes, that's as good as you are going to get unless you drop the metadata entirely. I wouldn't bother doing that for 2.5 GB / 40 GB - the extra storage isn't actually slowing you down here, I think.

hyanwong commented 1 year ago

Also (I thought I mentioned this, but can't see it), I wonder if you actually do marginally better estimating node time from a (neutrally simulated) genealogy when you do least squares constraining versus inferring independent node times in each tree? That might be a fun thing to test.

nspope commented 1 year ago

I wonder if you actually do marginally better estimating node time from a (neutrally simulated) genealogy when you do least squares constraining versus inferring independent node times in each tree?

Yeah, I was wondering this myself, as there'd be some information sharing across non-neighboring nodes induced by the constraint.

nspope commented 1 year ago

Hm, I think something may be off with the coordinate system in Convert. I should be able to:

  1. Use bedtools or some other coordinate-format-aware tool to mask a VCF (1-based) with a BED file (0-based)
  2. Convert the masked VCF to SHAPEIT2 "haps" format
  3. Infer marginal trees using Relate and dump these into a tree sequence
  4. Use TreeSequence.delete_intervals with the intervals from the original BED to mask the tree sequence, without any risk of removing sites -- e.g. so as to correctly calculate tree span when computing downstream tree statistics.

However, when I do this, sites that border masked intervals are removed. I think what is going on is that the SNP positions, tree breaks, etc. in Relate's internal format use the same coordinate system as the input haplotypes (which needs to be 1-based, I think, for masking to work correctly?), whereas tskit uses a 0-based coordinate system.
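For reference, the intended step 4 would be along these lines; the file names are hypothetical, and BED intervals are 0-based half-open, which is also what tskit uses, so in principle they can be passed straight through:

import numpy as np
import tskit

ts = tskit.load("example.compressed.constrained.trees")
# columns 2 and 3 of the BED (start, end); intervals must be sorted, disjoint,
# and fall within [0, sequence_length)
mask = np.loadtxt("mask.bed", usecols=(1, 2))
masked = ts.delete_intervals(mask, simplify=False)
# since the VCF was masked with the same BED, no sites should be lost here
print(ts.num_sites - masked.num_sites, "sites removed")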

hyanwong commented 1 year ago

I think it would be good to document what this does. Specifically, that it uses exact matching of samples under a node to match between trees, and that it does the ultimate-ancestor-node-splitting thing (identical to step 2 described here: https://tskit.dev/tsinfer/docs/stable/api.html#tsinfer.post_process)

nspope commented 1 year ago

I'll add some details to the docstring in a follow-up, which could be copied over to the blurb in the Relate documentation if it seems necessary.