nspope opened this issue 6 months ago
I think the big API question is whether we want singletons to be present in the input tree sequence or not. If the former, they could be rephased (e.g. the output might swap node labels for some mutations). If the latter, the question is how to pass them into tsdate and how to return them (e.g. should they just be used for dating? Should they be added to the mutation table?)
In general we'll have to assume singletons from a particular individual are all phased or all unphased.
I don't think the standard tsdate should mess with anything other than the node and mutation times (and metadata). There's a cleanliness to making sure that the dating algorithm doesn't mess with the number of rows in the tables.
But if a pair of same-length arrays (`individual_ids`, `singleton_position`) is passed in, it would, I guess, be possible to pass out an array of node IDs (and times?) of the same length, specifying which node is most likely for the singleton at each position (or, I guess, you could pass out the probability of the singleton being on the first vs the second node).
You could then have a separate `tsdate.util.XXX` function to add these singletons to the resulting tree sequence.
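To make the shapes concrete, here is a purely illustrative sketch of the in/out arrays being discussed -- none of these names exist in tsdate, and the body is a placeholder rather than real inference:

```python
import numpy as np
import tskit

def place_singletons(ts: tskit.TreeSequence, individual_ids, singleton_position):
    """Hypothetical: for each singleton, return the ID of the most likely
    carrier node and the probability it lies on the individual's first
    haplotype. The body below is a placeholder, not a real inference."""
    nodes = np.array([ts.individual(i).nodes for i in individual_ids])
    map_node = nodes[:, 0]  # placeholder: always pick the first haplotype
    prob_first = np.full(len(individual_ids), 0.5)  # placeholder: uniform phase
    return map_node, prob_first
```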
Hmm ... so this wouldn't be usable from the command-line, then? It seems clunkier to me to have a series of functions that have to be called in a particular order rather than a single invocation with a flag such as `phase_singletons`, for example.
I also feel this way about `split_disjoint_nodes`, which also modifies tables, in that it seems to universally improve the performance of the dating algorithm. So it should really be done by default, not through a separate routine that will (probably) be skipped by many users.
> this wouldn't be usable from the command-line, then
That's the same for input, right? I think you would input and output the data from/to files or from/to stdin/stdout? But maybe Jerome has a better idea?
> it should really be done by default, not through a separate routine
There is a "tsdate.pre_process()" routine, which I think is how we should incorporate split_disjoint_nodes
. We should recommend that this is done before dating.
> this wouldn't be usable from the command-line, then
Just to add, you need a way to input the data that is independent of the tree sequence (which assumes phased data), so incorporating singletons via the CLI will always involve messing with extra files. It doesn't seem too bad to me to output extra files in this case too.
Oh, and one last thought: I suppose that you could have a wrapper function that called all this stuff in the right order. So `tsdate.date()` is guaranteed not to change the topology, just the times, but e.g. `tsdate.date_with_pre_and_post_processing(...)` wraps all the commonly used steps together. We'd need to think of a better name, of course.
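A minimal sketch of the wrapper idea -- `tsdate.preprocess_ts` and `tsdate.date` are real, but `rephase_singletons` is a hypothetical name for the post-processing utility discussed above:

```python
import tsdate

def date_with_pre_and_post_processing(ts, **date_kwargs):
    # Pre-process: simplify, trim flanks, etc. (a real tsdate routine)
    ts = tsdate.preprocess_ts(ts)
    # Date: only node/mutation times (and metadata) change, never the row counts
    dated = tsdate.date(ts, **date_kwargs)
    # Post-process: adjust singleton phase using the inferred node times
    # (hypothetical utility, not part of tsdate)
    # dated = tsdate.util.rephase_singletons(dated)
    return dated
```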
I seem to be unusually fixated on this particular API detail, so I'll work on the lower-level stuff for the moment and we can revisit once we've got an idea of how all this works in practice. But:
> Just to add, you need a way to input the data that is independent of the tree sequence (which assumes phased data)
I think this is where I'm hung up -- you don't need to supply additional data if the singletons are already in the TS. If they're mapped to an edge, then they're implicitly phased. The phasing could be wrong, of course, so it's worth having a routine to correct the mapping. This seems much cleaner to me than having multiple input/output streams and utility functions, which are confusing to use and maintain.
Maybe a compromise here is to assume that only the mutations in the TS are going to be used for dating. So, singletons are included in the TS, regardless of whether they're well phased or randomly assigned to a branch, but there's a flag to `date` which makes the dating algorithm agnostic to the phase of singleton branches. However, the dating algorithm doesn't modify the tables -- there would be a separate utility that could adjust the phase of singletons based on the node times.
The reason I don't like a separate input/output stream for singleton mutations is that, in general, we won't be able to support a mix of phased and unphased mutations in the same individual with a principled algorithm like the one above. What do you do if the input TS was subset down from tsinfer output, so that there are singletons already in there? Throw an error, in which case we'd need another utility function to strip singleton mutations? Silently assume that they're unphased and lump them in with the additional input stream of singleton mutations? IMO it's nicer to assume the data is wrapped in a single input, and have it so that additional arguments modify the behavior of the algorithm rather than augment the input.
> I suppose that you could have a wrapper function that called all this stuff in the right order
This is a nice idea!
> If they're mapped to an edge, then they're implicitly phased.
I guess this is the main problem I have. I'm not convinced we should be adding unphased things to the tree sequence by forcing a random phasing. Or at least, we would need to think hard about how unphased data should be represented in a tree sequence. I think @jeromekelleher and/or @petrelharp should comment here though. There might be a way to put a mutation in a table without an associated node (perhaps `tskit.NULL` in the node field?) so that it doesn't actually change the genotypes. I think this is a much bigger issue though.
How about we make the internal functions first, and decide about the external API later? The internal function would be based on numpy arrays as input and output, taking the input tree sequence as immutable. So, write the absolute minimal core piece of functionality in a way that is convenient and non-destructive, with tests, and then we can talk about the external API more fruitfully, I think.
I agree with @nspope that it should be easy to do the right things here from a CLI invocation.
That seems like an excellent plan @jeromekelleher
Finally got this working... first, taking a look at ages of singleton mutations. This is the usual equilibrium population with human-ish parameters. We have to use the mutation posterior in this case. I'm comparing against the branch midpoint on the true trees, to avoid randomness that we can't hope to infer.
Correct vs phase-agnostic, with true topologies:
and with inferred topologies:
Looks pretty much identical!
For a baseline, here's the original algorithm but with a random phase assigned to singletons:
True trees:
Inferred trees:
Wow. This is very neat, and convincing too. Does adding unphased singletons make the algorithm more numerically unstable?
I think I've got the numerics under control (the previous issues were a bug). But we'll see -- I think the GEL 40K is a good litmus test.
Holy shit, that looks great.
Whoops, mistake on my part. These definitely looked too good to be true. Here's the correct comparison:
Still pretty good. While we integrate over phase to get mutation age, we can also get MAP estimates of what the phase should be. This gets it right ~80% of the time in the example above.
To test the dating performance on singletons with the new phase-agnostic algorithm, I used the 1500-individual subset I made of the 1kgp-chr20p data. That way, I could use the singletons zarr from here, which has a column for the true phase. I first had to fix an oversight: there were ~300 sites with two alternative alleles that are both singletons, so I just kept one of them in each case.
```python
import os
import zipfile

import numpy as np
import pandas as pd
import xarray as xr
import tsdate
import tszip

# Unzip and open the singletons zarr
zip_path = 'data/1kgp_bal1500_singletons.zarr.zip'
output_path = os.path.splitext(zip_path)[0]
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    print(f'Unzipping to {output_path}')
    zip_ref.extractall(output_path)
ds = xr.open_dataset(output_path, engine='zarr')

ts = tszip.decompress('data/1kgp_bal1500-chr20p-filterNton23-truncate-0-0-0-mm0-post-processed.trees.tsz')
ts = ts.simplify()
ts = tsdate.util.split_disjoint_nodes(ts)

# Keep just one allele for multi-allelic singletons
dupe_pos = ds['position'].values
duplicate_mask = ~pd.Series(dupe_pos).duplicated(keep='first').values
assert np.sum(duplicate_mask) == len(np.unique(dupe_pos))
duplicate_mask_xarray = xr.DataArray(
    duplicate_mask, dims=['variants'], name='duplicate_position_mask'
)
ds.update({'duplicate_position_mask': duplicate_mask_xarray})

# Construct arrays for adding singletons
dedupe_ds = ds.where(ds.duplicate_position_mask, drop=True)
stn_pos = dedupe_ds['position'].values
stn_sample_id = dedupe_ds['sample_id'].values
stn_ancestral = dedupe_ds['allele'][:, 0].values
stn_derived = dedupe_ds['allele'][:, 1].values

# Map sample_ids to individual ids
sample_id_dict = {}
for individual in ts.individuals():
    sample_id = individual.metadata['sample_id']
    sample_id_dict[sample_id] = individual.id
stn_individ = np.array([sample_id_dict[sid] for sid in stn_sample_id])

unphased_ts = tsdate.phasing.insert_unphased_singletons(
    ts,
    position=stn_pos,
    individual=stn_individ,
    ancestral_state=stn_ancestral,
    derived_state=stn_derived,
)
```
There were 303,339 singletons to add to the ts. The `insert_unphased_singletons` function only took a few seconds on my laptop, so performance doesn't seem to be an issue @nspope. One odd issue I had was that `split_disjoint_nodes` failed if I did it after adding the singletons, hence why I simplified and split disjoint nodes first. It seems to be a problem with the node id assigned to the singletons that are added, but I haven't looked into it beyond that. Anyway, dating with `singletons_phased=False` was fast on my laptop in this case. The main issue is that 60% of the singletons are removed due to having 'invalid posteriors', which is higher than I expected:
```python
import logging
logging.getLogger().setLevel(logging.INFO)

unphased_dated_ts = tsdate.date(unphased_ts, mutation_rate=1.29e-08, singletons_phased=False)
```
```
2024-06-18 11:39:13,481 [1640] INFO root: Extracted mutations in 0.09639501571655273 seconds
2024-06-18 11:39:13,977 [1640] INFO root: Found 119208 unphased singleton mutations
2024-06-18 11:39:13,978 [1640] INFO root: Split unphased singleton edges into 1048995 blocks
2024-06-18 11:39:13,980 [1640] INFO root: Phased singletons in 0.4944934844970703 seconds
2024-06-18 11:39:56,218 [1640] INFO root: Skipped 0 edges with invalid factors
2024-06-18 11:39:56,219 [1640] INFO root: Calculated node posteriors in 41.7481415271759 seconds
2024-06-18 11:39:56,630 [1640] INFO root: Skipped 184131 mutations with invalid posteriors
2024-06-18 11:39:56,631 [1640] INFO root: Calculated mutation posteriors in 0.40827155113220215 seconds
2024-06-18 11:39:56,654 [1640] INFO root: Switched phase of 36525 singletons
2024-06-18 11:40:10,803 [1640] INFO root: Timescale rescaled in 14.147191524505615 seconds
2024-06-18 11:40:10,874 [1640] INFO root: Modified ages of 938 nodes to satisfy constraints
2024-06-18 11:40:10,919 [1640] INFO root: Set ages of 184131 nonsegregating mutations to root times.
2024-06-18 11:40:10,945 [1640] INFO root: Constrained node ages in 0.08892679214477539 seconds
2024-06-18 11:40:10,956 [1640] INFO root: Set metadata schema on NodeTable
2024-06-18 11:40:14,065 [1640] INFO root: Set metadata schema on MutationTable
2024-06-18 11:40:18,648 [1640] INFO root: Inserted node and mutation metadata in 7.70140266418457 seconds
2024-06-18 11:40:20,790 [1640] INFO root: Sorted tree sequence in 2.140110731124878 seconds
```
I then adapted the convenience function for adding singletons so that it uses the true phases (0/1) that I stored in the zarr:
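A minimal sketch of that kind of adaptation, adding each singleton directly via the tskit tables API and using the stored phase (0/1) to pick between the individual's two sample nodes (`insert_phased_singletons` is a hypothetical name; the actual adapted function may differ, and error checking is omitted):

```python
import tskit

def insert_phased_singletons(ts, position, individual, phase,
                             ancestral_state, derived_state):
    """Hypothetical sketch: add singleton mutations with a known phase."""
    tables = ts.dump_tables()
    for pos, ind, ph, anc, der in zip(position, individual, phase,
                                      ancestral_state, derived_state):
        # phase (0 or 1) selects one of the individual's two sample nodes
        node = ts.individual(ind).nodes[ph]
        site = tables.sites.add_row(position=pos, ancestral_state=anc)
        tables.mutations.add_row(site=site, node=node, derived_state=der,
                                 time=tskit.UNKNOWN_TIME)
    tables.sort()  # sites must be sorted by position before rebuilding
    return tables.tree_sequence()
```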
Dating with `singletons_phased=True` is still fast, but again 60% of singletons are skipped:

```python
phased_dated_ts = tsdate.date(phased_ts, mutation_rate=1.29e-08, singletons_phased=True)
```
```
2024-06-18 12:03:42,365 [1640] INFO root: Extracted mutations in 0.09936237335205078 seconds
2024-06-18 12:03:42,523 [1640] INFO root: Found 0 unphased singleton mutations
2024-06-18 12:03:42,525 [1640] INFO root: Split unphased singleton edges into 0 blocks
2024-06-18 12:03:42,526 [1640] INFO root: Phased singletons in 0.1573958396911621 seconds
2024-06-18 12:04:23,682 [1640] INFO root: Skipped 0 edges with invalid factors
2024-06-18 12:04:23,684 [1640] INFO root: Calculated node posteriors in 40.6736102104187 seconds
2024-06-18 12:04:24,023 [1640] INFO root: Skipped 184131 mutations with invalid posteriors
2024-06-18 12:04:24,024 [1640] INFO root: Calculated mutation posteriors in 0.33773255348205566 seconds
2024-06-18 12:04:24,027 [1640] INFO root: Switched phase of 0 singletons
2024-06-18 12:04:37,877 [1640] INFO root: Timescale rescaled in 13.847219467163086 seconds
2024-06-18 12:04:37,942 [1640] INFO root: Modified ages of 936 nodes to satisfy constraints
2024-06-18 12:04:37,975 [1640] INFO root: Set ages of 184131 nonsegregating mutations to root times.
2024-06-18 12:04:37,996 [1640] INFO root: Constrained node ages in 0.07012248039245605 seconds
2024-06-18 12:04:37,998 [1640] INFO root: Set metadata schema on NodeTable
2024-06-18 12:04:41,254 [1640] INFO root: Set metadata schema on MutationTable
2024-06-18 12:04:45,927 [1640] INFO root: Inserted node and mutation metadata in 7.9303083419799805 seconds
2024-06-18 12:04:48,069 [1640] INFO root: Sorted tree sequence in 2.1404826641082764 seconds
```
Finally, I used `evaluations.mutations_time` to compare the times of singletons with true phase (x) to the phase-agnostic singleton times:

I then generated a random Bernoulli vector to use for the phase and ran dating with `singletons_phased=True` (in hindsight I realise I could have just used the arbitrarily assigned phases from `insert_unphased_singletons` instead):
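The random phase assignment is just a Bernoulli(0.5) draw per singleton; a one-line sketch reusing `stn_pos` from the earlier snippet:

```python
import numpy as np

rng = np.random.default_rng(1234)
stn_phase = rng.integers(0, 2, size=len(stn_pos))  # one 0/1 draw per singleton
```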
Qualitatively, they seem similar to the simulation results, just with a tonne more variance. @nspope, is the high proportion of variants with invalid posteriors also surprising to you? If you'd like to investigate it further, here is the TS with the correctly phased singletons added. Let me know if there are other files you want:
1kgp_bal1500_phased_singletons.trees.zip
I'm going to test it in GEL next.
Just remembered that I can't install from GitHub in GEL anymore, so I can't test it until they approve the Airlock transfer of this branch of tsdate, annoyingly! I've just submitted the request, so it should be sorted within two days.
Thanks @duncanMR, very helpful and a lot to chew on.
> is the high proportion of variants with invalid posteriors also surprising to you?

Could you check the count of nonsegregating mutations (above the root)? Ideally those'll be the only ones where we can't assign a posterior.
> Thanks @duncanMR, very helpful and a lot to chew on.
>
> > is the high proportion of variants with invalid posteriors also surprising to you?
>
> Could you check the count of nonsegregating mutations (above the root)? Ideally those'll be the only ones where we can't assign a posterior.
I see: the number of mutations above root matches the number with invalid posteriors. So it's the root-to-sample edges problem rearing its head again! It's concerning how many there are here.
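For reference, one way to count such above-root mutations with the standard tskit tree API (a sketch, not the code used here):

```python
import tskit

def count_above_root_mutations(ts):
    """Count mutations sitting on a node with no parent in the local tree."""
    n = 0
    for tree in ts.trees():
        for site in tree.sites():
            for mut in site.mutations:
                if tree.parent(mut.node) == tskit.NULL:  # node is a root
                    n += 1
    return n
```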
> it's the root-to-sample edges problem
Oh -- I think I missed the discussion of this previously. What is going on? Are these not simply mutations that are segregating in the full data but fixed in the subset?
> > it's the root-to-sample edges problem
>
> Oh -- I think I missed the discussion of this previously. What is going on? Are these not simply mutations that are segregating in the full data but fixed in the subset?
Whoops, just realised my mistake. The problem is simply that many of the singletons I extracted fall outside the range of site positions in the TS, because I extracted them directly from the zarr and forgot to limit the positions by the site density mask. So mutations were being placed onto nodes in the first and last null trees in the TS, which cover the range of the genome that's been masked out. So it's nothing to worry about -- sorry for the false alarm!
Aha - of course! Easy mistake to make. We should catch this and error out, I think.
> We should catch this and error out, I think
Do you think this is a problem @hyanwong? It's not unusual to have some "above-the-root" mutations introduced by tsinfer (why, I don't know).
@duncanMR, two things that occurred to me -- we should be using mutation posteriors here, as that takes a weighted average of the two branch midpoints (posterior mean) rather than always choosing the longest branch (MAP estimate). I'd also be interested in seeing the parent node age -- an upper bound on mutation age may be of more interest if we're focusing on old singletons.
> > We should catch this and error out, I think
>
> Do you think this is a problem @hyanwong? It's not unusual to have some "above-the-root" mutations introduced by tsinfer (why, I don't know).
No, I mean we should error out if singletons are being added into regions where there are no edges.
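E.g. a simple bounds check against the edge span would have caught this case (a sketch reusing `ts` and `stn_pos` from the earlier snippet; a complete check would also cover internal spans with no edges):

```python
import numpy as np

# Error out if any singleton position falls where the ts has no edges,
# e.g. in the masked-out flanks covered only by null trees.
left, right = ts.edges_left.min(), ts.edges_right.max()
bad = (stn_pos < left) | (stn_pos >= right)
if np.any(bad):
    raise ValueError(
        f"{bad.sum()} singleton positions fall outside the edge span [{left}, {right})"
    )
```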
I assume the "above the root" mutations are added by parsimony, when e.g. we force the wrong ancestral state for a triallelic site
> @duncanMR, two things that occurred to me -- we should be using mutation posteriors here, as that takes a weighted average of the two branch midpoints (posterior mean) rather than always choosing the longest branch (MAP estimate). I'd also be interested in seeing the parent node age -- an upper bound on mutation age may be of more interest if we're focusing on old singletons.
Cool, I'll look into that today. I've also got tsdate through airlock so I'll test in GEL today too.
I assume the "above the root" mutations are added by parsimony, when e.g. we force the wrong ancestral state for a triallelic site
Yeah, above-the-root mutations are extremely rare in practice: I've only seen a handful of them in the largest tree sequences. Hopefully we can figure out a solution in the long term as part of my DPhil project.
When adding the singletons in GEL that Sam extracted, there are sites where the alternative allele (e.g. A) is a singleton, but another alternative allele (e.g. T) at the same site is at higher frequency and already in the TS. Do we want to exclude these singletons from being added? It seems a bit inconsistent to exclude multi-allelics from inference, but allow multi-allelics if one of the alleles is a singleton. There are quite a few: 9250 sites have this problem in chr17p alone.
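If we did exclude them, the filter would be cheap (a sketch, assuming the `stn_*` arrays from the earlier snippet):

```python
import numpy as np

# Drop singletons at positions that already have a site in the ts,
# i.e. multi-allelic sites where another alt allele was already inferred.
existing = np.isin(stn_pos, ts.sites_position)
stn_pos, stn_individ = stn_pos[~existing], stn_individ[~existing]
stn_ancestral, stn_derived = stn_ancestral[~existing], stn_derived[~existing]
```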
@nspope Adding singletons is failing for chr17p in GEL. It's a bit annoying that it didn't show up in the 1kgp data, since I can't share a reproducible example. It looks like a numerical issue: a mutation phase was generated outside [0,1]. I can try and set up a debugger to find out what the problematic phase is.
@duncanMR thanks Duncan. I can probably just clamp values to that range, if it's within the range of fp error. Would you mind inserting a print statement to see how far out of whack it is?
It turns out that it's not an issue where the phase is within fp error of 0 or 1. After updating my singleton filtering procedure, I now have 1 million singletons to add to the ~1 million variants already in the chr17p TS. Of those, there are three singletons which for some reason have invalid posteriors, but they aren't above the root this time and aren't in null trees. They are on short branches at the bottom of trees.
I first tried removing the three sites, but bizarrely, five more sites popped up with invalid posteriors. The fix was easy: the bug is that the invalid posteriors cause there to be three `nan` values in the `self.mutation_phase` array, which then get passed into `reallocate_unphased` and trigger the assert statements. All you have to do is check for `nan` values and skip them. For some reason, this problem doesn't occur with invalid posteriors due to mutations being placed in null trees. So it seems to work fine now:
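The guard amounts to something like this minimal sketch (hypothetical names, mirroring the fix described above):

```python
import numpy as np

def drop_nan_phases(mutation_phase, mutation_node):
    """Keep only singletons with a finite phase estimate.

    Hypothetical helper: invalid posteriors leave nan entries in the phase
    array, which would otherwise trip the assertions in reallocate_unphased.
    """
    valid = ~np.isnan(mutation_phase)
    return mutation_phase[valid], mutation_node[valid]
```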
@nspope I will add some comments to the PR where I've had issues so far.
> When adding the singletons in GEL that Sam extracted, there are sites where the alternative allele (e.g. A) is a singleton, but another alternative allele (e.g. T) at the same site is at higher frequency and already in the TS. Do we want to exclude these singletons from being added? It seems a bit inconsistent to exclude multi-allelics from inference, but allow multi-allelics if one of the alleles is a singleton. There are quite a few: 9250 sites have this problem in chr17p alone.
We don't ignore multiallelics in tsdate though, do we? True, they aren't used to help tsinfer inference, but once they are placed back on the ts by parsimony, they can be used for dating. I don't know if we do this for the GEL data though: it could be that Ben excludes them from the pipeline entirely.
> > When adding the singletons in GEL that Sam extracted, there are sites where the alternative allele (e.g. A) is a singleton, but another alternative allele (e.g. T) at the same site is at higher frequency and already in the TS. Do we want to exclude these singletons from being added? It seems a bit inconsistent to exclude multi-allelics from inference, but allow multi-allelics if one of the alleles is a singleton. There are quite a few: 9250 sites have this problem in chr17p alone.
>
> We don't ignore multiallelics in tsdate though, do we? True, they aren't used to help tsinfer inference, but once they are placed back on the ts by parsimony, they can be used for dating. I don't know if we do this for the GEL data though: it could be that Ben excludes them from the pipeline entirely.
My impression was that we exclude all multi-allelic sites instead of just keeping one allele at random, but we can check with Ben later.
We don't need to keep one at random: tsinfer can place non-inference sites back into the tree sequence using parsimony. They will then be used by tsdate.
It might be that multiallelics are discarded by the snakemake pipeline before tsinfer actually sees them, and are not placed back into the tree sequence using parsimony. There is an argument that perhaps we should place them back (or, if not, we should adjust the mutation rate to account for the fact that some sites have been discarded).
That's a good point. We definitely aren't placing any multi-allelics by parsimony at the moment (there is one mutation per site) but I don't know if they are available in the zarr to be added later. If that's the case, then I think adding back all the multi-allelics (not just singletons) makes sense.
I suspect they are in the zarr but masked out. @benjeffery will know. We can ask at 4.30 today.
After adding the unphased singletons to 1kgp-chr20p and dating, the upper-bound estimates (parent time) of singleton ages have a much lighter tail than the posterior means of higher-frequency variants. Seems reasonable to me:
I'm going to look at the 1kgp annotations dataframe next.
In case anyone needs them, I fixed the problem with the singletons being placed outside the range of the TS. Here is the undated TS with phased singletons added:
1kgp_bal1500_phased_singletons.trees.zip
The dated TS is too big for Github so you can get it here: https://drive.google.com/drive/folders/1vn5xMlBBXTB-_sosX9LpaQ_45E36UVoa?usp=drive_link
Using a rough analysis, I reckon our current singleton phasing approach misses about 25% of the information we could incorporate if we also used haplotype lengths. See https://github.com/tskit-dev/tsdate/issues/5#issuecomment-2186513151
I don't think this is enough to warrant modifying our existing approach, but something like this might make an interesting observation in the paper's discussion.
It'll frequently be the case that singletons are not phased. Singleton variants aren't used by tsinfer, but are crucial for correct dating. Here's how we'll handle unphased singletons:
Split edges leading to the same individual into "phase blocks" -- i.e. the intersection of intervals for these edges. For example, over some region of the genome the pair of edges leading to the same individual might have coordinates $[l_i, r_i]$ and $[l_j, r_j]$. The phase block for these two edges would have coordinates $[\max(l_i, l_j), \min(r_i, r_j)]$ (see the sketch below).
Use an expectation propagation update by phase block rather than separately for the two constituent edges. For example, say we have a phase block where the parents of the constituent edges have ages $t_i$ and $t_j$. The children of the edges are both present-day haplotypes (of age 0). If the phase block has (mutation-weighted) span $\mu$ and contains $y$ singleton mutations, then the total mutational target is $\lambda_{ij} = \mu (t_i + t_j)$ and the mutation likelihood is $y \sim \mathrm{Poisson}(\lambda_{ij})$.
So, if we have variational posteriors $t_i \sim \mathrm{Gamma}(\alpha_i, \beta_i)$ and $t_j \sim \mathrm{Gamma}(\alpha_j, \beta_j)$, the partition function used in the expectation propagation update for the phase block is $$\int_0^{\infty} \int_0^{\infty} t_i^{\alpha_i - 1} e^{-\beta_i t_i} t_j^{\alpha_j - 1} e^{-\beta_j t_j} (t_i + t_j)^{y} e^{-\mu (t_i + t_j)} \, dt_i \, dt_j = \mathcal{B}(\alpha_i, \alpha_j) \frac{\Gamma(\alpha_i + \alpha_j + y)}{(\beta_i + \mu)^{\alpha_i + \alpha_j + y}} \, {}_2F_1\!\left(\alpha_j, \alpha_i + \alpha_j + y; \alpha_i + \alpha_j; 1 - \frac{\beta_j + \mu}{\beta_i + \mu}\right)$$ and we get the updated moments of $t_i, t_j$ by differentiating this with respect to $\alpha, \beta$ as usual.
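As a sanity check, the identity can be verified numerically against quadrature (a sketch using scipy, with arbitrary parameter values):

```python
import numpy as np
from scipy.integrate import dblquad
from scipy.special import beta, gammaln, hyp2f1

a_i, b_i, a_j, b_j, y, mu = 2.5, 1.0, 1.5, 2.0, 3, 0.5

def integrand(t_j, t_i):
    return (t_i ** (a_i - 1) * np.exp(-b_i * t_i)
            * t_j ** (a_j - 1) * np.exp(-b_j * t_j)
            * (t_i + t_j) ** y * np.exp(-mu * (t_i + t_j)))

# truncate the domain; the integrand decays like exp(-(beta + mu) t)
numeric, _ = dblquad(integrand, 0, 60, 0, 60)

closed = (beta(a_i, a_j)
          * np.exp(gammaln(a_i + a_j + y))
          / (b_i + mu) ** (a_i + a_j + y)
          * hyp2f1(a_j, a_i + a_j + y, a_i + a_j,
                   1 - (b_j + mu) / (b_i + mu)))

print(numeric, closed)  # should agree to quadrature accuracy
```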
This gets substantially more complicated if the individual in question is not in the present day (especially if the age of the individual is unknown). So, for now, we'll have to assume that mutations on ancient samples are phased (I think these are often pseudohaploid sequences anyway).
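For step 1 above, the phase-block interval is just a pairwise interval intersection; a minimal sketch:

```python
def phase_block(l_i, r_i, l_j, r_j):
    """Intersection of two edge intervals leading to the same individual."""
    left, right = max(l_i, l_j), min(r_i, r_j)
    return (left, right) if left < right else None  # None: edges don't overlap

assert phase_block(0.0, 10.0, 5.0, 20.0) == (5.0, 10.0)
```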