neherlab / treetime

Maximum likelihood inference of time stamped phylogenies and ancestral reconstruction
MIT License

Problem w/ branch len estimate with closely related leaves #134

Open phiweger opened 4 years ago

phiweger commented 4 years ago

In the TreeTime .nexus output I get a huge negative branch length followed by another large one for the corresponding leaves:

...
((7ab5cd3e-524b-4f3e-9951-04d783bcef78:28113.25709,5abe8078-fdb6-4e90-9075-314bc4238f48:28113.15326)NODE_0000024:-28112.91947,(dd518f5c-d48a-464d-bff3-4becb51ae5d5:0.00000,0e2e43b1-45fc-4a32-b584-d4db7b91e86b:0.00000)
...

Is this a bug or some numerical instability? How could I avoid this?

Thanks a lot!

phiweger commented 4 years ago

Further testing gives me the impression that (1) this does not always occur given the same input and (2) it only occurs when I add the --confidence flag to treetime.

rneher commented 4 years ago

Is this run on a tree with four leaves, some with identical dates/branch lengths? Then it is likely a numerical instability when trying to invert a singular matrix.
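
To illustrate the kind of instability suggested here (a toy example, not TreeTime's actual computation): when two constraints on a pair of branch lengths are almost identical, the linear system is nearly singular, and a tiny inconsistency is amplified into huge values of opposite sign whose sum stays small. The reported snippet shows the same pattern: 28113.25709 + (-28112.91947) is about 0.34. The matrix entries, `eps`, and the right-hand sides below are all hypothetical.

```python
def solve_2x2(a, b, c, d, r1, r2):
    """Solve [[a, b], [c, d]] @ [x, y] = [r1, r2] using Cramer's rule."""
    det = a * d - b * c  # close to zero when the two rows are almost identical
    x = (r1 * d - b * r2) / det
    y = (a * r2 - c * r1) / det
    return x, y


# Hypothetical: how far the system is from being exactly singular.
eps = 1e-9

# Two nearly identical constraints on a parent branch x and a child branch y.
# The right-hand sides differ by only 2.8e-5 (also a made-up value).
x, y = solve_2x2(1.0, 1.0, 1.0, 1.0 + eps, 0.3376, 0.3376 + 2.8e-5)

print(f"x = {x:.5f}, y = {y:.5f}, x + y = {x + y:.5f}")
```

Here `x` comes out near -28000 and `y` near +28000, while `x + y` stays close to 0.3376: the individual values are meaningless, but their sum is still well determined.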

phiweger commented 4 years ago

Yes, this is a larger tree (20+ leaves), but 3 of the leaves are identical in their SNV alignment while their dates differ. Is there a way around this instability besides manually clipping the corresponding branch lengths to 0? The dates should help resolve polytomies, right?
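
Before clipping anything by hand, it can help to list which branches are affected. A minimal sketch that scans Newick-style text for `name:length` pairs and reports negative or implausibly large lengths. It assumes plain Newick without bracketed annotations, and the `max_length` threshold is an arbitrary illustrative value that depends on the tree's units:

```python
import re

# Matches "<name>:<number>"; the name may be empty for unnamed internal nodes.
BRANCH_LENGTH = re.compile(r"([^\s(),:;]*):(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def suspicious_branches(newick, max_length=1000.0):
    """Return (name, length) pairs whose length is negative or above max_length."""
    flagged = []
    for name, length_text in BRANCH_LENGTH.findall(newick):
        length = float(length_text)
        if length < 0 or length > max_length:
            flagged.append((name, length))
    return flagged


# The fragment quoted in the first comment of this thread.
snippet = (
    "((7ab5cd3e-524b-4f3e-9951-04d783bcef78:28113.25709,"
    "5abe8078-fdb6-4e90-9075-314bc4238f48:28113.15326)NODE_0000024:-28112.91947,"
    "(dd518f5c-d48a-464d-bff3-4becb51ae5d5:0.00000,"
    "0e2e43b1-45fc-4a32-b584-d4db7b91e86b:0.00000)"
)

for name, length in suspicious_branches(snippet):
    print(name, length)
```

On the snippet from the first comment this reports the two leaves with lengths above 28113 and NODE_0000024 with its negative length; the two zero-length leaves are not flagged.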

rneher commented 4 years ago

Could you send me these data? I can't quite explain why this might happen, and it would be good to fix.

phiweger commented 4 years ago

which data do you need? the alignment, dates, undated tree -- anything else?

rneher commented 3 years ago

yes, those are what I would need.

ktmeaton commented 3 years ago

I think I might be having a similar error (if not I can open a new issue). When estimating date confidences using the marginal likelihood, some nodes will sporadically have very large intervals:

image

Rather than having intervals in the range of 100s of years, these nodes have confidence intervals of +100,000 years. These large intervals are somewhat random, in that rerunning the analyses moves them around. Any thoughts on why this might be occurring and if there's a solution?
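
One way to find such nodes systematically, rather than by eye in the plot, is to compare each interval's width with the median width. A minimal sketch, assuming the lower and upper bounds have already been read into a dict (for example from TreeTime's dates.tsv output; its exact column layout is not shown in this thread, so parsing is left out). The node names, the numbers, and the `factor` threshold are made up for illustration:

```python
from statistics import median


def flag_wide_intervals(intervals, factor=100.0):
    """Return the names of nodes whose interval is wider than factor * median width.

    intervals maps a node name to a (lower, upper) pair of numeric dates.
    """
    widths = {node: upper - lower for node, (lower, upper) in intervals.items()}
    typical = median(widths.values())
    return sorted(node for node, width in widths.items() if width > factor * typical)


# Hypothetical intervals in years; NODE_3 mimics a runaway lower bound.
example = {
    "NODE_1": (1200.0, 1400.0),
    "NODE_2": (1300.0, 1450.0),
    "NODE_3": (-120000.0, 1500.0),
    "NODE_4": (1350.0, 1500.0),
}

print(flag_wide_intervals(example))
```

Because the affected nodes move around between runs, comparing the flagged sets from several reruns could show whether any node is consistently affected.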

rneher commented 3 years ago

yes, this looks like there is a problem. My hunch is that there is some numerical accuracy problem.

ktmeaton commented 3 years ago

I was thinking numerical accuracy too. This is a large phylogeny with many small branches (1e-8). Would there be any value in rescaling the branch lengths beforehand (e.g. multiplying them all by 1e4)?

rneher commented 3 years ago

I suppose this is a large genome? Does this use a SNP-only alignment? Or a VCF file? TreeTime carries around an internal scale, one_mutation = 1/L (L being the length of the genome). One could just try to trick it into assuming the genome is shorter. But I am not sure I understand your application well enough.
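
The scale mentioned here can be made concrete with a little arithmetic. If branch lengths are in substitutions per site, a branch of length b on a genome of length L corresponds to b * L expected mutations, i.e. b / one_mutation. A minimal sketch, assuming a hypothetical genome length of 4,000,000 sites (the thread does not state the real one) together with the 1e-8 branch length and 1e4 factor mentioned above; the function names are illustrative and not part of TreeTime:

```python
def one_mutation(genome_length):
    """Branch length (substitutions per site) that corresponds to one mutation."""
    return 1.0 / genome_length


def branch_in_mutations(branch_length, genome_length):
    """Express a per-site branch length as an expected number of mutations."""
    return branch_length / one_mutation(genome_length)


GENOME_LENGTH = 4_000_000  # hypothetical value, not taken from the thread
BRANCH_LENGTH = 1e-8       # order of magnitude quoted in the thread
SCALE = 1e4                # rescaling factor proposed in the thread

original = branch_in_mutations(BRANCH_LENGTH, GENOME_LENGTH)
rescaled = branch_in_mutations(BRANCH_LENGTH * SCALE, GENOME_LENGTH / SCALE)

print(original, rescaled)
```

Both calls give about 0.04 expected mutations: multiplying the branch lengths by a factor and dividing the assumed genome length by the same factor leaves the mutation count unchanged. Whether this helps with the numerical problem is not established in the thread.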

m-a-martin commented 3 years ago

> I think I might be having a similar error (if not I can open a new issue). When estimating date confidences using the marginal likelihood, some nodes will sporadically have very large intervals:
>
> image
>
> Rather than having intervals in the range of 100s of years, these nodes have confidence intervals of +100,000 years. These large intervals are somewhat random, in that rerunning the analyses moves them around. Any thoughts on why this might be occurring and if there's a solution?

I am having the same (or a similar) issue on a SARS-CoV-2 dataset with roughly 5000 sequences using the flags below; however, it also occurs without the covariation or branch-length-mode flags:

--tree ml_clean.nwk --dates clean_metadata.tsv --aln aln_clean.fasta --clock-filter 4 --reroot EPI_ISL_402125 --covariation --coalescent skyline --clock-rate 0.001 --clock-std-dev 0.0005 --branch-length-mode joint --confidence --keep-polytomies

I'm using a full alignment. The problem is random: rerunning on the same dataset can generate reasonable confidence intervals, but it happens often enough that it is an issue. I'm using TreeTime v. 0.80 on Python v3.9. I've attached the treetime output as well as the ML tree and a list of accession numbers (I can't share the alignment because it is GISAID data).

for_github.zip

rneher commented 3 years ago

Sorry, just started to pick this up again. All the numbers in the dates.tsv file look sensible, and these should be the same as in the graph, with the exception of those labeled as problematic branches, which are masked in dates.tsv but not in the graph. My hunch is that these long bars are essentially undefined confidence intervals of branches that don't follow the clock well enough for us to rely on this estimate. I'll add a line to exclude these from the graph.