neherlab / treetime

Maximum likelihood inference of time stamped phylogenies and ancestral reconstruction
MIT License
222 stars 55 forks source link

Divide by zero error when clocking a polytomy tree containing 500 samples #214

Closed mdperry closed 1 year ago

mdperry commented 1 year ago

Hi, I am still trying to use tools in the augur package to clock trees produced by the UShER package. Previously I had encountered some unexpected errors related to subtrees with extremely long branch lengths, and I was able to get some help to solve (or workaround) that issue. However I have recently encountered what I believe are a completely different, unrelated class of errors. When I am running augur refine et al, on batches of files I discovered that the UShER subtree that throws this exception will have 500 specimens with identical Nextstrain clades and identical pangolin lineages. In other words, a polytomy. I have seen this now with both Delta, and Omicron subtrees. Here are the details:

augur refine -t newfile.nwk -a newfile.vcf --vcf-reference /cromwell_root/fc-secure-2063e6fe-3ab3-4713-9cd6-857b72d9954c/wuhCor1.fa --metadata user_subset_metadata.tsv --timetree --root min_dev --coalescent opt --output-tree augur.tree.nwk --output-node augur.branch_lengths.json --divergence-units mutations

/usr/local/lib/python3.7/site-packages/treetime/gtr.py:502: RuntimeWarning: divide by zero encountered in true_divide
  W_ij = (nij+nij.T+2*pc_mat)/mu/(np.outer(pi,Ti) + np.outer(Ti,pi)
/usr/local/lib/python3.7/site-packages/treetime/gtr.py:502: RuntimeWarning: invalid value encountered in true_divide
  W_ij = (nij+nij.T+2*pc_mat)/mu/(np.outer(pi,Ti) + np.outer(Ti,pi)
/usr/local/lib/python3.7/site-packages/treetime/gtr.py:10: RuntimeWarning: invalid value encountered in double_scalars
  return (np.einsum('i,ij,j', pi, W, pi) - np.sum(pi*W[:,gap_index])*pi[gap_index])/(1-pi[gap_index])
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/treetime/treetime.py", line 57, in run
    return self._run(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/treetime/treetime.py", line 212, in _run
    max_iter=1, method_anc = method_anc, **seq_kwargs)
  File "/usr/local/lib/python3.7/site-packages/treetime/treeanc.py", line 1357, in optimize_tree
    marginal=marginal_sequences, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/treetime/treeanc.py", line 482, in reconstruct_anc
    return self.infer_ancestral_sequences(*args,**kwargs)
  File "/usr/local/lib/python3.7/site-packages/treetime/treeanc.py", line 534, in infer_ancestral_sequences
    self.infer_gtr(marginal=marginal, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/treetime/treeanc.py", line 1501, in infer_gtr
    prof_map = self.gtr.profile_map)
  File "/usr/local/lib/python3.7/site-packages/treetime/gtr.py", line 530, in infer
    gtr.assign_rates(mu=mu, W=W_ij, pi=pi)
  File "/usr/local/lib/python3.7/site-packages/treetime/gtr.py", line 224, in assign_rates
    self._eig()
  File "/usr/local/lib/python3.7/site-packages/treetime/gtr.py", line 542, in _eig
    self.eigenvals, self.v, self.v_inv = self._eig_single_site(self.W, self.Pi)
  File "/usr/local/lib/python3.7/site-packages/treetime/gtr.py", line 553, in _eig_single_site
    assert np.abs(np.diag(W).sum())<1e-10
AssertionError

ERROR:  

ERROR in TreeTime.run: An error occurred which was not properly handled in TreeTime. If this error persists, please let us know by filing a new issue including the original command and the error above at: https://github.com/neherlab/treetime/issues 

Thanks, -- Marc Perry

corneliusroemer commented 1 year ago

Thanks @mdperry - do you mean all sequences are identical?

Could you share the build files and maybe a screenshot of the tree? That'd be great!

What's the version of augur and treetime you use?

This looks like some edge case tree that's not usually encountered unless you make a tree with 12 million sequences :D Which no one thought about doing until, well, you Usher folks did...

mdperry commented 1 year ago

Hi @corneliusroemer, I will recap the workflow that led me to issue #214. We built a big tree using UShER (12 million sounds about right), then for about 550,000 samples we used matUtils to extract subtrees, each one is an auspice.us compatible JSON file, and each contains the 500 nearest neighbours of our sample. If you are keeping track, when I ran this the other day I got back over 80,000 subtrees (because some subtrees contain multiple samples from the list, and we don't have to keep re-making them).

Issue #214 was spawned by the file named CDPH-subtree-101.json, and I am going to upload that, strictly for your reference (to visualize?). The sample name is: USA/CA-CDPH-500006912/2021|EPI_ISL_5442773|2021-07-23, the Nextstrain clade is 21I (Delta), and the pangolin lineage is AY.47. To prepare the files for augur refine (and subsequent steps), I have a Perl script that parses the first JSON file, and creates a TSV metadata file; column 1 contains the names of those 500 samples (I am uploading all of these as well).

Now I run matUtils again, feeding it the same huge tree, and this list of 500 samples. For output this time, instead of asking for an auspice.us compatible JSON file I request a newick formatted tree, and a VCF file of the 500 sequences. Those are the final two inputs to the augur refine command. Oh wait, following up on your earlier tip, I take all of the UShER branch-lengths in the newick file, and divide them by 29903, before I use that for input to augur/treetime.

You asked if the 500 sequences are identical. My understanding is that in real life they are not (because of sporadic N's from sequencing), and because of private mutations (I assume). However, from UShER's point of view, since they are all placed together in the same node, I think that means that they all have the same lineage defining mutations. Also, because it is UShER, using its maximum parsimony algorithm, that means this was either the best, or among the best places to add each sample to the tree at the time of evaluation.

You asked about the software versions for augur and treetime, I am using a docker container specified by this: docker: "nextstrain/base" and I believe that gets evaluated at runtime. From the stdout, I learned that, "augur refine is using TreeTime version 0.9.4" When I looked on DockerHub I couldn't figure out which version of augur that is using/pointing at, but I am assuming (based on nothing) that it would be fairly recent (?). I ran this process last week, on OCT-11.

newfile.nwk.gz newfile.vcf.gz sample_list.txt CDPH-subtree-101.json.gz user_subset_metadata.tsv.gz

anna-parker commented 1 year ago

From having a quick look I think this error is happening because TreeTime estimates a GTR model from the input data - using mutations of the input data - however this data set has no (?) mutation diversity (all samples in the vcf file have the same mutations?). You could rerun in TreeTime with a fixed GTR model or a GTR model that was calculated from a covid tree with (more) mutations. See the gtr command: https://treetime.readthedocs.io/en/latest/commands.html?highlight=infer%20gtr. I don't think this has been added to the augur CLI as an option, @corneliusroemer or @rneher please correct me if I am wrong. Or alternatively, take a larger subset of the tree with mutations so that refine can infer a GTR model. Oh and probably the (molecular) clock rate should also be given as input to avoid any issues with estimation - for NA the clock rate is normally around 0.0028, in augur this would mean adding the flag --clock-rate 0.0028.

corneliusroemer commented 1 year ago

Ah yes, very good point - that for this dataset it would make a lot of sense to use a once-computed GTR from the whole tree rather than do it every time fresh!

@anna-parker we should probably still use this issue to add a helpful error message that can be actioned by users :)

rneher commented 1 year ago

I think Anna is correct. this problem happens because the tree of these samples has 0 length. And this edge case isn't captured in the GTR estimation.

the offending line is likely this one:

https://github.com/neherlab/treetime/blob/master/treetime/gtr.py#L514

explicitly specifying a GTR model should avoid this error, but as Anna wrote, this option isn't exposed in augur.

mdperry commented 1 year ago

Hi, Thank you @anna-parker, @corneliusroemer, and @rneher for these suggestions. By specifying the --clock-rate the timetrees I generate with augur have much shorter branch-lengths (in general), and I think that is an improvement.

RE: specifying a GTR model for the subtrees that were throwing a divide-by-zero exception, I was able to use a larger portion of the tree to generate a more generic GTR model using treetime run on the command line, but I am stymied about the correct syntax for specifying a custom GTR model when attempting to use treetime to clock these subtrees where there are 500 samples with the identical lineage call.

From the various documentation sources, examples, and tutorials, I have been able to use the --gtr argument to specify one of the standard GTR models (e.g. Jukes-Cantor), and I understand in principle that I can use the --gtr-params argument to pass in key = value pairs for mu and pi, but I don't understand how to specify that the --gtr model I want treetime to use is a custom, exogenous one, not contained in the nuc_models.py file, nor do I understand at all, how I could use key = value pairs to pass in the W nxn matrix generated by my separate --gtr infer output (so that I could use that data as the GTR model for the subtrees that it currently balks on).

Here is the information that I get from the inferred GTR model using treetime on a downsampled tree that has representative samples covering 25 nextclade clades

Substitution rate (mu): 1.0

Equilibrium frequencies (pi_i):
  A: 0.2984
  C: 0.179
  G: 0.1918
  T: 0.3205
  -: 0.0103

Symmetrized rates from j->i (W_ij):
        A       C       G       T       -
  A     0       0.2359  2.0321  0.2309  0.3015
  C     0.2359  0       0.4964  4.6721  0.4131
  G     2.0321  0.4964  0       0.9282  0.3966
  T     0.2309  4.6721  0.9282  0       0.2867
  -     0.3015  0.4131  0.3966  0.2867  0

Actual rates from j->i (Q_ij):
        A       C       G       T       -
  A     0       0.0704  0.6063  0.0689  0.09
  C     0.0422  0       0.0889  0.8365  0.074
  G     0.3898  0.0952  0       0.1781  0.0761
  T     0.074   1.4972  0.2974  0       0.0919
  -     0.0031  0.0043  0.0041  0.003   0

Inferred sequence evolution model (saved as estimation_of_time-scaled_phylogenies/molecular_clock.txt):
Root-Tip-Regression:
 --rate:        2.800e-03
 --r^2:         0.15

Any further suggestions, tips, or pointers would be greatly appreciated.

Thanks,

-- Marc

anna-parker commented 1 year ago

@mdperry sadly we do not currently have a method of reading in and using inferred custom GTR models for further inference- this is sth that would probably make sense for us to add as a feature.

As your inferred GTR model is quite different from other standard models that we have implemented (as you already mentioned) you would have to write a script to use it. As you said this would use the custom command: https://treetime.readthedocs.io/en/latest/gtr.html?highlight=custom#treetime.GTR.custom. Some examples can be found in the test folder: https://github.com/neherlab/treetime/blob/b5f5c237b4667973e492875707ee79ba4af86f8c/test/test_treetime.py#L136

Sorry to not be of more help!