neherlab / treetime

Maximum likelihood inference of time stamped phylogenies and ancestral reconstruction
MIT License

ERROR: No variation in sampling dates! Please specify your clock rate explicitly #227

Open BCMollett opened 1 year ago

BCMollett commented 1 year ago

Hi,

I am running treetime --covariation --confidence --clock-filter 5 --tree <input.nwk> --aln <input.aln.fasta> --dates <input.csv> on a selection of N1 subtype influenza viruses, and it returns the following:

ValueError: No variation in sampling dates! Please specify your clock rate explicitly.

ERROR: No variation in sampling dates! Please specify your clock rate explicitly.

ERROR in TreeTime.run: An error occurred which was not properly handled in TreeTime. If this error persists, please let us know by filing a new issue including the original command and the error above at: https://github.com/neherlab/treetime/issues

The dataset contains sequences with dates from 2014 to 2021, and I have previously used the same command for the N2 subtype and all other gene segments without error. I am sure all headers and dates are correct and match.

Do you have any idea/advice on how to get around this issue?

Thanks, Ben

corneliusroemer commented 1 year ago

Hi Ben, happy to help! It sounds like somewhere between you and treetime there's a misunderstanding about what the sampling dates are. It could be as simple as a different column name for your dates. But rather than speculating, the best way forward is for you to share your inputs (the exact files), the exact command you use (copy and paste it), and the output of treetime --version. You can send the files to cornelius.roemer@unibas.ch if you can't share them publicly.
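One quick way to test the column-name hypothesis before sharing anything is to load the metadata yourself and count the distinct dates. This is only a minimal sketch: it assumes a comma-separated file with columns called `strain` and `date` (the names treetime reports using in the log later in this thread), and both the helper `summarize_dates` and the inline sample rows are made up for illustration.

```python
import csv
import io

def summarize_dates(handle, name_col="strain", date_col="date"):
    """Return (number of rows, sorted distinct date strings) for a metadata CSV."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in (name_col, date_col) if c not in header]
    if missing:
        # A differently named date column would show up here.
        raise KeyError(f"missing column(s) {missing}; header is {header}")
    dates = [row[date_col].strip() for row in reader]
    return len(dates), sorted(set(dates))

# Hypothetical sample data; for a real file use open("your_metadata.csv", newline="").
sample = io.StringIO(
    "strain,date\n"
    "A/example/1/2014,2014-03-01\n"
    "A/example/2/2018,2018-07-15\n"
    "A/example/3/2021,2021-01-20\n"
)
n_rows, distinct = summarize_dates(sample)
print(f"{n_rows} rows, {len(distinct)} distinct dates: {distinct[0]} .. {distinct[-1]}")
```

If this reports a single distinct date, or raises because the expected column is absent, the problem is in the metadata rather than in treetime.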

BCMollett commented 1 year ago

Thank you for the quick reply! I am checking what restrictions apply to sharing files on my end. If sharing is allowed, I will send the files by email.

corneliusroemer commented 1 year ago

It should be possible to debug with a lot of columns removed to reduce scope of sharing.

You could try reducing the number of samples to 5 or so. If the dataset includes some public samples anyway, you could keep just those.

Otherwise, just the header of the csv could be useful - that shouldn't contain anything sensitive.
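A reduced file along these lines could be produced with a few lines of Python. This is a sketch under assumptions: the column names `strain` and `date`, the extra columns in the sample, and the helper name `minimal_metadata` are all hypothetical, and the tree and alignment would still need to be cut down to the same tips separately.

```python
import csv
import io

def minimal_metadata(handle, keep_cols=("strain", "date"), max_rows=5):
    """Return CSV text containing only keep_cols and at most max_rows data rows."""
    reader = csv.DictReader(handle)
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(keep_cols),
        extrasaction="ignore",  # silently drop every column not in keep_cols
        lineterminator="\n",
    )
    writer.writeheader()
    for i, row in enumerate(reader):
        if i >= max_rows:
            break
        writer.writerow(row)
    return out.getvalue()

# Hypothetical metadata with columns one might not want to share.
sample = io.StringIO(
    "strain,date,host,location\n"
    "A/example/1/2014,2014-03-01,swine,site1\n"
    "A/example/2/2015,2015-06-11,swine,site2\n"
    "A/example/3/2016,2016-02-09,swine,site1\n"
    "A/example/4/2018,2018-07-15,swine,site3\n"
    "A/example/5/2020,2020-11-30,swine,site2\n"
    "A/example/6/2021,2021-01-20,swine,site1\n"
)
shared = minimal_metadata(sample)
print(shared)
```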

BCMollett commented 1 year ago

I have sent through the files. Did you receive them?

corneliusroemer commented 1 year ago

> I have sent through the files. Did you receive them?

Yes, thanks! I just had a look. It appears that the clock filter removes so many tips/sequences that the rerooting regression which runs afterwards fails. This case should probably be handled better, so thanks a lot for the report!

As a workaround you could try some of the following options:

In the future, you can find out more about what's going on inside treetime by passing e.g. --verbose 4, or an even higher number, to get more detailed output.

A key line in the output is:

 0.90    TreeTime.clock_filter: More than a third of leaves have been excluded by
         the clock filter. Please check your input data.

When treetime runs successfully (which you can achieve by passing --clock-filter 0) you'll see why the clock filter ends up throwing out almost all of the data:

[image: root-to-tip regression plot]

Almost none of the data lies in the "acceptable" regression range unless you use large clock filter values (10 or more interquartile distances) or switch the filter off altogether. Your data deviates so much from the assumptions of the clock filter model that the filter fails here.
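To make the filtering criterion concrete, here is a rough sketch of an outlier filter on root-to-tip regression residuals. The `n_iqd` argument visible in the traceback further down suggests the threshold is counted in interquartile distances, so the sketch does the same. It is not treetime's implementation: it uses a plain least-squares fit via `numpy.polyfit`, ignores shared ancestry, and the function name `iqd_outliers` and the toy data are invented for illustration.

```python
import numpy as np

def iqd_outliers(dates, root_to_tip, n_iqd=5.0):
    """Flag tips whose regression residual exceeds n_iqd interquartile distances."""
    dates = np.asarray(dates, dtype=float)
    root_to_tip = np.asarray(root_to_tip, dtype=float)
    # Ordinary least-squares line: divergence = slope * date + intercept.
    slope, intercept = np.polyfit(dates, root_to_tip, 1)
    residuals = root_to_tip - (slope * dates + intercept)
    q75, q25 = np.percentile(residuals, [75, 25])
    iqd = q75 - q25
    return np.abs(residuals) > n_iqd * iqd

# Toy data: eight tips near a clock of 0.003 subs/site/year, plus one divergent tip.
years = np.array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2017.5])
noise = np.array([1, -1, 1, -1, 1, -1, 1, -1, 0]) * 0.002
divergence = 0.003 * (years - 2010) + noise
divergence[-1] += 0.05  # the divergent tip

flags = iqd_outliers(years, divergence, n_iqd=5)
print("excluded tips:", int(flags.sum()), "of", len(flags))
```

When most of a dataset scatters widely around the fitted line, a filter of this kind can exclude a large share of the tips, which is the situation the warning in the log describes.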

You can find this plot and other diagnostic information in the run-output folder, which should appear in your working directory; see the screenshot for its standard content: [image]

corneliusroemer commented 1 year ago

This is the full log I get with default verbosity:

treetime --covariation --confidence --clock-filter 5 --tree N1_subset.aln.clean.fasta.treefile.nwk --aln N1_subset.aln.clean.fasta --dates Matched_Metadata.csv            

Attempting to parse dates...
        Using column 'strain' as name. This needs match the taxon names in the tree!!
        Using column 'date' as date.

0.00    -TreeAnc: set-up

0.16    WARNING: Previous versions of TreeTime (<0.7.0) RECONSTRUCTED sequences of
        tips at positions with AMBIGUOUS bases. This resulted in unexpected
        behavior is some cases and is no longer done by default. If you want to
        replace those ambiguous sites with their most likely state, rerun with
        `reconstruct_tip_states=True` or `--reconstruct-tip-states`.

0.66    TreeTime.reroot: with method or node: least-squares

0.66    TreeTime.reroot: rerooting will ignore covariance and shared ancestry.

0.90    TreeTime.clock_filter: More than a third of leaves have been excluded by
        the clock filter. Please check your input data.

0.91    TreeTime.reroot: with method or node: least-squares

0.91    TreeTime.reroot: rerooting will account for covariance and shared ancestry.
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/treetime/treetime.py", line 57, in run
    return self._run(**kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/treetime/treetime.py", line 221, in _run
    self.clock_filter(reroot=reroot_mechanism, n_iqd=n_iqd, plot=plot_rtt, fixed_clock_rate=fixed_clock_rate)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/treetime/treetime.py", line 439, in clock_filter
    self.reroot(root=reroot, clock_rate=fixed_clock_rate)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/treetime/treetime.py", line 521, in reroot
    new_root = self._find_best_root(covariation=use_cov,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/treetime/treetime.py", line 949, in _find_best_root
    return Treg.optimal_reroot(force_positive=force_positive, slope=slope, keep_node_order=self.keep_node_order)['node']
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/treetime/treeregression.py", line 433, in optimal_reroot
    best_root = self.find_best_root(force_positive=force_positive, slope=slope)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/treetime/treeregression.py", line 340, in find_best_root
    x, chisq = self._optimal_root_along_branch(n, tv, bv, var, slope=slope)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/treetime/treeregression.py", line 396, in _optimal_root_along_branch
    chisq_grid = np.array([chisq(x) for x in grid])
                          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/treetime/treeregression.py", line 396, in <listcomp>
    chisq_grid = np.array([chisq(x) for x in grid])
                           ^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/treetime/treeregression.py", line 386, in chisq
    return base_regression(tmpQ, slope=slope)['chisq']
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/treetime/treeregression.py", line 32, in base_regression
    raise ValueError("No variation in sampling dates! Please specify your clock rate explicitly.")
ValueError: No variation in sampling dates! Please specify your clock rate explicitly.

ERROR: No variation in sampling dates! Please specify your clock rate explicitly. 

ERROR in TreeTime.run: An error occurred which was not properly handled in TreeTime. If this error persists, please let us know by filing a new issue including the original command and the error above at: https://github.com/neherlab/treetime/issues
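The traceback ends in a regression that cannot estimate a rate. As a simplified illustration of the basic mechanism, the sketch below uses ordinary least squares, where the slope is the covariance of date and divergence divided by the variance of the dates. If the dates do not vary, there is nothing to divide by. The function `estimate_clock_rate` is hypothetical, and the sketch does not reproduce the covariance-weighted sums that treetime's `base_regression` works with.

```python
import numpy as np

def estimate_clock_rate(dates, root_to_tip):
    """Least-squares clock rate; raises if the sampling dates carry no signal."""
    dates = np.asarray(dates, dtype=float)
    root_to_tip = np.asarray(root_to_tip, dtype=float)
    date_var = np.var(dates)
    if date_var <= 0:
        # Including the number of tips makes the message easier to act on.
        raise ValueError(
            f"no variation in {len(dates)} sampling dates; cannot estimate a clock rate"
        )
    cov = np.mean((dates - dates.mean()) * (root_to_tip - root_to_tip.mean()))
    return cov / date_var

rate = estimate_clock_rate([2014, 2017, 2021], [0.012, 0.021, 0.033])
print("rate:", rate)

try:
    estimate_clock_rate([2018, 2018, 2018], [0.020, 0.024, 0.022])
except ValueError as err:
    print("failed:", err)
```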

Some things to address within treetime to make such issues easier to debug for users:

The log message "0.90 TreeTime.clock_filter: More than a third of leaves have been excluded by the clock filter. Please check your input data." is hard to spot. In this case it correctly points towards the root cause, but this hint would be more useful in the error message itself.

When that "No variation in sampling dates" error happens, it would be good to help the user by reporting the following:

BCMollett commented 1 year ago

I'm glad it was a relatively simple issue! You have given me a bit to think about with this dataset and treetime troubleshooting

Thanks so much for your assistance.