veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
217 stars 69 forks source link

FUBAR: "best scaling achieved for rate" = 0; error or trustworthy result? #805

Closed singing-scientist closed 6 years ago

singing-scientist commented 6 years ago

When I run FUBAR on Mac, sometimes "best scaling [is] achieved for" synonymous and non-synonymous rates of 0. In such cases, for all sites alpha = beta = 0, and obviously no selection is detected. Yet there is plenty of synonymous and non-synonymous variation among the sequences (measured separately using Nei-Gojobori), and certain genes (partitions) yield rates > 0 for the same trees that sometimes yield the ubiquitous rate = 0 phenomenon.

Is rate = 0 and a = b = 0 for all codons to be trusted as a result, or does this mean something has gone wrong? Many thanks!

screen shot 2018-06-08 at 1 59 51 am screen shot 2018-06-08 at 2 10 09 am
spond commented 6 years ago

Dear Nelson,

This behavior is abnormal. Could you please share the data file that generated 0 scaling? I’d like to replicate and resolve the issue. You can email me or post here.

Best, Sergei

singing-scientist commented 6 years ago

Dear Sergei: thanks so much for your swift response. I have emailed example data to your Temple address. The data resulted in the above situation when using the Mac Terminal 'HYPHYMP' call with FUBAR (4) and all defaults. I look forward to hearing back!

spond commented 6 years ago

Dear @cwnelson88,

I was able to identify the cause of the problem. First, in order to fix it, you need to remove the last alignment column from your alignment. Why, you might ask?

Your dataset represents a very interesting "edge-case" for dealing with ambiguous data. If you recall, FUBAR first fits nucleotide model to estimate branch lengths, and then uses these branch lengths as fixed during codon optimization. In particular, if any branch length is estimated as '0', it will always remain '0' during the FUBAR run. A '0' length branch must connect identical sequences, because the probability of evolving any change along this branch is '0'. With that in mind, consider two sequences, A and B which are identical except for the last codon (they exist in your datafile), which has ambiguous characters

>A
[some sequence] TNA
>B
[same sequence] TGN

What should the branch length be under the nucleotide model? Well, because the sequences are identical except for two 'N' characters, and 'N' is missing data (in other words, NG and AN columns constitute no evolution), the MLE for the branch length is '0'

Now consider the same data as codons and ask the question what the branch length should be. The answer is: no longer 0! That is because

TNA -> TCA or TTA (TAA and TGA are stop codons)
TGN -> TGC or TGG or TGT (TGA is a stop codon)

Which means that some evolution must take place to convert TNA to TGN, and therefore the codon branch length cannot be zero.

Putting this all together

  1. You have sequences that differ only in the last codon with ambiguous data
  2. When treated as nucleotides these sequences give rise to zero branch lengths
  3. When treated as codons these sequences give rise to non-zero branch lengths
  4. When you try to fit data that require evolution (step 3) to trees with constrained zero branch lengths (from step 2), you end up with 0 probability of generating such data, or -infinite log likelihoods, regardless of what dS and dN are. This breaks FUBAR.

If you remove the last column (which contains no useful information, one might say), this behavior is no longer present. FUBAR results are pasted below.

I'll need to think of a more systematic fix, so the code can automatically report that something weird is going on, and flag it as an error, or better yet, a more robust way to deal with branch length approximations.

Best, Sergei

Computing the phylogenetic likelihood function on the grid

Running an iterative zeroth order variational Bayes procedure to estimate the posterior mean of rate weights

Tabulating site-level results

Codon Partition alpha beta Posterior prob for positive selection
63 1 0.429 15.968 Pos. posterior = 1.0000
64 1 0.415 16.207 Pos. posterior = 1.0000
78 1 0.515 7.829 Pos. posterior = 0.9978
80 1 0.524 2.250 Pos. posterior = 0.9354
100 1 0.440 10.858 Pos. posterior = 0.9998
105 1 0.480 7.665 Pos. posterior = 0.9991
130 1 0.566 2.675 Pos. posterior = 0.9544
152 1 0.457 2.979 Pos. posterior = 0.9742
167 1 0.797 6.475 Pos. posterior = 0.9859
168 1 0.745 7.180 Pos. posterior = 1.0000
186 1 0.611 5.908 Pos. posterior = 0.9822
188 1 0.504 13.466 Pos. posterior = 0.9990
220 1 0.948 29.354 Pos. posterior = 1.0000
294 1 0.194 15.371 Pos. posterior = 1.0000
326 1 0.551 8.062 Pos. posterior = 0.9958
386 1 0.483 5.659 Pos. posterior = 0.9940
452 1 0.418 5.602 Pos. posterior = 0.9978
462 1 0.482 7.359 Pos. posterior = 0.9991
552 1 0.533 3.057 Pos. posterior = 0.9690
596 1 0.560 5.589 Pos. posterior = 0.9855

FUBAR inferred 20 sites subject to diversifying positive selection at posterior probability >= 0.9

Of these, 0.23 are expected to be false positives (95% confidence interval of 0-1 )

singing-scientist commented 6 years ago

THANK YOU immensely for the quick response and effective solution! I successfully got the same results as you when removing the terminal STOP codons for this dataset.

ONE IMPORTANT FOLLOW-UP: to test whether terminal STOP codons with ambiguous characters might have changed my results in cases where FUBAR DIDN'T obviously fail, I re-ran FUBAR for some datasets where it had seemed to work properly after trimming them for their terminal STOPs. Interestingly, it turns out that these terminal/ambiguous STOP positions were sometimes being identified as under positive selection, even when there were NO fully defined codons in the alignment (only ---, NNN, TAN, and NAA are present in the alignment). This suggests that FUBAR is considering these to be nonsynonymous differences, but that can't be right. This seems like a problem — how to circumvent? Should one replace ALL codons containing even one indeterminate nucleotide (e.g., TAN) with gaps (e.g., ---) when using FUBAR?

Example data in an email...

spond commented 6 years ago

Dear @cwnelson88,

The result with ambigs you report is actually not unexpected. You can ignore NNN and --- (these are fully missing data), but when an alignment site contains TAN and NAA something more interesting happens.

HyPhy will represent TAN as a mixture of TAC and TAT (since TAA and TAG are stop codons). They both code for tyrosine. NAA will be considered as a mixture of AAA (Lys), CAA (Gln), GAA (Glu). Therefore there must be non-synonymous substitutions to convert one codon to another, and there are no "direct" synonymous pathways to do so. Consequently, having a lot of NAA and TAN lined up in the same alignment column is more or less equivalent to having a lot of amino-acid variability.

Obviously this is not what you might expect to find, but it fact finding positive selection at sites like this is expected given the customary way to handle missing data. In order to avoid such behavior it would indeed be advisable to mask sites containing N's with gaps.

May I ask why your data have so many partially unresolved codons? I have not come across these scenarios before and would like to understand whether or not there is some biological meaning that could be extracted by a more refined handling of ambiguous nucleotides.

Best, Sergei

singing-scientist commented 6 years ago

Dear @spond,

This makes perfect sense — thank you, again!

I was most surprised by this position because there was only one 'TAN' and one 'NAA' — all other sequences were '---' or 'NNN'. I would guess all '---' would be ignored, 'NNN' would by definition produce dN=dS; finally, comparisons between all possible codons represented by the two partially defined codons with missing data (TAC/TAT/AAA/CAA/GAA) tipped this codon position over the edge to dN>dS.

In other words, if I'm uncomfortable with very few defined codons (or no fully defined codons) being used to determine a site's dN/dS, I need to pre-filter the data for such instances before submitting to HYPHY. Alternatively, would it work simply to replace all N's with gaps, even if this means nearly all / all sequences at a certain position become undefined?

A major contributor to the N's here is that certain sites have been 'struck' because of problematic sequencing and/or low qualities (the particular IonTorrent protocol apparently has difficulty in some regions of this genome). One option that might be useful is to be able to specify a minimum number of defined (non-gap, non-N) nucleotides at any given codon and/or site. If the threshold is not reached, the codon/site could then be ignored.

spond commented 6 years ago

Dear @cwnelson88,

The reason you see this level of signal codon 84 is because the two ambiguous codons must realize the required changes along two very short branches. In this case you can readily get significant result based on only two sequences, because

(amount of evolutionary change) ~ rate * time 

so if you make the time (branch length) small enough, the rates will have to be increased to maximize the likelihood of observed data.

As a general suggestion, I would pre-filter your data to only include unique sequences (allowing ambiguities to match as well); there is no additional signal for selection analysis from multiple copies of identical sequences, and they just gum up the computational works. In your example.fasta file, this will for example reduce the size of the dataset from 310 to 15 sequences.

I understand the provenance of IonTorrent N's (i.e. they are instrument artifacts); I would suggest that instead of masking the sites a priori, you could rather do a post hoc analysis where you can remove sites from results if they don't have sufficient non-N content.

Best, Sergei

singing-scientist commented 6 years ago

Thanks again, @spond, for this extremely helpful feedback and advice! Much obliged, and excited to continue using HyPhy — now in a more informed way!

Best, Chase