Closed singing-scientist closed 6 years ago
Dear Nelson,
This behavior is abnormal. Could you please share the data file that generated 0 scaling? I’d like to replicate and resolve the issue. You can email me or post here.
Best, Sergei
Dear Sergei: thanks so much for your swift response. I have emailed example data to your Temple address. The data resulted in the above situation when using the Mac Terminal 'HYPHYMP' call with FUBAR (4) and all defaults. I look forward to hearing back!
Dear @cwnelson88,
I was able to identify the cause of the problem. First, in order to fix it, you need to remove the last alignment column from your alignment. Why, you might ask?
Your dataset represents a very interesting "edge-case" for dealing with ambiguous data. If you recall, FUBAR first fits nucleotide model to estimate branch lengths, and then uses these branch lengths as fixed during codon optimization. In particular, if any branch length is estimated as '0', it will always remain '0' during the FUBAR run. A '0' length branch must connect identical sequences, because the probability of evolving any change along this branch is '0'. With that in mind, consider two sequences, A and B which are identical except for the last codon (they exist in your datafile), which has ambiguous characters
>A
[some sequence] TNA
>B
[same sequence] TGN
What should the branch length be under the nucleotide model? Well, because the sequences are identical except for two 'N' characters, and 'N' is missing data (in other words, NG and AN columns constitute no evolution), the MLE for the branch length is '0'
Now consider the same data as codons and ask the question what the branch length should be. The answer is: no longer 0! That is because
TNA -> TCA or TTA (TAA and TGA are stop codons)
TGN -> TGC or TGG or TGT (TGA is a stop codon)
Which means that some evolution must take place to convert TNA to TGN, and therefore the codon branch length cannot be zero.
Putting this all together
If you remove the last column (which contains no useful information, one might say), this behavior is no longer present. FUBAR results are pasted below.
I'll need to think of a more systematic fix, so the code can automatically report that something weird is going on, and flag it as an error, or better yet, a more robust way to deal with branch length approximations.
Best, Sergei
Codon | Partition | alpha | beta | Posterior prob for positive selection |
---|---|---|---|---|
63 | 1 | 0.429 | 15.968 | Pos. posterior = 1.0000 |
64 | 1 | 0.415 | 16.207 | Pos. posterior = 1.0000 |
78 | 1 | 0.515 | 7.829 | Pos. posterior = 0.9978 |
80 | 1 | 0.524 | 2.250 | Pos. posterior = 0.9354 |
100 | 1 | 0.440 | 10.858 | Pos. posterior = 0.9998 |
105 | 1 | 0.480 | 7.665 | Pos. posterior = 0.9991 |
130 | 1 | 0.566 | 2.675 | Pos. posterior = 0.9544 |
152 | 1 | 0.457 | 2.979 | Pos. posterior = 0.9742 |
167 | 1 | 0.797 | 6.475 | Pos. posterior = 0.9859 |
168 | 1 | 0.745 | 7.180 | Pos. posterior = 1.0000 |
186 | 1 | 0.611 | 5.908 | Pos. posterior = 0.9822 |
188 | 1 | 0.504 | 13.466 | Pos. posterior = 0.9990 |
220 | 1 | 0.948 | 29.354 | Pos. posterior = 1.0000 |
294 | 1 | 0.194 | 15.371 | Pos. posterior = 1.0000 |
326 | 1 | 0.551 | 8.062 | Pos. posterior = 0.9958 |
386 | 1 | 0.483 | 5.659 | Pos. posterior = 0.9940 |
452 | 1 | 0.418 | 5.602 | Pos. posterior = 0.9978 |
462 | 1 | 0.482 | 7.359 | Pos. posterior = 0.9991 |
552 | 1 | 0.533 | 3.057 | Pos. posterior = 0.9690 |
596 | 1 | 0.560 | 5.589 | Pos. posterior = 0.9855 |
Of these, 0.23 are expected to be false positives (95% confidence interval of 0-1 )
THANK YOU immensely for the quick response and effective solution! I successfully got the same results as you when removing the terminal STOP codons for this dataset.
ONE IMPORTANT FOLLOW-UP: to test whether terminal STOP codons with ambiguous characters might have changed my results in cases where FUBAR DIDN'T obviously fail, I re-ran FUBAR for some datasets where it had seemed to work properly after trimming them for their terminal STOPs. Interestingly, it turns out that these terminal/ambiguous STOP positions were sometimes being identified as under positive selection, even when there were NO fully defined codons in the alignment (only ---, NNN, TAN, and NAA are present in the alignment). This suggests that FUBAR is considering these to be nonsynonymous differences, but that can't be right. This seems like a problem — how to circumvent? Should one replace ALL codons containing even one indeterminate nucleotide (e.g., TAN) with gaps (e.g., ---) when using FUBAR?
Example data in an email...
Dear @cwnelson88,
The result with ambigs you report is actually not unexpected. You can ignore NNN and --- (these are fully missing data), but when an alignment site contains TAN and NAA something more interesting happens.
HyPhy will represent TAN
as a mixture of TAC
and TAT
(since TAA
and TAG
are stop codons). They both code for tyrosine. NAA
will be considered as a mixture of AAA
(Lys), CAA
(Gln), GAA
(Glu). Therefore there must be non-synonymous substitutions to convert one codon to another, and there are no "direct" synonymous pathways to do so. Consequently, having a lot of NAA and TAN lined up in the same alignment column is more or less equivalent to having a lot of amino-acid variability.
Obviously this is not what you might expect to find, but it fact finding positive selection at sites like this is expected given the customary way to handle missing data. In order to avoid such behavior it would indeed be advisable to mask sites containing N's with gaps.
May I ask why your data have so many partially unresolved codons? I have not come across these scenarios before and would like to understand whether or not there is some biological meaning that could be extracted by a more refined handling of ambiguous nucleotides.
Best, Sergei
Dear @spond,
This makes perfect sense — thank you, again!
I was most surprised by this position because there was only one 'TAN' and one 'NAA' — all other sequences were '---' or 'NNN'. I would guess all '---' would be ignored, 'NNN' would by definition produce dN=dS; finally, comparisons between all possible codons represented by the two partially defined codons with missing data (TAC/TAT/AAA/CAA/GAA) tipped this codon position over the edge to dN>dS.
In other words, if I'm uncomfortable with very few defined codons (or no fully defined codons) being used to determine a site's dN/dS, I need to pre-filter the data for such instances before submitting to HYPHY. Alternatively, would it work simply to replace all N's with gaps, even if this means nearly all / all sequences at a certain position become undefined?
A major contributor to the N's here is that certain sites have been 'struck' because of problematic sequencing and/or low qualities (the particular IonTorrent protocol apparently has difficulty in some regions of this genome). One option that might be useful is to be able to specify a minimum number of defined (non-gap, non-N) nucleotides at any given codon and/or site. If the threshold is not reached, the codon/site could then be ignored.
Dear @cwnelson88,
The reason you see this level of signal codon 84 is because the two ambiguous codons must realize the required changes along two very short branches. In this case you can readily get significant result based on only two sequences, because
(amount of evolutionary change) ~ rate * time
so if you make the time (branch length) small enough, the rates will have to be increased to maximize the likelihood of observed data.
As a general suggestion, I would pre-filter your data to only include unique sequences (allowing ambiguities to match as well); there is no additional signal for selection analysis from multiple copies of identical sequences, and they just gum up the computational works. In your example.fasta
file, this will for example reduce the size of the dataset from 310 to 15 sequences.
I understand the provenance of IonTorrent N's (i.e. they are instrument artifacts); I would suggest that instead of masking the sites a priori, you could rather do a post hoc analysis where you can remove sites from results if they don't have sufficient non-N content.
Best, Sergei
Thanks again, @spond, for this extremely helpful feedback and advice! Much obliged, and excited to continue using HyPhy — now in a more informed way!
Best, Chase
When I run FUBAR on Mac, sometimes "best scaling [is] achieved for" synonymous and non-synonymous rates of 0. In such cases, for all sites alpha = beta = 0, and obviously no selection is detected. Yet there is plenty of synonymous and non-synonymous variation among the sequences (measured separately using Nei-Gojobori), and certain genes (partitions) yield rates > 0 for the same trees that sometimes yield the ubiquitous rate = 0 phenomenon.
Is rate = 0 and a = b = 0 for all codons to be trusted as a result, or does this mean something has gone wrong? Many thanks!