Closed 00-kelvin closed 4 months ago
Dear @00-kelvin,
This is most likely due to a sequence name with "odd" characters. Would you please share the input alignment and tree so I can check? Also, please confirm your hyphy
version (hyphy --version
).
Best, Sergei
HYPHY 2.5.60(MP) for Linux on x86_64 x86 SSE4 SIMD zlib (v1.2.13)
Alignment and tree attached as a zip file. Let me know if you can't open them for whatever reason. Many thanks for your quick reply!
Dear @00-kelvin,
Yes, a sequence name issue. I wrote a little utility script to help with this, and also to allow you to map slighlty non-matching IDs in the alignment / tree files (in your case you have =
and _
mismtatches for some names). See https://github.com/veg/hyphy-analyses/tree/master/clean-names with a specific use example for your files there.
The resulting .nex
file and .BUSTED.json
file are attached.
$hyphy busted --alignment test/N1.HOG0007354.nex --error-sink Yes
....
* Log(L) = -98776.92, AIC-c = 198333.08 (389 estimated parameters)
* For *test* branches, the following rate distribution for branch-site combinations was inferred
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.000 | 79.080 | |
| Negative selection | 0.102 | 20.838 | |
| Diversifying selection | 1.777 | 0.027 | |
| Error absorption |9999999171.5...| 0.056 | |
* The following rate distribution for site-to-site **synonymous** rate variation was inferred
| Rate | Proportion, % | Notes |
|-----------------------------------|---------------|-----------------------------------|
| 0.129 | 83.843 | |
| 0.339 | 15.348 | |
| 103.850 | 0.809 | |
Performing the constrained (dN/dS > 1 not allowed) model fit
* Log(L) = -98776.99, AIC-c = 198331.22 (388 estimated parameters)
* For *test* branches under the null (no dN/dS > 1 model), the following rate distribution for branch-site combinations was inferred
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.000 | 79.079 | |
| Negative selection | 0.102 | 20.838 | |
| Neutral evolution | 1.000 | 0.027 | |
| Error absorption |9999999171.5...| 0.057 | |
* The following rate distribution for site-to-site **synonymous** rate variation was inferred
| Rate | Proportion, % | Notes |
|-----------------------------------|---------------|-----------------------------------|
| 0.128 | 83.843 | |
| 0.339 | 15.348 | |
| 103.844 | 0.809 | |
----
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p = 0.4643**.
The "error-absorption" component allows you to filter out local alignment issue (use https://observablehq.com/@spond/busted to load the JSON and look around).
Many automated alignment procedures leave things like the folllowing in place (amino-acids 1000-1050 in your example file), which would "light up" as selection, but are more likely misalignments/misannotation.
Best, Sergei
Many thanks, Sergei! I will definitely use the error-sink flag in the future, that seems very useful. Thank you for sorting this out for me. The clean-names tool will come in handy too, I'm sure.
Hi Sergei! Sorry to bother you again -- I could not find where you attached the .json file. Would you be able to re-send? I was able to re-run the analysis with the error sink and get results, but I just wanted to check that they matched your results. Thank you!
Thank you so much for your help, Sergei! I have another couple of questions for you:
I am curious what software you are using to visualize alignments, as shown in the screenshot you sent on 5/16? It looks like a great way to visually inspect the quality of alignments and I'd like to download it!
I noticed that other Hyphy programs that I am interested in running on the same data sets (specifically RELAX and MEME) do not have the --error-sink option that BUSTED has for filtering out alignment errors. I was wondering if you had any recommendation for software that could reproduce the effects of the --error-sink flag by removing small stretches of error post-alignment before applying HyPhy tests? Some options I am looking into are TAPER, HMMCleaner, BMGE or trimAl; have you used any of these or others with success before selection analyses?
Thank you again! I know I am getting a little off topic here, so if you would rather communicate directly via email going forward that's fine by me: crunnel2@jh.edu
Dear @00-kelvin,
The alignment viewer is shown at the bottom of the Observable BUSTED visualization notebook (https://observablehq.com/@spond/busted)
I usually don't do that much data cleaning, but in my experience even heavily cleaned alignments using the tools you mention let some residual errors bleed through. Whatever you use is fine :) My original idea for filtering within HyPhy goes something like this
(a). Run BUSTED with --error-sink Yes
.
For example
$hyphy busted --alignment tests/data/HIVvif.nex --error-sink Yes
...
* Log(L) = -3377.71, AIC-c = 6917.78 (80 estimated parameters)
* For *test* branches, the following rate distribution for branch-site combinations was inferred
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.444 | 45.820 | |
| Negative selection | 0.956 | 46.886 | |
| Diversifying selection | 1.012 | 7.249 | |
| Error absorption | 1908.156 | 0.045 | |
* The following rate distribution for site-to-site **synonymous** rate variation was inferred
| Rate | Proportion, % | Notes |
|-----------------------------------|---------------|-----------------------------------|
| 0.294 | 54.428 | |
| 1.162 | 32.700 | |
| 3.575 | 12.872 | |
Performing the constrained (dN/dS > 1 not allowed) model fit
* Log(L) = -3377.71, AIC-c = 6915.73 (79 estimated parameters)
* For *test* branches under the null (no dN/dS > 1 model), the following rate distribution for branch-site combinations was inferred
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.443 | 45.866 | |
| Negative selection | 0.957 | 44.071 | |
| Neutral evolution | 1.000 | 10.019 | Collapsed rate class |
| Error absorption | 1908.156 | 0.045 | |
* The following rate distribution for site-to-site **synonymous** rate variation was inferred
| Rate | Proportion, % | Notes |
|-----------------------------------|---------------|-----------------------------------|
| 0.294 | 54.428 | |
| 1.161 | 32.700 | |
| 3.574 | 12.872 | |
----
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p = 0.5000**.
(b). Run hyphy error-filter
using the BUSTED .json
.
$hyphy error-filter --json tests/data/HIVvif.nex.BUSTED.json --output tests/data/HIVvif-filtered.nex --output-json tests/data/HIVvif-filtered.json
### Filtering results
| Sequence | Filtered Sites |Filtered Proportion, %|
|--------------------------------------------------|--------------------|----------------------|
| B_LAI | 0 | 0.000 |
| B_P896 | 0 | 0.000 |
| B_YU2 | 1 | 0.052 |
| B_YU10 | 1 | 0.052 |
| B_3_7 | 0 | 0.000 |
| B_MN | 1 | 0.052 |
| B_C18MBC | 0 | 0.000 |
| B_U39_32 | 1 | 0.052 |
| B_U13_2 | 1 | 0.052 |
| B_GMK1 | 0 | 0.000 |
| B_8_18 | 0 | 0.000 |
| B_IFA10 | 0 | 0.000 |
| B_JRCSF | 0 | 0.000 |
| B_JRFL | 0 | 0.000 |
| B_PA_3799 | 0 | 0.000 |
| B_BCSG3C | 0 | 0.000 |
| B_I2 | 0 | 0.000 |
| B_SF2 | 0 | 0.000 |
| B_3202A12 | 1 | 0.052 |
| B_F12 | 0 | 0.000 |
| B_OYI | 2 | 0.104 |
| B_WEAU160 | 0 | 0.000 |
| B_NL43 | 1 | 0.052 |
| B_CAM1 | 0 | 0.000 |
| B_D31 | 0 | 0.000 |
| B_MANC | 0 | 0.000 |
| B_HAN | 1 | 0.052 |
| B_AD8 | 0 | 0.000 |
| B_RF | 3 | 0.156 |
Masked a total of **13** or 0.233% sites
(c). Take the filtered results alignment file from step (b) and feed it into MEME or other analyses.
ORIGINAL MEME
$hyphy meme --alignment tests/data/HIVvif.nex --pvalue 0.05
....
| Codon | Partition | alpha |non-syn rate (beta) distribution, rates : weights| LRT |Episodic selection detected?| # branches | List of most common codon substitutions at this site |
|:----------:|:----------:|:----------:|:-----------------------------------------------:|:----------:|:--------------------------:|:----------:|:---------------------------------------------------------------------:|
| 6 | 1 | 0.000 | 0.00/1949.24 : 1.00/0.00 | 14.135 | Yes, p = 0.0004 | 1 | [1]CAG>GCA |
| 31 | 1 | 4.705 | 2.75/82.96 : 0.38/0.62 | 5.463 | Yes, p = 0.0297 | 2 | [4]Att>Gtt|[1]aTT>aAG,aTt>aCt,atT>atC,ATt>TGt |
| 33 | 1 | 4.863 | 3.38/465.67 : 0.73/0.27 | 6.129 | Yes, p = 0.0211 | 0 | [4]aGg>aAg|[2]Agg>Ggg|[1]agG>agA,Ggg>Agg,Ggg>Cgg |
| 37 | 1 | 9.429 | 4.12/254.03 : 0.43/0.57 | 4.632 | Yes, p = 0.0456 | 0 | [1]aaA>aaC,aAa>aGa,AaA>GaC,GGa>AAa,Gga>Aga,ggA>ggG,ggA>ggT,GGT>AAG |
| 62 | 1 | 0.000 | 0.00/123.59 : 0.64/0.36 | 6.166 | Yes, p = 0.0207 | 2 | [1]gCT>gAA,gCt>gGt |
| 101 | 1 | 0.000 | 0.00/10.26 : 0.00/1.00 | 4.914 | Yes, p = 0.0394 | 4 | [4]gAc>gGc|[2]Gac>Aac,gaC>gaA |
| 109 | 1 | 0.000 | 0.00/89.18 : 0.10/0.90 | 14.480 | Yes, p = 0.0003 | 4 | [2]CtG>AtA|[1]CTG>ACA,cTg>cGg |
| 124 | 1 | 0.000 | 0.00/77.80 : 0.48/0.52 | 5.293 | Yes, p = 0.0324 | 5 | [3]Ata>Tta|[1]Ata>Cta,ATa>TCa |
FILTERED MEME (notice how some of the sites drop off compared to the original alignment)
$hyphy meme --alignment tests/data/HIVvif-filtered.nex --pvalue 0.05
| Codon | Partition | alpha |non-syn rate (beta) distribution, rates : weights| LRT |Episodic selection detected?| # branches | List of most common codon substitutions at this site |
|:----------:|:----------:|:----------:|:-----------------------------------------------:|:----------:|:--------------------------:|:----------:|:---------------------------------------------------------------------:|
| 33 | 1 | 3.823 | 3.82/452.52 : 0.81/0.19 | 6.823 | Yes, p = 0.0148 | 0 | [4]aGg>aAg|[2]Agg>Ggg|[1]agG>agA,Ggg>Agg,Ggg>Cgg |
| 37 | 1 | 11.600 | 5.09/263.32 : 0.00/1.00 | 4.576 | Yes, p = 0.0469 | 0 | [1]aaA>aaC,aAa>aGa,AaA>GaC,GGa>AAa,Gga>Aga,ggA>ggG,ggA>ggT,GGT>AAG |
| 101 | 1 | 0.000 | 0.00/11.20 : 0.00/1.00 | 4.937 | Yes, p = 0.0390 | 4 | [4]gAc>gGc|[2]Gac>Aac,gaC>gaA |
| 109 | 1 | 0.000 | 0.00/47.12 : 0.00/1.00 | 9.048 | Yes, p = 0.0048 | 2 | [2]CtG>AtA|[1]cTg>cGg |
Best, Sergei
Sergei,
Thank you as always for your quick replies! I will definitely try your strategy for filtering within Hyphy.
Calvin
Hi there!
I am testing out BUSTED on an alignment and I got an error which I cannot find anyone else running into online. This same error happened for 2 different alignments I tried to run. I think it may be related to the gene names that I have in my alignment and tree files (they are very long -- I'm planning to change them in the future) but I am not sure how/why. As far as I can tell, the gene names in the tree and the alignment files match one another. Do you know why I might be getting this error?
Here is the errors.log:
and if it helps, here is the rest of the system output from the attempt:
Thank you so much in advance for your help! Calvin