veg / evogenomics_hyphy

1 stars 0 forks source link

BUSTED conclusions do not match tutorial #1

Closed singing-scientist closed 2 months ago

singing-scientist commented 2 months ago

Greetings, and thank you so much for the great Spielman et al. 2017 tutorial! As I'm working through using HYPHY 2.5.62(MP) for Darwin on arm64 ARM Neon SIMD zlib (v1.2.12), I get very different results for the first exercise. Specifically, I get the following output for 1 Selection / 5 BUSTED:

>Select a coding sequence alignment file (`/Users/chasenelson/Documents/GitHub/evogenomics_hyphy/datasets/`) ksr2.fna
>Loaded a multiple sequence alignment with **9** sequences, **949** codons, and **1** partitions from `/Users/chasenelson/Documents/GitHub/evogenomics_hyphy/datasets/ksr2.fna`

>branches => All

>srv => Yes
The number omega rate classes to include in the model (permissible range = [1,10], default value = 3, integer): 
>rates => 3

>multiple-hits => None
The number alpha rate classes to include in the model [1-10, default 3] (permissible range = [1,10], default value = 3, integer): 
>syn-rates => 3

>error-sink => No
The number of points in the initial distributional guess for likelihood fitting (permissible range = [1,10000], default value = 250, integer): 
>grid-size => 250
The number of initial random guesses to 'seed' rate values optimization (permissible range = [1,25], default value = 1, integer): 
>starting-points => 1

### Branches to test for selection in the BUSTED analysis
* Selected 15 branches to test in the BUSTED analysis: `HUM, PAN, Node6, GOR, Node5, PON, Node4, GIB, Node3, MAC, BAB, Node12, Node2, MAR, BUS`

### Obtaining branch lengths and nucleotide substitution biases under the nucleotide GTR model

>kill-zero-lengths => Yes
* Log(L) = -5768.11, AIC-c = 11582.26 (23 estimated parameters)
* 1 partition. Total tree length by partition (subs/site)  0.121

### Obtaining the global omega estimate based on relative GTR branch lengths and nucleotide substitution biases
* Log(L) = -5342.82, AIC-c = 10745.85 (30 estimated parameters)
* 1 partition. Total tree length by partition (subs/site)  0.134
* non-synonymous/synonymous rate ratio for *test* =   0.0341

### Improving branch lengths, nucleotide substitution biases, and global dN/dS ratios under a full codon model
* Log(L) = -5333.75, AIC-c = 10727.71 (30 estimated parameters)
* non-synonymous/synonymous rate ratio for *test* =   0.0306

### Performing the full (dN/dS > 1 allowed) branch-site model fit
* Log(L) = -5300.23, AIC-c = 10678.82 (39 estimated parameters)
* For *test* branches, the following rate distribution for branch-site combinations was inferred

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.000     |    5.581    |                                   |
|        Negative selection         |     0.012     |   94.046    |                                   |
|      Diversifying selection       |     9.781     |    0.374    |                                   |

* The following rate distribution for site-to-site **synonymous** rate variation was inferred

|               Rate                | Proportion, % |               Notes               |
|-----------------------------------|---------------|-----------------------------------|
|               0.398               |    32.899     |                                   |
|               0.420               |    52.826     |                                   |
|               4.533               |    14.275     |                                   |

### Performing the constrained (dN/dS > 1 not allowed) model fit
* Log(L) = -5301.39, AIC-c = 10679.13 (38 estimated parameters)
* For *test* branches under the null (no dN/dS > 1 model), the following rate distribution for branch-site combinations was inferred

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.000     |   89.934    |                                   |
|        Negative selection         |     0.000     |    6.759    |       Collapsed rate class        |
|         Neutral evolution         |     1.000     |    3.307    |                                   |

* The following rate distribution for site-to-site **synonymous** rate variation was inferred

|               Rate                | Proportion, % |               Notes               |
|-----------------------------------|---------------|-----------------------------------|
|               0.381               |    85.219     |                                   |
|               3.603               |    13.886     |                                   |
|              19.555               |     0.895     |                                   |

----
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p =   0.1564**.

I note in particular that the conclusion differs from that reached in the tutorial, i.e. I get p = 0.1564 rather than p = 0.0015. My gut says this is due to recent advances in synonymous rate variation, whose incorporation reveals that the signal of positive selection is not significant. However, it would also be great to have some reassurance that this result is reproducible and not something going wrong with my installation/machine. Would be very grateful for any feedback / verification / understanding!

With gratitude, Chase

spond commented 2 months ago

Dear @singing-scientist,

Your intuition is correct. In 2024, HyPhy will use site-to-site synonymous rate variation (SRV) by default. In the tutorial, the default (in fact the only option for 2017) was to have constant synonymous rates. To emulate this behavior one needs to specify --srv No, or run hyphy busted -i (to force all prompts even with default values) for an interactive mode

Try

hyphy busted --alignment ksr2.fna --srv No
...

## Performing the full (dN/dS > 1 allowed) branch-site model fit
* Log(L) = -5319.96, AIC-c = 10708.20 (34 estimated parameters)
* For *test* branches, the following rate distribution for branch-site combinations was inferred

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.025     |   99.830    |                                   |
|        Negative selection         |     0.028     |    0.132    |                                   |
|      Diversifying selection       |    113.977    |    0.037    |                                   |

### Performing the constrained (dN/dS > 1 not allowed) model fit
* Log(L) = -5326.47, AIC-c = 10719.21 (33 estimated parameters)
* For *test* branches under the null (no dN/dS > 1 model), the following rate distribution for branch-site combinations was inferred

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.000     |   94.204    |                                   |
|        Negative selection         |     0.000     |    2.486    |       Collapsed rate class        |
|         Neutral evolution         |     1.000     |    3.310    |                                   |

----
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p =   0.0007**.

Best, Sergei

singing-scientist commented 2 months ago

Works like a charm! Thanks so much as always your your swift help and explanation, greatly appreciated!

Yours, Chase