veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
200 stars 68 forks source link

Saturation and uneven tree topology (RELAX) #1661

Closed liamfriar closed 6 months ago

liamfriar commented 8 months ago

Hi, my question is very similar to #1411 , but I was not able to determine next steps from reading that issue.

I have 40 outgroup (free-living) and 8 ingroup (endosymbiont) bacteria genomes. The root of the full tree is almost certainly over 1 billion years old. The root of the ingroup tree is 50-100 million years old. I have run RELAX on ~2000 single-copy orthologs using the ingroup a priori . Looking at the portion of orthologs that fell into each broad result category:

35% of orthologs relaxation
3.5% intensification
61% not significant
0.5% did not converge.

To try to control for bias from the algorithm and dataset, I then ran RELAX again, instead using each of 2 different subgroups from the outgroup as the new ingroup (and not including the old ingroup). Both new ingroups are not endosymbionts, but the breakdown of results looks pretty similar.

14% / 33% relaxation
6% / 2.5% intensification
79% / 60% not significant
1% / 0.5% did not converge

I certainly would not expect 1/3 of a free-living bacteria's genome to be under relaxed selection.

  1. Could this be the result of the outgroup being too divergent and thus over-saturated for mutations or could it be a result of ingroups being not divergent enough? Randomly taking one of the 2000 results files, I see * 1 partition. Total tree length by partition (subs/site) 3.900 Maybe it would be helpful to run FitMG94 on the 2000 orthologs, concatenated to see if dS is within a reasonable range like 0.5-1.5?

  2. Maybe it would be helpful to subset the outgroup to be basically pairs of genomes that are similarly diverged from one another as the most diverged 2 genomes in the ingroup (which I could then also subset to use only those 2 genomes)?

  3. Maybe this way of counting results in each category isn't valid? I want to look at specific genes within the 2000, but I want to make sure I have used this algorithm correctly first.

  4. Any other thoughts/suggestions would be hugely appreciated

Thank you!

spond commented 8 months ago

Dear @liamfriar,

Just to be clear: did you use endosymbiont as test as free-living as reference? Also, which HyPhy/RELAX versions are you running?

3.9 subs/site for the entire tree is not too bad, saturation really becomes a problem if you have very long individual branches.

I would also try a few other things

  1. Try setting --starting-points N where N is ~10, and --models Minimal.
  2. Explore other analysis settings, especially --srv Yes (to turn on site-to-site synonymous rates). This may have a pretty strong effect for divergent species.
  3. Inspect individual analysis runs to make sure point estimates for the distributions look kosher.

Can you share one or two of your alignments and trees? I can take a look and see if anything jumps out.

Best, Sergei

liamfriar commented 8 months ago

Hi @spond

Thank you for helping! Yes, endosymbionts are test and free-living are reference

I am running HYPHY 2.5.51(MP) for Linux on x86_64 Orthologous genes were determined by orthofinder MSAs were generated with MACSE Trees were generated by orthofinder (which implemented FastTree I believe) and then trees were trimmed to remove unwanted sequences using Gotree The initial call is: hyphy relax --alignment msa.tmp --tree ${tree_dir}/${hog}_tree.labeled.txt --test "test" --reference "reference" --models Minimal --output $outfile > $stdoutfile And then for runs that fail to converge, the following is used, which gets about 2/3 of the initially unconverged runs to converge. hyphy CPU=1 relax --starting-points 100 --grid-size 2000 --models Minimal --alignment msa.tmp --tree ${tree_dir}/${hog}_tree.labeled.txt --test "test" --reference "reference" --output $outfile > $stdoutfile I have been scraping the result, k-value, and p-value from the $stdoutfile as it is easier for me to parse than the .json, but I can of course switch that if needed.

  1. My understanding is that increasing --starting-points is beneficial because it gives the algorithm more seeds to find the best fit, but detrimental from a time standpoint. Is that correct?
  2. Is there any downside to --srv in your opinion? I read a bit about it in your recent publication https://doi.org/10.1093/molbev/msad150 and it seems like it should be purely helpful.
  3. Does this basically mean that the omega values aren't insanely high or 0 with too great a portion of the distribution?
  4. I have attached 4x(msa, tree, out.json, stdout.txt). The gene is identified by the number in the filenames (0301 or 1155) and the test group is identified by the beginning of the filename (endosymbiont or control where control is subsampled from reference that was used when test was endosymbiont and then run against the same reference minus itself and without the endosymbiont clade). files4github.gz

Thank you so much for this tool and for being so generous with your responses to these questions.

github-actions[bot] commented 6 months ago

Stale issue message