Closed liamfriar closed 6 months ago
Dear @liamfriar,
Just to be clear: did you use endosymbiont as test and free-living as reference? Also, which HyPhy/RELAX versions are you running?
3.9 subs/site for the entire tree is not too bad; saturation really becomes a problem if you have very long individual branches.
I would also try a few other things:
- Run with --starting-points N, where N is ~10, and --models Minimal.
- Run with --srv Yes (to turn on site-to-site synonymous rates). This may have a pretty strong effect for divergent species.

Can you share one or two of your alignments and trees? I can take a look and see if anything jumps out.
Best, Sergei
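The options suggested above can be combined into a single call. The following is only a minimal sketch: the helper name relax_dry_run and the file names are hypothetical, and the function prints the command rather than running it, so the flags can be checked before a long run is started. The flags themselves are the ones mentioned in this thread.

```shell
# Hypothetical helper: prints (does not run) a RELAX call that combines
# --starting-points 10, --models Minimal and --srv Yes.
# $1 = alignment, $2 = labeled tree, $3 = output file
relax_dry_run() {
  echo hyphy relax \
    --alignment "$1" \
    --tree "$2" \
    --test test \
    --reference reference \
    --models Minimal \
    --starting-points 10 \
    --srv Yes \
    --output "$3"
}

# Example with placeholder file names; inspect the printed line, then run it.
relax_dry_run msa.tmp tree.labeled.txt relax.json
```

Printing first is a deliberate choice here: it keeps the sketch safe to paste and makes it easy to diff against the command already in use.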
Hi @spond
Thank you for helping! Yes, endosymbionts are test and free-living are reference
I am running HYPHY 2.5.51(MP) for Linux on x86_64
Orthologous genes were determined by orthofinder
MSAs were generated with MACSE
Trees were generated by orthofinder (which implemented FastTree, I believe) and then trees were trimmed to remove unwanted sequences using Gotree
The initial call is:
hyphy relax --alignment msa.tmp --tree ${tree_dir}/${hog}_tree.labeled.txt --test "test" --reference "reference" --models Minimal --output $outfile > $stdoutfile
And then for runs that fail to converge, the following is used, which gets about 2/3 of the initially unconverged runs to converge.
hyphy CPU=1 relax --starting-points 100 --grid-size 2000 --models Minimal --alignment msa.tmp --tree ${tree_dir}/${hog}_tree.labeled.txt --test "test" --reference "reference" --output $outfile > $stdoutfile
I have been scraping the result, k-value, and p-value from the $stdoutfile as it is easier for me to parse than the .json, but I can of course switch that if needed.
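For reference, pulling the same numbers out of the .json can be short if jq is available. This is a sketch under assumptions: it assumes jq is installed, and it assumes the RELAX output stores the test under a "test results" object with the keys "relaxation or intensification parameter", "p-value" and "LRT". Those key names should be checked against an actual output file before relying on this. The function name relax_summary is hypothetical.

```shell
# Hypothetical helper: print k, p-value and LRT (tab-separated) from one
# RELAX .json file. The key names are assumptions; verify them against a real
# output file. Requires jq.
# $1 = path to the RELAX .json output
relax_summary() {
  jq -r '."test results"
         | [."relaxation or intensification parameter", ."p-value", .LRT]
         | @tsv' "$1"
}

# Usage (uncomment once $outfile points at a real result):
# relax_summary "$outfile"
```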
--starting-points is beneficial because it gives the algorithm more seeds to find the best fit, but detrimental from a time standpoint. Is that correct?
omega values aren't insanely high or 0 with too great a portion of the distribution?
Thank you so much for this tool and for being so generous with your responses to these questions.
Hi, my question is very similar to #1411, but I was not able to determine next steps from reading that issue.
I have 40 outgroup (free-living) and 8 ingroup (endosymbiont) bacterial genomes. The root of the full tree is almost certainly over 1 billion years old. The root of the ingroup tree is 50-100 million years old. I have run RELAX on ~2000 single-copy orthologs using the ingroup a priori. Looking at the portion of orthologs that fell into each broad result category:
To try to control for bias from the algorithm and dataset, I then ran RELAX again, instead using each of 2 different subgroups from the outgroup as the new ingroup (and not including the old ingroup). Both new ingroups are not endosymbionts, but the breakdown of results looks pretty similar.
I certainly would not expect 1/3 of a free-living bacterium's genome to be under relaxed selection.
Could this be the result of the outgroup being too divergent and thus over-saturated for mutations or could it be a result of ingroups being not divergent enough? Randomly taking one of the 2000 results files, I see
* 1 partition. Total tree length by partition (subs/site) 3.900
Maybe it would be helpful to run FitMG94 on the 2000 orthologs, concatenated, to see if dS is within a reasonable range like 0.5-1.5?
Maybe it would be helpful to subset the outgroup to be basically pairs of genomes that are similarly diverged from one another as the most diverged 2 genomes in the ingroup (which I could then also subset to use only those 2 genomes)?
Maybe this way of counting results in each category isn't valid? I want to look at specific genes within the 2000, but I want to make sure I have used this algorithm correctly first.
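To make the counting explicit, here is a minimal sketch of one way to tally categories. It assumes a tab-separated table with one line per ortholog (gene, k, p-value), a significance threshold of 0.05, and the usual RELAX reading of k < 1 as relaxation and k > 1 as intensification. The function name tally_relax, the threshold, and the table layout are all assumptions, and no multiple-testing correction is applied.

```shell
# Hypothetical helper: count RELAX results per category.
# stdin: tab-separated lines of gene, k, p-value
# $1   : significance threshold (assumed default 0.05)
tally_relax() {
  awk -F '\t' -v alpha="${1:-0.05}" '
    BEGIN { n["relaxed"] = 0; n["intensified"] = 0; n["not_significant"] = 0 }
    # Significant and k < 1: counted as relaxed selection.
    ($3 + 0) < (alpha + 0) && ($2 + 0) < 1 { n["relaxed"]++; next }
    # Significant and k > 1: counted as intensified selection.
    ($3 + 0) < (alpha + 0) && ($2 + 0) > 1 { n["intensified"]++; next }
    # Everything else (including k exactly 1) is counted as not significant.
    { n["not_significant"]++ }
    END {
      printf "relaxed\t%d\n", n["relaxed"]
      printf "intensified\t%d\n", n["intensified"]
      printf "not_significant\t%d\n", n["not_significant"]
    }'
}

# Example with three made-up rows (not real results).
printf 'HOG1\t0.42\t0.001\nHOG2\t1.80\t0.020\nHOG3\t0.95\t0.400\n' | tally_relax
```

With ~2000 tests, applying a multiple-testing correction to the p-values before tallying would change the counts, so the threshold here is only a placeholder.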
Any other thoughts/suggestions would be hugely appreciated
Thank you!