veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org

Does excluding species with paralogous sequences violate aBSREL and/or RELAX assumptions of single-copy orthology? #1725

Closed pkfsantos closed 1 month ago

pkfsantos commented 4 months ago

I used orthogroups identified in OrthoFinder as input for HyPhy analyses. I am working with 27 species, and there are 2,592 orthogroups identified as single-copy in all species. To include more orthogroups in the HyPhy analysis, I included those that were single-copy in at least 70% of the species and excluded the species with paralogous sequences.
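The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the `counts` layout (orthogroup ID mapped to per-species gene copy numbers), the function name, and the toy species `A` to `D` are all assumptions made for the example.

```python
from typing import Dict, List


def select_orthogroups(
    counts: Dict[str, Dict[str, int]],
    n_species: int,
    min_fraction: float = 0.7,
) -> Dict[str, List[str]]:
    """Return, for each retained orthogroup, the species kept in it.

    counts maps orthogroup ID -> {species: gene copy number}.
    An orthogroup is retained when at least `min_fraction` of all species
    carry exactly one copy; species with zero or several copies are dropped.
    """
    kept = {}
    for og, per_species in counts.items():
        single_copy = sorted(sp for sp, n in per_species.items() if n == 1)
        if len(single_copy) >= min_fraction * n_species:
            kept[og] = single_copy
    return kept


# Toy input with four hypothetical species (not real data).
toy = {
    "OG_x": {"A": 1, "B": 1, "C": 1, "D": 2},  # 3 of 4 single-copy: kept, D dropped
    "OG_y": {"A": 1, "B": 3, "C": 2, "D": 1},  # 2 of 4 single-copy: discarded
}
selected = select_orthogroups(toy, n_species=4)
print(selected)
```

With 27 species and a 70% threshold, an orthogroup would need at least 19 single-copy species to pass this check.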

Does that violate HyPhy assumptions of single-copy orthology? I have read in another post that aBSREL can handle paralogous sequences, so I understand that including or not including the paralogous sequences does not violate the assumptions. What about RELAX?

I ask from an evolutionary perspective. If some orthogroups are evolving faster in certain species that have experienced gene duplication and loss, could these orthogroups be considered as faster evolving overall? Even in species that do not currently have duplications, could this affect the analysis or violate HyPhy's assumptions of single-copy orthology?

I appreciate any comments on that.

Priscila

spond commented 4 months ago

Dear @pkfsantos,

aBSREL and RELAX (and most other selection methods) do not make specific assumptions re: orthology v paralogy. Considerations specific to including multiple gene copies per organism would be as follows:

  1. Species trees may not apply here, so a good idea is to use gene trees.
  2. Depending on gene families, you may need to account for recombination-like processes (e.g. gene conversion).
  3. Depending on the function, you could expect that some of the copies may be under different selection pressure (e.g. neofunctionalization and/or pseudogenization).

Where you need to think carefully about your hypotheses is how you run these models: how to define branch sets or how to interpret inferences.

Orthogroups are a rather fluid concept :). Can you give me a few more specifics? For example, could you include a few example trees and elaborate on what evolutionary hypotheses you are interested in testing?

Best, Sergei

pkfsantos commented 4 months ago

Dear Sergei,

Thank you for your response. I can share more details of the analysis.

I am testing the hypothesis that bee species which have lost prepupal diapause show convergent signals of selection. In this context, "lost prepupal diapause" refers to species that either undergo diapause in the adult stage or do not undergo diapause at all. The species that do not experience prepupal diapause are the test branches in my analysis.

Here are the command lines I used:

For aBSREL:

$HOME/hyphy-2.5.55/hyphy LIBPATH=/path/hyphy-2.5.55/res absrel --tree {input.tree} \
--multiple-hits Double+Triple --srv Yes --branches Test --alignment {input.alignment} --output {output}

For RELAX:

$HOME/hyphy-2.5.55/hyphy LIBPATH=/path/hyphy-2.5.55/res relax --tree {input.tree} \
--multiple-hits Double+Triple --srv Yes --test Test --alignment {input.alignment} --models All --output {output}
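For a per-orthogroup workflow, the two invocations above could be assembled programmatically. The sketch below only reuses the arguments that appear in the commands in this thread; the function name, the default `hyphy` executable name, and the file names in the usage lines are hypothetical.

```python
import shlex
from typing import List, Optional


def hyphy_command(
    method: str,
    tree: str,
    alignment: str,
    output: str,
    hyphy: str = "hyphy",
    libpath: Optional[str] = None,
) -> List[str]:
    """Build an argument list mirroring the aBSREL/RELAX commands quoted above."""
    if method not in ("absrel", "relax"):
        raise ValueError(f"unsupported method: {method}")
    cmd = [hyphy]
    if libpath:
        cmd.append(f"LIBPATH={libpath}")
    cmd += [method, "--tree", tree,
            "--multiple-hits", "Double+Triple", "--srv", "Yes"]
    if method == "absrel":
        # aBSREL: branches labelled {Test} are the ones tested
        cmd += ["--branches", "Test", "--alignment", alignment]
    else:
        # RELAX: {Test} branches form the test set
        cmd += ["--test", "Test", "--alignment", alignment, "--models", "All"]
    cmd += ["--output", output]
    return cmd


# Hypothetical file names, for illustration only.
absrel_cmd = hyphy_command("absrel", "OG.nwk", "OG.fasta", "OG.absrel.json")
relax_cmd = hyphy_command("relax", "OG.nwk", "OG.fasta", "OG.relax.json")
print(shlex.join(absrel_cmd))
print(shlex.join(relax_cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.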

Below are examples of gene trees I used in these commands. They vary in the number of species included, ranging from 20 to 27. Although there are 27 species in total, I excluded species with paralogous sequences and their respective sequences from the analysis.

Trees:

OG0000758 (20 spp)- (Apis_mellifera{Test},(Bombus_campestris{Test},Bombus_vancouverensis{Test})98:0.1248289060,(((((((Osmia_bicornis:0.3083666780,Colletes_gigas:0.2810296065)71:0.0365119381,Ceratina_calcarata{Test})64:0.0256354274,Dufourea_novaeangliae:0.2744228762)56:0.0212980832,Andrena_dorsata{Test})29:0.0000021463,((((Augochlora_pura{Test},Megalopta_genalis{Test})94:0.0506933522,((Lasioglossum_leucozonium{Test},Halictus_rubicundus{Test})77:0.0110665378,Lasioglossum_morio{Test})98:0.0677231186)96:0.0619332421,Nomia_melanderi:0.0943288220)97:0.1655076919,(((Tetragonula_carbonaria{Test},Frieseomelitta_varia{Test})96:0.1042681291,Eufriesea_mexicana:0.1305487421)67:0.0380758794,Megachile_rotundata:0.4134365993)73:0.0318823231)79:0.3747813131)11:0.0000020202,Macropis_europaea:0.5991660057)47:0.0681293209,Peponapis_pruinosa:0.2863838610)96:0.1423775094);

OG0004459 (27 spp)- (Megachile_rotundata:0.3296167985,((Macropis_europaea:0.6722070970,(((((((Augochlorella_aurata{Test},Augochlora_pura{Test})100:0.0728959509,Megalopta_genalis{Test})100:0.2107037633,((Agapostemon_virescens{Test},Sphecodes_monilicornis{Test})74:0.0082589795,(Halictus_rubicundus{Test},(Lasioglossum_leucozonium{Test},Lasioglossum_morio{Test})100:0.0268164910)100:0.0251728570)100:0.0954600370)100:0.1831784910,Nomia_melanderi:0.4073433383)100:0.1515534240,Dufourea_novaeangliae:0.4419274950)100:0.2545452597,Colletes_gigas:0.4092848196)100:0.0647613480,Andrena_dorsata{Test})98:0.0329326202)100:0.2645936364,((((Peponapis_pruinosa:0.6184577402,Ceratina_calcarata{Test})61:0.0810584608,Tetrapedia_diversipes:0.3697968136)85:0.0339595109,(((Bombus_campestris{Test},Bombus_vancouverensis{Test})100:0.1768048806,(Tetragonula_carbonaria{Test},(Frieseomelitta_varia{Test},Melipona_quadrifasciata{Test})95:0.0218915305)100:0.1833626573)100:0.0920878035,((Euglossa_dilemma{Test},Eufriesea_mexicana:0.0707405197)100:0.2460430339,Apis_mellifera{Test})98:0.0553461104)100:0.2630127122)65:0.0405230248,Habropoda_laboriosa:0.3532001790)100:0.1712057359)100:0.3089256852,Osmia_bicornis:0.2965500768);

OG0003524 (27 spp) (Lasioglossum_morio{Test},((Sphecodes_monilicornis{Test},(((((((((Tetrapedia_diversipes:0.2688643652,Ceratina_calcarata{Test})76:0.0806217503,((((((((Osmia_bicornis:0.3547675633,Colletes_gigas:0.4312002746)87:0.1734586725,Peponapis_pruinosa:0.3207922286)88:0.0656998797,Apis_mellifera{Test})92:0.0727635516,Bombus_vancouverensis{Test})96:0.1695106008,Frieseomelitta_varia{Test})94:0.1657699678,(Tetragonula_carbonaria{Test},Melipona_quadrifasciata{Test})42:0.0000029268)94:0.1238911943,Bombus_campestris{Test})96:0.1035711120,(Eufriesea_mexicana:0.1301948644,Euglossa_dilemma{Test})100:0.2319604865)71:0.0670826138)79:0.0409503949,Habropoda_laboriosa:0.3169983723)95:0.1079844275,Megachile_rotundata:0.3937959160)67:0.0518535143,(Andrena_dorsata{Test},Macropis_europaea:0.3302231674)86:0.0872625623)98:0.1231582917,Dufourea_novaeangliae:0.2958274586)87:0.0543614345,Nomia_melanderi:0.2353898934)100:0.1835103701,((Augochlora_pura{Test},Megalopta_genalis{Test})94:0.0126039313,Augochlorella_aurata{Test})100:0.1650692566)100:0.1037560067,Agapostemon_virescens{Test})75:0.0151908692)79:0.0237630570,Halictus_rubicundus{Test})97:0.0147355006,Lasioglossum_leucozonium{Test});

OG0000697 (22 spp) (Habropoda_laboriosa:0.3333781780,((((((((Sphecodes_monilicornis{Test},(Halictus_rubicundus{Test},Lasioglossum_leucozonium{Test})100:0.0472830663)99:0.0113294227,Agapostemon_virescens{Test})100:0.1115246061,((Augochlora_pura{Test},Augochlorella_aurata{Test})100:0.0347544573,Megalopta_genalis{Test})100:0.2136747060)100:0.2727579051,Nomia_melanderi:0.3297610118)100:0.3403229647,Colletes_gigas:0.4621638136)88:0.0451388019,Andrena_dorsata{Test})93:0.1712115457,(Osmia_bicornis:0.2575983879,Megachile_rotundata:0.3065534547)100:0.3238268730)92:0.1314671553,Ceratina_calcarata{Test})91:0.0466296593,(((Apis_mellifera{Test},Eufriesea_mexicana:0.3699447303)74:0.0255258128,((Frieseomelitta_varia{Test},Tetragonula_carbonaria{Test})100:0.1341296737,(Bombus_campestris{Test},Bombus_vancouverensis{Test})100:0.2598948297)100:0.0931200954)98:0.0630305504,(Peponapis_pruinosa:0.4016097663,Tetrapedia_diversipes:0.3560483073)100:0.0651485715)96:0.0511089335);
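Since the number of species varies between gene trees, it can help to tally how many tips carry the {Test} label in each tree before running the analyses. The following is a rough sketch that assumes tip names start with a letter and that the label directly follows the name, as in the trees above. The regular expression, function name, and toy tree are illustrative assumptions, and only tips (not internal branches) are inspected.

```python
import re
from collections import defaultdict

# A tip name follows "(" or ","; internal-node support values follow ")"
# and are therefore skipped. The optional {Label} suffix is captured.
TIP = re.compile(r"(?<=[(,])([A-Za-z][\w.]*)(?:\{(\w+)\})?")


def tips_by_label(newick: str) -> dict:
    """Group tip names by their {Label}; tips without a label go to 'unlabeled'."""
    groups = defaultdict(list)
    for name, label in TIP.findall(newick):
        groups[label or "unlabeled"].append(name)
    return dict(groups)


# Toy tree with hypothetical species names (not one of the trees above).
toy_tree = "((sp_A{Test},sp_B:0.1)90:0.2,(sp_C{Test},sp_D:0.3)80:0.1,sp_E:0.5);"
groups = tips_by_label(toy_tree)
for label, names in groups.items():
    print(label, len(names), names)
```

Running this over every gene tree gives a quick check that each orthogroup still has tips on both sides of the comparison after species were removed.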

Best,

Priscila

spond commented 3 months ago

Dear Priscila,

We have a specialized analysis, BUSTED-PH (and protein based RER), designed to test for convergent evolution. Take a look at https://github.com/veg/hyphy-analyses/tree/master/BUSTED-PH and https://github.com/veg/hyphy-analyses/tree/master/RER

If you end up including paralogs, put all copies from a given species in the {test} or {background} group, as appropriate. Alternatively, you could exclude species with paralogs, as you did in the examples.
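If paralogs are kept, one way to label every copy from a test species consistently is sketched below. It rests on assumptions that are not from this thread: tip names are taken to have the form `species__copyID` (the `__` separator is hypothetical), and the label is appended directly after the tip name, as in the trees posted above. Tips left unlabeled are simply not part of the Test set.

```python
import re

# A tip name follows "(" or "," and runs until the next Newick delimiter.
TIP_NAME = re.compile(r"(?<=[(,])([^(),:;{}\s]+)")


def label_tips(newick: str, test_species: set, sep: str = "__") -> str:
    """Append {Test} to every tip whose species prefix is in `test_species`."""
    def repl(match):
        tip = match.group(1)
        species = tip.split(sep)[0]
        return tip + "{Test}" if species in test_species else tip
    return TIP_NAME.sub(repl, newick)


# Toy tree: hypothetical species spA has two copies, spB has one.
toy_tree = "((spA__copy1:0.1,spA__copy2:0.2):0.05,spB__copy1:0.3);"
labeled = label_tips(toy_tree, {"spA"})
print(labeled)
```

Both copies from `spA` end up in the same group, which is the behaviour recommended above.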

It looks like you should have plenty of power to test for this, given that you have a good number of branches in the test and reference sets.

I am tagging a graduate student in our group, @agselberg, who's been working on these types of analyses. She should be able to help you out (I am going to be AFK for ~2 weeks).

Best, Sergei

pkfsantos commented 3 months ago

Thank you, Sergei.

@agselberg, just to continue the conversation: I used aBSREL instead of BUSTED because I was interested in identifying which specific branches were showing signals of positive selection. In the end, only a few branches (a maximum of 3 out of 27 species) showed convergent signals of positive selection.

I am also combining the results from aBSREL and RELAX with RERconverge, which I believe is a similar approach to what was suggested in the last message.

My main concern (which was actually raised by a reviewer) was about the potential issue of violating HyPhy’s assumptions when excluding species with paralogs, but I now understand that this is not the case.

Please feel free to share any additional comments on the approaches used.

Thank you, Priscila

agselberg commented 3 months ago

Priscila,

I agree with Sergei about the paralogs: you should be fine, because these methods make no specific assumptions about orthology versus paralogy.

I wanted to emphasize one point: if you are testing for signals of positive selection on specific, individual branches, aBSREL should be used. If you are instead testing whether a gene shows convergent signals of positive selection (similar to the RELAX/RER methods), BUSTED-PH is recommended.

Either (or both) methods are fine to run but they will impact how you interpret your results.

Best, Avery.

pkfsantos commented 3 months ago

Perfect! Thank you Avery for the further clarification. Priscila
