nclark-lab / RERconverge

Analysis of convergence between organismal traits and DNA/protein sequences
GNU General Public License v3.0

tree discordance and multiple-copy genes #59

Open xiaoyezao opened 3 years ago

xiaoyezao commented 3 years ago

Hi developers,

This is a great tool to study convergent evolution. I have several questions:

  1. RERconverge requires that all gene trees have the same topology (the same as the species tree, if I understand correctly). In plants, however, gene trees are very often discordant with the species tree because of incomplete lineage sorting and/or hybridization. How can this discordance be accounted for when using RERconverge?
  2. RERconverge requires that each species has only one copy of a given gene (single-copy genes?). In plants, and I guess in animals too, multiple-copy genes are very common because of gene/genome duplication. How can these genes be used in a RERconverge analysis?

Looking forward to your help!

Karl

nclark-lab commented 3 years ago

Hello Karl,

Thanks for your interest. RERconverge makes the assumption that all gene trees have the same topology because that is the only way to ensure that corresponding branches of 2 genes can be located. This is how the program can use the internal branches. In practice, the user can either remove species that produce trees with lots of topological discord, or just fix the tree topology to be the most well supported species tree. Forcing it to match the species tree may not be harmful because we have found that the method works well even when there are errors introduced in the topology.

A program must be used that allows the user to fix the topology so that the branch lengths can be estimated. The R package ‘phangorn’ has worked well for us, and it is fast.
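As a minimal sketch of what "fixing the topology" can look like with phangorn, the example below estimates branch lengths on a given tree while leaving the topology untouched. The toy alignment, the five-taxon tree, the starting branch lengths of 0.1, and the default nucleotide model are all assumptions made for illustration; they are not the settings RERconverge uses.

```r
library(ape)
library(phangorn)

# Toy DNA alignment (one sequence per species); illustrative data only.
seqs <- c(
  A = "ACGTACGTACGTACGTACGT",
  B = "ACGTACGTACGAACGTACGT",
  C = "ACGTACCTACGTACGTTCGT",
  D = "ACGAACCTACGTACGTTCGA",
  E = "ACGAACCTACGTTCGTTCGA"
)
# Rows = species, columns = sites; row names come from the vector names.
aln_matrix <- do.call(rbind, strsplit(tolower(seqs), ""))
alignment <- phyDat(aln_matrix, type = "DNA")

# Fixed (unrooted) topology, standing in for the species tree.
master <- read.tree(text = "((A,B),C,(D,E));")
# pml() needs starting branch lengths; 0.1 is an arbitrary starting value.
master$edge.length <- rep(0.1, nrow(master$edge))

# Likelihood object on the fixed tree, then optimize branch lengths only:
# optNni = FALSE disables topology rearrangements.
fit0 <- pml(master, data = alignment)
fit <- optim.pml(fit0, optNni = FALSE, optEdge = TRUE,
                 control = pml.control(trace = 0))

gene_tree <- fit$tree  # same topology as 'master', re-estimated branch lengths
```

With real data the alignment would be read from a file with read.phyDat, and the substitution model would be chosen to suit the data.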

Second, it is a limitation that the approach only works for one-to-one orthologs. The reason is similar to above. There needs to be an unambiguous assignment of corresponding branches between the 2 gene trees.
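One simple way to respect this restriction is to keep only gene families with at most one copy per species. The sketch below does that on a hypothetical gene-count table (rows are orthogroups, columns are species); the table, the names, and the `min_species` cutoff are assumptions for illustration, and the sketch drops multi-copy families rather than trying to resolve them.

```r
# Hypothetical gene-count table: rows = orthogroups, columns = species.
counts <- matrix(
  c(1, 1, 1, 1,   # OG1: single copy in every species
    1, 2, 1, 0,   # OG2: duplicated in sp2
    1, 0, 1, 1,   # OG3: single copy, missing from sp2
    3, 1, 1, 1),  # OG4: three copies in sp1
  nrow = 4, byrow = TRUE,
  dimnames = list(c("OG1", "OG2", "OG3", "OG4"),
                  c("sp1", "sp2", "sp3", "sp4"))
)

max_copies <- apply(counts, 1, max)   # largest copy number in any species
n_species  <- rowSums(counts > 0)     # number of species with the gene

min_species <- 3  # hypothetical cutoff on species coverage

# Keep families that are single-copy wherever present and cover enough species.
usable <- max_copies == 1 & n_species >= min_species
usable_ogs <- rownames(counts)[usable]
print(usable_ogs)
```

Here OG2 and OG4 are excluded because at least one species carries more than one copy, which would make the branch correspondence ambiguous.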

I hope this answers your questions well.

Best, -Nathan


Nathan Clark, Associate Professor of Human Genetics, Adjunct Professor of Computational and Systems Biology, University of Utah, nclark@utah.edu, http://nclarklab.org/

sangeet2019 commented 3 years ago

Hi Nathan,

For one of my datasets, I am thinking of trying to fix the discordant gene trees to match the species tree topology. But, I am not sure how it is exactly done using the ‘phangorn’ package. Can you please share some more details on doing this?

nclark-lab commented 3 years ago

Hello,

The phangorn package takes the alignments and the species tree topology as inputs and generates a new tree with branch lengths based on that alignment. It will not fix trees that you’ve already made. The best way forward would be to use the ‘estimatePhangornTreeAll’ function that we provide in the package to estimate trees for all of your genes at the same time using one method. That function has defaults to guide the phangorn function so all you need to provide are the alignments and the species tree topology.

Let us know how it goes.

Also this from the walk through vignette: We now provide tools for users to estimate approximate maximum likelihood trees from nucleotide or amino acid alignments using the pml and optim.pml functions from the phangorn package (Schliep 2011). Users must supply alignments in a format readable by read.phyDat, as well as a master tree in Newick format, representing the tree topology on which branch lengths should be estimated. For more details on how to generate trees from alignments using these tools, see the documentation for the estimatePhangornTree and estimatePhangornTreeAll functions.

Best, -Nathan


sangeet2019 commented 3 years ago

Thanks Nathan.

Unfortunately, I am not able to find the "estimatePhangornTreeAll" function in version 0.1.0 or 0.2.0. I assume it is in the latest version, 0.3.0, but the binary for 0.3.0 does not seem to be posted on the "release" page yet, and for some reason I am not able to install from source. Any chance we can get the binary file for 0.3.0?

sorrywm commented 3 years ago

Dear Sangeet,

I have version 0.1.0, and estimatePhangornTree and estimatePhangornTreeAll are part of this release. They are also in the 'estimateTreeFuncs.R' script on the repo, should you wish to download that script separately. Might there be a typo?

We provide a very brief description of these wrapper functions in the version of the walk-through on the 'AddEstimateTreeFunctions' branch here: https://cdn.rawgit.com/nclark-lab/RERconverge/AddEstimateTreeFunctions/vignettes/FullWalkthroughUTD.html#data-input-requirements-and-formatting You can see more details on these functions via their documentation in R help.

If you are having difficulty with these functions, please post a new issue on github, and someone can attend to it. Thank you for using RERconverge.

Sincerely, Wynn

Mia1349 commented 4 months ago

Dear developers,

Thank you for developing such an amazing tool!

I have a follow-up question about using single-copy genes in the analysis. I think we should use as many genes as possible to obtain more comprehensive results. The orthologues I use are clustered with OrthoFinder and then pruned to single-copy genes, which I use for the RERconverge analysis. For a dataset with more than 100 species, tree-based pruning can significantly decrease the number of taxa in each gene. Should we include genes with fewer than half of the taxa present? Would that cause any problems? What other ways would you suggest to obtain input genes for a RERconverge analysis?

Best, Mia

nclark-lab commented 3 months ago

Hello Mia, You should include as many genes as possible. It is fine to only have a subset of the total species and RERconverge was written to handle this. Just pay attention to the minimum species (min.sp) and minimum positive/foreground species (min.pos) parameters. As in the call below, the defaults are set to 10 and 2, respectively.

corMarine=correlateWithBinaryPhenotype(mamRERw, phenvMarine, min.sp=10, min.pos=2, weighted="auto")
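To get a feel for how these two thresholds interact with species coverage, the sketch below counts, for each gene, the species present and how many of them are foreground, then flags the genes that clear both cutoffs. The gene names, species names, and foreground set are made up, and this only mimics the idea behind min.sp and min.pos; RERconverge applies the thresholds internally, and its exact counting may differ.

```r
# Hypothetical species sets per gene (e.g. the tip labels of each gene tree).
gene_species <- list(
  geneA = paste0("sp", 1:12),
  geneB = paste0("sp", 1:8),
  geneC = paste0("sp", 3:14)
)

# Hypothetical foreground (positive) species.
foreground <- c("sp1", "sp2", "sp13")

min.sp  <- 10  # minimum number of species with the gene
min.pos <- 2   # minimum number of foreground species with the gene

n_sp  <- lengths(gene_species)
n_pos <- vapply(gene_species,
                function(sp) sum(sp %in% foreground),
                integer(1))

# A gene is analyzable in this sketch only if it clears both thresholds.
keep <- n_sp >= min.sp & n_pos >= min.pos
print(data.frame(n_sp, n_pos, keep))
```

In this toy set, geneB fails on total species and geneC fails on foreground species, so a gene with good overall coverage can still be skipped if it lacks foreground taxa.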

Best of luck.

Mia1349 commented 3 months ago

Thank you for your response! I have another question, about how to do enrichment analysis on the resulting genes. Each gene contains multiple taxa, so how should I select an appropriate background for GO enrichment analysis? Since I got the genes from OrthoFinder clustering, the background I am using now is the set of OGs (annotated with emapper) that contain at least one foreground-branch taxon, and I use all the genes in each OG, with their GO terms, as the background. I get some results with this method, but I do not know how much I can trust them. Do you have any other suggestions on how to conduct enrichment analysis on the genes resulting from RERconverge?

All the best, Mia

SWei2333 commented 1 week ago

Hi Mia, I have some questions about using the single-copy gene set obtained from OrthoFinder for RERconverge. Could you please help me? Besides the single-copy orthologous genes, the OrthoFinder result files also contain many multi-copy genes and single-copy genes that are not present in all species. To obtain more usable genes in a reasonable way, should I convert multi-copy genes into single-copy genes, and at the same time lower the species-number requirement for single-copy genes that are not present in all species, when conducting the RERconverge analysis?