nclark-lab / RERconverge

Analysis of convergence between organismal traits and DNA/protein sequences
GNU General Public License v3.0
45 stars 26 forks source link

readTrees() and subsequent getAllResiduals errors #39

Closed lauterbur closed 5 years ago

lauterbur commented 5 years ago

Hi RERconverge team,

I'm running into a couple of errors/issues when I try to run RERconverge with my data.

Issue 1: While my trees are all subsets of one complete tree, none of them actually have all of the species in that overall tree. So when I run readTrees() I get: Error in matchNodesInject(tree1, tree2) : The following species in tree1 do not exist in tree2: [list of species] This is resolved if I give the file a "dummy" gene corresponding to the overall tree, but I'm not sure if that's the right thing to do here or if it means my data set just isn't appropriate. Which brings me to

Issue 2: readtrees() on the file including the "dummy" overall tree tells me "Not enough genes with all species present: master tree has no edge.lengths." It seems I can continue in this state, but when I run RERconverge::getAllResiduals() with the defaults as listed in the vignette, it gives me this error: "Error in cut.default(mml, breaks = breaks) : 'breaks' are not unique" I assume this is something to do with the lack of branch lengths in the master tree?

My tree file is attached here. I'd appreciate any insight! Thanks. allGeneTrees.txt

sorrywm commented 5 years ago

Hi @lauterbur,

Thank you for alerting us to these issues. The current version of RERconverge requires at least 20 gene trees with all species present in order to calculate edge lengths for the full master tree; this is why you get the error even when you use the dummy tree. We've been working on an option to allow the user to specify a master tree rather than having it estimated from the gene trees, and we can let you know when this is available. In the mean time, I believe you could enter the same dummy tree 20 times to get around the RERconverge threshold.

The bigger (and tougher) question is whether your dataset is appropriate for RERconverge. The benefit of using relative rates rather than raw evolutionary rates is that you can correct for genome-wide background rates of evolution when testing for unusual patterns in your foreground branches at specific genes. If the genes in your set are candidate genes, rather than a large set of genes representing genome-wide rates of evolution, the rates used to normalize your branch lengths and get relative rates may be skewed. In this case, you may want to do a simpler test of differences in raw rates between foreground and background lineages. Whether using RERconverge or another program, you may need a "control set" of genes to which to compare your candidate genes.

I hope this is helpful. If we can be of further help, please let us know.

Wynn

yuzhenpeng commented 5 years ago

Dear RERcoverge designer,

I had the same problems as above:the function of readTree is right,while the getAllResiduals give a warnings.

`mamTrees=readTrees("C:/Users/lenovo/Desktop/RERconverge-master/R/mammal36aa.trees") Read 15696 items max is 23 estimating master tree branch lengths from 7848 genes.

mamRERw = getAllResiduals(mamTrees,transform = "sqrt", weighted = T, scale = T) cutoff is set to 8e-06 Error in cut.default(mml, breaks = breaks) : 'breaks' are not unique ` I do not know what happend,could you give me some suggestions. Looking forward to your positive response.

My tree file is attached here. Thanks. @sorrywm @raghavendranpartha @mchikina 11111.txt

sorrywm commented 5 years ago

Hi @yuzhenpeng! I took a look, and I think I know why this is happening; I will describe it below in case you are interested. I will also work to fix this issue, but in the mean time, if you source RERfuncs.R from branch Issue39 and use the following command, I think you should be able to proceed (feel free to post if this does not work, or causes further errors): mamRERw = getAllResiduals(mamTrees,transform = "sqrt", weighted = T, scale = T, plot = F)

Rationale: This error occurs because of an internal function that plots the effects of our weighted regression to correct for the relationship between mean and variance of branch lengths (for more information, see this preprint). The function rounds quantiles of branch lengths so that they are easier to read in the plot, but then this causes an error when rounding causes the quantiles to be the same.

yuzhenpeng commented 5 years ago

Thanks for your kind help. With your suggestion as above, I am able to proceed getAllResiduals() . But now, I met the new question. I will describe it below:

In your example, I get result like this:i = 1, i= 4, i = 6, ... ,until i = 200. QQ图片20190428120813 QQ图片20190428120840

But, in my database, the result is vastly different (totally, there are 7846 genes). It only have three: i = 1, i = 150, i= 3966. I want to know the reason of this difference. QQ图片20190428121855

Thank you very much. @sorrywm

sorrywm commented 5 years ago

Hi @yuzhenpeng! The getAllResiduals function is written to estimate RER for multiple genes at once when possible. It does calculations simultaneously for all genes that share the same set of species in their gene trees, as long as branch lengths are non-NA for these gene trees. In the example in the walk-through, genes 1 - 3 have data for the same set of species, but gene 4 has a different set of species, so its RER are estimated separately. In your dataset, all of your genes have the same set of species; however, gene 150 is unusual in that all the branch lengths are all nan. The test file you sent did not contain all genes, so I can't test gene 3966, but I suspect that it either has a different set of species or all nan branch lengths like gene 150. Hope this is helpful!

nclark-lab commented 5 years ago

@lauterbur @yuzhenpeng Thanks for raising these issues. I hope we were able to address them. If you have further difficulties just post a new issue.

Frank1219 commented 3 years ago

Hi, I got all NA after I used getAllResiduals function, my tree file is like this: ((((Scr: 0.09432, PtrR: 0.25619): 0.53261, (Mku: 0.12212, Bpl: 0.11323): 0.39060): 0.13721, (Llu: 0.52161, Cte: 0.51477): 0.11940): 0.13313, Nge: 0.14091); Is there something error with my tree file ? Could you give me some suggestions ? @nclark-lab