nclark-lab / RERconverge

Analysis of convergence between organismal traits and DNA/protein sequences
GNU General Public License v3.0
44 stars 26 forks source link

can i provide a neutral tree with branch length as RERconverge master tree #94

Open shuifeng1988 opened 3 months ago

shuifeng1988 commented 3 months ago

Hi, I used RERconverge to run more than 1 million genes (non-coding regions), but it runs too slowly. so I want to split the gene file (with branch length estimated by paml baseml) to multiple files, thus I can run parallel. For doing this, each gene file must use the same master tree and identical cutoff. I have two master trees, one is estimated by RERconverge, it reads more than 8000 genes with full species and returns a master tree, another is estimated by Phylofit, the branch lengthes estimated by two methods are different huge. However, I found the results are identical, I don't know why, can RERconverge use the master tree provided by user?

Below, is the scripts i used: library(RERconverge) args <- commandArgs(trailingOnly = TRUE) input <- args[1] ##one of the split tree files with branch lengthes estimated by paml baseml

tree1 <- read.tree("../all.mod.tree") ##master tree length estimated by phyloFit treerooted1 <- root(tree1,c("monDom5","sarHar1","HLphaCin1","Echidan"))

tree2 <- read.tree("../RERconverge_masterTree.nwk") #master tree estimated by RERconverge, i extract all genes with full species as a file, and use RERconverge to estimate the master tree branch lengthes treerooted2 <- root(tree2,c("monDom5","sarHar1","HLphaCin1","Echidan"))

fg=c("Vulpes_ferrilata","Equus_kiang","panHod1","Bos_mutus","Ochotona_curzoniae","Marmota_himalayana","Myospalax_baileyi","Lasiopodomys_fuscus","rhiBie1")

the master tree was calculated by RERconverge itself

toyTrees <- readTrees(input,minSpecs=51,masterTree=treerooted2) #I found the RERconverge masterTree just use the topology of treerooted2, but not use the branch length in treerooted2; toyTrees$masterTree <- treerooted2 ## i used the master tree estimated by RERconverge previously to replace the master tree, and i found the replace is sucessful, the masterTree edge.length is difference before and after replace mamRERw = getAllResiduals(toyTrees,transform = "sqrt", weighted = T, scale = T, cutoff =0.00002) multirers = returnRersAsTreesAll(toyTrees,mamRERw) write.tree(multirers, file='toyRERs.nwk', tree.names=TRUE) phenv=foreground2Paths(fg, toyTrees, clade="terminal") cor=correlateWithBinaryPhenotype(mamRERw, phenv, min.sp=51, min.pos=4,weighted="auto") write.table(cor,file="results.txt",quote=F,row.names=T)

the master tree was calculated by Phylofit (all.chr.mod)

toyTrees2 <- readTrees(input,minSpecs=51,masterTree=treerooted1)
toyTrees2$masterTree <- treerooted1 ## i used the master tree estimated by phyloFit previously to replace the master tree mamRERw_all.mod = getAllResiduals(toyTrees2,transform = "sqrt", weighted = T, scale = T, cutoff =0.00002) saveRDS(mamRERw_all.mod, file="mamRERw_all.mod.rds") multirers2 = returnRersAsTreesAll(toyTrees2,mamRERw_all.mod) write.tree(multirers2, file='toyRERs_all.mod.nwk', tree.names=TRUE) phenv2=foreground2Paths(fg, toyTrees2, clade="terminal") cor2=correlateWithBinaryPhenotype(mamRERw_all.mod, phenv2, min.sp=51, min.pos=4,weighted="auto") write.table(cor2,file="results_all.mod.txt",quote=F,row.names=T)

nclark-lab commented 3 months ago

Hello, we have also used a similar strategy to split about 1 million trees into several smaller groups. Generally, it works well, and the master trees generated for each group are highly similar. In general, your strategy is a good one. The reason that you do not see a difference is because getAllResiduals recalculates the average branch lengths for each unique tree (each with different species set), so the master tree branch lengths will not directly affect the RERs that are produced. I hope this helps. We have often split about 1 million non-coding regions into 20 different datasets, and it worked identically to a similar run with fewer, larger groups.

shuifeng1988 commented 3 months ago

Thanks a lot, I got it and i will follow your strategy this time. But is it possible to update the getAllResiduals module to allow the user closes the recalculate function and use a special branch length? It is sometimes very useful. For example, i already know the evolutionary distance of my species and i just need to test the convergence of a few genes, or even a single gene.