traitecoevo / phyndr

Match tip and trait data
Other
9 stars 0 forks source link

Determining which clades are monophyletic #11

Closed eliotmiller closed 9 years ago

eliotmiller commented 9 years ago

Matt and I were talking the other day about how to determine which clades are monophyletic, etc. Obviously that is now inherent in phyndr, but wondering about making the outputs more explicit. For instance, one could do something like the following to see both what is monophyletic, and which taxon/a render a given clade polyphyletic. Note that the case is currently specific to species-genus-family situations, though it's easy to modify it to just be hierarchically nested clades. Feel free to cannibalize or ignore or whatever. Also it currently relies on a single hidden geiger function (get descendants) but that's a pretty easy one to re-write in base R (you almost certainly already have it in this package).

library(geiger)

#this function is currently specific to species-genus-family situations

tree <- sim.bdtree(stop="taxa", n=100)

species <- as.character(tree$tip.label)

genus <- as.character(c(rep("genusA", 25), rep("genusB", 25), 
rep("genusC", 25), rep("genusD", 25)))

family <- as.character(c(rep("family1", 50), rep("family2", 50)))

lookup <- data.frame(species, genus, family, stringsAsFactors=FALSE)

isMonophyletic <- function(tree, lookup.table)
{
genera <- as.character(unique(lookup.table$genus))
families <- as.character(unique(lookup.table$family))

genusResults <- c()
familyResults <- c()

for(i in 1:length(genera))
{
    temp <- as.character(lookup.table$species[lookup.table$genus == genera[i]])
    genusResults[i] <- is.monophyletic(phy=tree, tips=temp)
}

genusResults <- data.frame(genus=genera, monophyletic=genusResults,
    stringsAsFactors=FALSE)

for(i in 1:length(families))
{
    temp <- as.character(lookup.table$species[lookup.table$family == families[i]])
    familyResults[i] <- is.monophyletic(phy=tree, tips=temp)
}

familyResults <- data.frame(family=families, monophyletic=familyResults,
    stringsAsFactors=FALSE)

output <- list("genera" = genusResults, "families" = familyResults)

output
}

rendersPolyphyletic <- function(tree, lookup.table)
{
    monoResults <- isMonophyletic(tree, lookup.table)

genera <- monoResults$genera$genus[monoResults$genera$monophyletic==FALSE]
families <- monoResults$families$family[monoResults$families$monophyletic==FALSE]

if(length(genera) < 1)
{
    genusResults <- "All genera are monophyletic"
    print(genusResults)
}   
else
{
    genusResults <- list()

    for(i in 1:length(genera))
    {
#this will determine which species are in a given genus
        tempSpecies <- as.character(lookup.table$species[lookup.table$genus 
            == genera[i]])
#this will determine what the MRCA of those taxa are
        tempNode <- getMRCA(tree, tempSpecies)
        #this will give you all descendants of that node, identified by node
        tempTips <- geiger:::.get.descendants.of.node(tempNode, tree, tips=TRUE)
        #identify actual species based on node
        tempResults <- tree$tip.label[tempTips]
        #only keep species that are not in the genus in question
   #i.e. species that render the genus polyphyletic
            genusResults[[i]] <- tempResults[!(tempResults %in% tempSpecies)]
        }

    names(genusResults) <- genera
}

if(length(families) < 1)
{
    familyResults <- "All families are monophyletic"
    print(familyResults)
}
else
{
    familyResults <- c()

    for(i in 1:length(families))
    {
        tempSpecies <- as.character(lookup.table$species[lookup.table$family 
            == families[i]])
        tempNode <- getMRCA(tree, tempSpecies)
        tempTips <- geiger:::.get.descendants.of.node(tempNode, tree, tips=TRUE)
        tempResults <- tree$tip.label[tempTips]
        familyResults[[i]] <- tempResults[!(tempResults %in% tempSpecies)]
    }
    names(familyResults) <- families
}

results <- list("genera" = genusResults, "families" = familyResults)

results
}

rendersPolyphyletic(tree, lookup)
richfitz commented 9 years ago

It's early here, but I think that this function is doing what you want (determining which groups make another paraphyletic) and then this section does probably what most of the above does but with all the corner cases needed to make phyndr do the right thing (the pattern of mono- and para-phyletic clades with respect to the trait data is important for working out how to collapse things).

It'd be pretty easy to expose the find_paraphyletic function, but I'm not sure what the use case would be for knowing this generally. Seems more in scope for something like geiger?

eliotmiller commented 9 years ago

Yeah that find_paraphyletic function looks like it's exactly it. I wrote those ones above because Matt mentioned a few weeks back that someone else needed functions to quickly determine which clades were and were not monophyletic. I was trying to think of useful ways to output the answer, and that was what I came up with--determining what renders a clade poly/para-phyletic. The use case would be whatever said third party wanted it for, but Matt would know more there. Seems like something people might want. Some of us were talking about putting out a utility phylo function package, so we could put them in there if the need to expose them here seems questionable. No strong opinions.

mwpennell commented 9 years ago

I guess this is mostly redundant with Issue #13 which I think we are putting on the back burner for the time being. thanks for the code and the comments @eliotmiller may come in handy down the line