Feature request: least common ancestor for equally scoring haplogroups

stephenturner commented 4 years ago

The feature released in 497cf6c13e23a7802eee5c3a96d07b855bff3119 in response to #12 allows for the export of the best n haplogroups.

When running with microarray data using the --chip flag, often there are many equally top-ranked haplogroups with the same score, due to the variants on the array being too low-resolution to distinguish between different leaves on the phylogeny.

We have implemented a post-hoc LCA approach where we first parse Phylotree to JSON, then traverse the tree given multiple haplogroups to find the LCA haplogroup node for all the supplied haplogroups.

Having this functionality natively implemented could be useful. I could imagine this option (e.g. --lca) could be exclusive with the --hits option.

Feel free to close this issue if this feature request is beyond the scope of haplogrep itself. Thanks!

seppinho commented 4 years ago

Hi Stephen, sorry for my late response. Makes sense to implement this directly in Haplogrep. I already talked to Pete at ASHG about this feature. For haplocheck we already implemented a method to find the LCA of two haplogroups (major and minor). Nevertheless, If you have a code snippet ready, would be great to have a look at.

stephenturner commented 4 years ago

We've been doing this externally using some R code to traverse phylotree, which was put into a JSON structure using some python code Pete found. We'll try to clean this up and provide an example. cc @vpnagraj

vpnagraj commented 4 years ago

@seppinho thanks for circling back to this!

yeah we put together a method for calculating mito haplogroup lca in R.

it's a bit involved:

## script to define functions for least common ancestor based on haplogroup phylogeny

## helper functions for recursively parsing list
## https://stackoverflow.com/a/47078913

sub_list <- function(x, split = "\\.", comparison) {
  sl <- stringr::str_split(x, split, simplify = TRUE)
  sl <- sl[sl %in% comparison]
  paste(sl, collapse = ".")
}

access_children <- function(x, search) {
  # extracting the nth node (search) from the nodeset x
  lapply(search, xml2::xml_child, x = x)
}

hierarchy_builder <- function(nodes) {
  # if we reach the leaf level of any of the node sets,
  # just return an empty string
  if (length(nodes) == 0L) {
    return("")
  }
  # extract the names of each of the current top level nodes
  names(nodes) <- sapply(nodes, xml2::xml_attr, attr = 'Id')
  # count the children each of the current top level node has, make a sequence
  seq_ix <- lapply(nodes, function(node) {
    seq(xml2::xml_children(node))
  })
  # make a list of individual child nodes under each of the current top level
  # nodes, while preserving the hierarchy
  children <- mapply(access_children, x = nodes, search = seq_ix, SIMPLIFY = FALSE)
  # recurse on the current node's children
  return(lapply(children, hierarchy_builder))
}

## function to parse phylogeny
## note: uses helpers defined above
parse_phylogeny <- function(input_file, type = "json", haplogroups) {
  if(type == "json") {
    json_data <- jsonlite::read_json(input_file)
    tree_with_dots <- names(rapply(json_data, length))
    tree_with_dots <- sapply(tree_with_dots,
                             function(x) sub_list(x, comparison = haplogroups),
                             USE.NAMES = FALSE)
    tree_with_dots <- unique(tree_with_dots)
    tree_with_dots <- sort(tree_with_dots)
  } else if (type == "xml") {
    xml_data <- xml2::read_xml(input_file)
    l <- hierarchy_builder(list(xml_data))
    ## recursively apply to get name of each list item
    ## name should inherit ancestors with dots in between haplogroup names
    tree_with_dots <- names(rapply(l,length))
    ## clean up the above
    ## note this removes any 'NoLabel' positions in the hierarchy
    tree_with_dots <- sapply(tree_with_dots,
                             function(x) gsub("mtDNAPhylogeny\\.|NoLabel\\.", "", x),
                             USE.NAMES = FALSE)
    tree_with_dots <- sort(tree_with_dots)
  }
  res_list <- list()
  ## iterate over this tree vector
  ## find distance of every leaf to each of its ancestors
  ## store results as a list of data frames
  for(i in 1:length(tree_with_dots)) {
    ## split the hierarchy on the . (default separator for names in list)
    hier_split <- unlist(strsplit(tree_with_dots[i], "\\."))
    ## calculate distance between the leaf and all ancestors
    hier_dist <- length(hier_split) - 1:length(hier_split)
    ## bully this into a data.frame ...
    dat <- as.data.frame(t(data.frame(hier_dist, row.names = hier_split)), row.names = FALSE)
    ## store in list
    res_list[[i]] <- dat
    ## make sure list is named as the leaf at that element
    names(res_list)[i] <- dplyr::last(hier_split)
    ## onto the next one in the loop ...
  }
  return(res_list)
}

## least common ancestor function
lca <- function(phylogeny, haplogroups) {
  ## subset the parsed phylogeny list
  reduced_phylogeny <- phylogeny[unlist(names(phylogeny) %in% haplogroups)]
  ## combine the above as a matrix or distances
  distances <- do.call(dplyr::bind_rows, reduced_phylogeny)
  ## find the common ancestor with the minimum distance
  ret <- names(distances)[which.min(colSums(distances))]
  ## if no common ancestor found and only one haplogroup provided use that haplogroup
  if(length(ret) == 0 && length(haplogroups) == 1) {
    return(haplogroups)
    ## if no common ancestor found and more than one haplogroup return NA
  } else if (length(ret) == 0 && length(haplogroups) > 1) {
    return(NA)
    ## otherwise return the least common ancestor
  } else {
    return(ret)
  }
}

you'll need a text file with all possible haplogroups:

all-possible-haplogroups.txt

and a copy of the phylotree data in json format (not attached but there are parsers out there: https://github.com/munky69rock/phylotree-parser-python).

i guess i should also note that i did see that you all have an xml version of the tree. some of the code above tries to parse an xml input, but i didn't have reliable success with that and moved forward with json.

anyways, with the above you can run the lca:

library(tidyverse)

all_haplogroups <- read_lines("all-possible-haplogroups.txt")

## need to get phylotree data as json
## use something like this: https://github.com/munky69rock/phylotree-parser-python
pyhl_json <- "/path/to/phyl.json"

phyl <- parse_phylogeny(input_file = phyl_json,
                             type = "json",
                             haplogroups = all_haplogroups)

## contrived example
## the least common ancestor for L1b1a3a1 and L1b1a1?
lca(phylogeny = phyl,
    haplogroups = c("L1b1a3a1", "L1b1a1"))

hope this gives you an idea of what we have in mind. please let us know if there's anything else you need on our end.

NNCCMM commented 4 years ago

Dear Dr seppinho :

I try to use fantastic tool-Haplogrep2 from your GitHub and I have some questions .

On the graphical Haplogrep2 web tool, I have seen “Errors and warnings” option, how can I use command-line to output it?
what does the quality score mean for each sample , any cutoff score to remove low quality haplogroup sample？

Thanks !

Best regards,

seppinho / haplogrep-cmd

Feature request: least common ancestor for equally scoring haplogroups #31