tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
171 stars 83 forks source link

Impute times for input pedigrees #1916

Open jeromekelleher opened 2 years ago

jeromekelleher commented 2 years ago

It would be very useful to be able to support pedigrees that have either no time associated, or partial time values. We avoided doing this in the initial implementation as it's not totally obvious how to do this well in general, and so it's better to leave the functionality out rather than have to make breaking changes later. There's also the problem of rigorously testing it, which takes time to do well.

The prototype pedigree interface had an implementation of this here which looked like this:

    def get_times(individual, parent_IDs=None, parents=None, check=False):
        """
        For pedigrees without specified times, crudely assigns times to
        all individuals.
        """
        if parents is None and parent_IDs is None:
            raise ValueError("Must specify either parent IDs or parent indices")

        if parents is None:
            parents = Pedigree.parent_ID_to_index(individual, parent_IDs)

        time = np.zeros(len(individual))
        all_indices = range(len(individual))
        proband_indices = set(all_indices).difference(parents.ravel())
        climber_indices = proband_indices

        t = 0
        while len(climber_indices) > 0:
            next_climbers = []
            for c_idx in climber_indices:
                if time[c_idx] < t:
                    time[c_idx] = t

                next_parents = [p for p in parents[c_idx] if p >= 0]
                next_climbers.extend(next_parents)

            climber_indices = list(set(next_climbers))
            t += 1

        if check:
            Pedigree.check_times(individual, parents, time)

        return time

@LukeAndersonTrocme you said you have an R script for doing something similar in https://github.com/tskit-dev/msprime/issues/1915#issuecomment-968361068, would you mind posting here please?

LukeAndersonTrocme commented 2 years ago
# Get the deepest generation for each internal node
maximum_genealogical_depth <- function(pedigree, list_of_probands) {
  # probands are individuals at generation 0
  individuals <- data.frame(ind = list_of_probands, generation = 0)
  # initialize output to append to downstream
  gen_depth <- individuals     
  # for loop to iterate through generations
  for(k in 1:18) {             

    individuals_parents <- 
      # start from the entire pedigree
      pedigree %>%
      # keep parents of individuals at generation 'k'
      filter(ind %in% individuals$ind) %>%
      # if parent is missing, flag individual as founder
      mutate(founder = ifelse(is.na(father) | is.na(mother),T, F))

    # stop climbing when there are only founders left
    if(nrow(filter(individuals_parents, founder == F)) == 0){break}

    # to ascend genealogy, parents become reference individuals
    mothers <- unique(individuals_parents$mother)
    fathers <- unique(individuals_parents$father)
    # recursion step reassigns reference individuals
    individuals <- 
      data.frame(ind = c(mothers, fathers), generation = k) %>%
      filter(!is.na(ind)) # removes missing IDs

    # output max depth per individual
    gen_depth <- 
      # individuals already visited by climbing algorithm
      gen_depth %>%
      # remove duplicate individuals at 'k-1'
      filter(!ind %in% individuals$ind) %>%
      # append updated individuals 'k' generations deep
      bind_rows(individuals)                                                
  }
  return(gen_depth)
}
LukeAndersonTrocme commented 2 years ago

I'll note here that I knew in advance that the French-Canadian genealogy was less than 18 generations hence the for loop, but this could be done with a while loop for a more general solution