lineagekit / lineagekit

A python library with genealogy methods
0 stars 0 forks source link

Kinship calculation for probands #2

Open sgravel opened 1 month ago

sgravel commented 1 month ago

sgkit has a kinship calculator here: https://github.com/pystatgen/sgkit/blob/main/sgkit/stats/pedigree.py

This is using the formula from page 762 in Lynch and Walsh, where: we traverse the graph one individual at a time such that parents are traversed before offspring. We then compute the kinship of each individual j to everyone already traversed using the kinship of parents of j.

This works fine, but in large pedigrees requires too much memory to remember the full kinship matrix, especially if we only need the kinship among probands.

Proposed optimization 1: reduce the accuracy of the matrix.

Optimization 2: do not keep track of intermediate results. To do this, note that parent kinships are only used when their children are processed. So we can discard the parent kinship information after their children are processed:

1- identify probands. 2- preprocess the graph to figure out how many offspring each parent has in the ascending pedigree. 3- delete all entries corresponding to a parent after all the offsping to that parent have been computed.
4- This will work especially well if we find a good topological order such that children tend to be processed as soon as possible after their parents, so that we can delete the parent values.

Some notes: this will require a fair bit of extra work (deleting rows and columns). A benefit is that we will be much more memory efficiet, and also that we will have way fewer updates to compute. In terms of data structures, I'm assuming that this means that a np.array will not be most efficient. A dict with ancestors as keys would work fine, but maybe a bit slower.

sgravel commented 1 month ago

A proposal of topological ordering. The idea is to quickly process individuals with few offspring, so that we can delete them faster. In particular, we will tend process individuals with a single offspring rapidly. sorted_stack_k will be a sorted stack of individuals who have k parents left to process (k=0,1,2) sort founders by ascending number of offspring. put in sorted_stack_0; while not done{ pop first element in sorted_stack_0 bring their offspring from sorted_stack_k to sortedstack{k-1}. }

We could probably do better by processing parents as pairs, and there may be better algos.

sgravel commented 1 month ago

also explore other existing algos such as Lange's

sgravel commented 1 month ago

Lange's algorithm appears to be the same: https://cran.r-project.org/web/packages/kinship2/vignettes/kinship_code_details.html

Serdiuk-Andrii commented 1 month ago

Recent paper