lineagekit / lineagekit

A python library with genealogy methods
0 stars 0 forks source link

Kinship calculation for probands, possible improvements #14

Open Serdiuk-Andrii opened 2 weeks ago

Serdiuk-Andrii commented 2 weeks ago

There are multiple ways of improving the current version of the proband kinship calculation algorithm.

1. Parallelization When we add a vertex to the cut, we need to calculate its kinship with everybody else in the cut to maintain the invariant of having an all-pairwise kinship among the vertices in the cut. This can potentially be seriously sped up using parallelization. For example, we can evenly distribute the vertices in the cut among all the worker threads, and each worker thread will calculate the kinships for its portion of the vertices.

The only potential problem that we might face here is blocking. This is highly dependent on the strategy that we use for storing the kinship values. Since we are trying to minimize the memory requirements as much as possible, for two vertices u and v in the cut, only one of the two pairs ((u, v) or (v, u)) is valid. In other words, the kinship is stored in the u's hashmap with v being the key or vice versa.

Let's consider the scenario when we add the vertex x to the cut. We might decide to store the resulting kinships in all the other hashtables. In other words, the x's hashtable will only contain the x's self-kinship. In this case, every hashtable will only be written once by one of the threads. To avoid blocking hashtables for reading, we can create two (or fewer) dictionaries containing the x's parents' kinships that will be used only for look-ups. This way, there should be no blocking necessary, as we will have no race conditions.

One of the problems with this approach is hashmap leveling, which is mentioned below.

Notice that the parallel hashmap should be efficiently parallelizable even with some blocking, as the chance of blocking is lower when the number of sub-hash maps is bigger

2. Hashmap leveling Hashmaps mostly use arrays as the data structure under the hood, which tend to double when the number of elements is too large. We can distribute the values in such a way that minimizes the number of resizing operations. For example, the following approach can be used:

When inserting the (u, v) kinship: — If only one of the hashtables needs to resize after the insertion, save the kinship in the other hashtable — Otherwise, choose the one with the smaller capacity

3. Different data structures We can also consider using different hashtable implementations that don't use a resizable array as the data structure under the hood. There are also approaches for storing sparse matrices in a different format (not as a hashtable of hashtables), but they are usually much slower

4. Vertical cuts The current version of the algorithm uses the idea of horizontal cuts, which allows us to reduce the memory requirements of the program. Indeed, we can always remove the parents' kinship once we know that their children have been added to the cut. One can easily prove by contradiction that removing such parents will not break the cut (in other words, it will still separate every path from a source to a sink)

We can also try to implement the horizontal cuts! A horizontal cut can be defined simply as a set of cut vertices (i.e., a set of vertices whose removal increases the number of connected components in the graph). The main difference here is that a horizontal cut is not guaranteed to separate every path from a source to a sink.

Having found a horizontal cut, we can reduce the number of kinship calculations. In the figure, [x] forms a vertical cut separating the sets A and B. To calculate k(p1, p2), it is sufficient to know k(x, p1) and k(x, p2). Therefore, we can only calculate the "inner" kinships in A and B and use this information to calculate k(p1, p2)

image

5. Equivalence class simplification One can group all the full siblings into an equivalence class and replace the whole class with just one vertex from this class, which will behave as the representative of this class (this is a fundamental idea in mathematics). As the vertex's kinship with another vertex depends on this vertex's parents' kinship, all the vertices from the equivalence class will have the same kinship with someone not from their descending genealogy

Unfortunately, the same property does not hold for the vertices in the descending genealogy of the equivalence class. The representative of the class and a vertex from this equivalence class have different levels of contribution to an individual in the descending genealogy. In this case, k(r, x) != k(v, x) where r is the representative of a class, x is a vertex from the same class, and x is a vertex from the descending genealogy of this class

We could try storing the "outer" kinships only for the representative of this class, but store all the kinships for the descending genealogy. This idea also resonates with the "vertical cuts" approach, as, for every vertex, we are dividing the whole pedigree into two parts: the descending genealogy and everybody else

6. Implementing the approximating version of the algorithm In the Brent Kirkpatrick paper, there is also an approximate version of the algorithm, which is said to be linear

7. Optimizing the cut size with different cutting strategies We can use the max-flow algorithm for calculating the min-cut of the graph. Then, we can remove the obtained cut, and rerun the algorithm to obtain the series of minimal cuts. This can potentially reduce the maximum cut size

8. Using GPU

Serdiuk-Andrii commented 2 weeks ago

Interesting paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4130148/pdf/CMMM2014-898424.pdf