junseonghwan / PhylEx


Handle large number of SNVs. #5

Closed junseonghwan closed 4 years ago

junseonghwan commented 4 years ago

Currently, re-assigning any SNV triggers recomputation of the single cell cache.

junseonghwan commented 4 years ago

Keep track of two separate vectors of SNVs, one with single cell coverage and one without.

If new nodes are created as a result of assigning SNVs without single cell coverage, the single cell likelihood now has a new node to consider for assignment. Therefore, skipping the cache update when re-assigning these SNVs can leave the single cell likelihood stale.

However, the cache will be correctly updated the next time an SNV with single cell coverage is re-assigned.
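
A minimal sketch of the bookkeeping described above, not the PhylEx code. The names `Snv`, `cells_covering`, `SnvPartition` and `CacheState` are hypothetical, and "recomputation" is modelled as a counter only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical SNV record; the field names are illustrative only.
struct Snv {
    int id;
    std::size_t cells_covering;  // number of cells with reads at this locus
};

struct SnvPartition {
    std::vector<Snv> with_sc_coverage;
    std::vector<Snv> without_sc_coverage;
};

// Split the SNVs once, up front, into the two vectors.
SnvPartition partition_by_coverage(const std::vector<Snv>& snvs) {
    SnvPartition out;
    for (const Snv& s : snvs) {
        if (s.cells_covering > 0) {
            out.with_sc_coverage.push_back(s);
        } else {
            out.without_sc_coverage.push_back(s);
        }
    }
    assert(out.with_sc_coverage.size() + out.without_sc_coverage.size() == snvs.size());
    return out;
}

// Tracks whether the single cell cache is out of date.
struct CacheState {
    bool stale = false;
    int recomputations = 0;

    // Call after re-assigning one SNV. tree_changed is true when the move
    // created (or removed) a node.
    void on_reassign(bool has_sc_coverage, bool tree_changed) {
        if (tree_changed) stale = true;   // a new node invalidates the cache
        if (has_sc_coverage) {            // only covered SNVs pay for a refresh
            ++recomputations;
            stale = false;
        }
    }
};
```

With dropout near 0.99 most SNVs land in `without_sc_coverage`, so most re-assignments skip the refresh; the `stale` flag makes the temporary inconsistency explicit until the next covered SNV is moved.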

junseonghwan commented 4 years ago

Run a simulated experiment with 1000 somatic SNVs and dropout equal to 0.99 (so the majority of the somatic SNVs will not have single cell coverage).

junseonghwan commented 4 years ago

Single cell likelihood calculation: need a fast way to check for presence/absence of mutation s at node v.
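
One possible way to get a constant-time check, sketched under assumptions: SNVs are indexed densely from 0, and a node carries every SNV assigned to it or to one of its ancestors. The `Node` type and its members are hypothetical, not PhylEx's.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tree node holding a dense presence flag per SNV.
struct Node {
    const Node* parent = nullptr;
    std::vector<char> has_snv;  // has_snv[s] == 1 if SNV s is present here

    // A child starts from a snapshot of its parent's flags. Later changes
    // to the parent are NOT propagated in this sketch.
    explicit Node(std::size_t n_snvs, const Node* parent_ = nullptr)
        : parent(parent_), has_snv(n_snvs, 0) {
        if (parent != nullptr) {
            assert(parent->has_snv.size() == n_snvs);
            has_snv = parent->has_snv;
        }
    }

    // Record that SNV s is assigned to this node.
    void assign(std::size_t s) {
        assert(s < has_snv.size());
        has_snv[s] = 1;
    }

    // O(1) presence check: one indexed load, no hashing and no tree walk.
    bool carries(std::size_t s) const {
        assert(s < has_snv.size());
        return has_snv[s] != 0;
    }
};
```

The trade-off is memory (one byte per SNV per node) and the need to refresh descendants when an ancestor's assignments change.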

junseonghwan commented 4 years ago

It turns out that unordered_map lookups are O(1) only on average (worst case is linear), and hashing plus scattered memory access makes them slow in practice. This becomes a performance bottleneck when storing log-likelihood values for each cell at each node.

We can move the single cell cache into each node: each node stores a vector of doubles, one per cell.
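
A minimal sketch of a per-node cache, assuming cells are indexed 0..n_cells-1. `CachedNode` and its methods are hypothetical names, not the PhylEx API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node that owns its own single cell cache.
struct CachedNode {
    std::vector<double> cell_loglik;  // one log-likelihood per cell

    explicit CachedNode(std::size_t n_cells) : cell_loglik(n_cells, 0.0) {}

    // Add a log-likelihood contribution for one cell.
    void add(std::size_t cell, double delta) {
        assert(cell < cell_loglik.size());
        cell_loglik[cell] += delta;
    }

    // Plain indexed read; contiguous storage, no hashing.
    double get(std::size_t cell) const {
        assert(cell < cell_loglik.size());
        return cell_loglik[cell];
    }
};
```

Compared with a map keyed by (node, cell), the lookup is a single array index and the cache is dropped automatically when the node is destroyed.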

junseonghwan commented 4 years ago

Accessing the Eigen matrix is taking quite a bit of time. Pre-compute and store the single cell likelihood as a vector of vectors.
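
A sketch of the pre-computation, assuming the table is indexed by cell and then by SNV (the thread does not state the layout). `precompute` and the `loglik` callback are hypothetical; the callback stands in for whatever currently reads from the Eigen matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Table = std::vector<std::vector<double>>;

// Evaluate loglik(cell, snv) once per pair and store the result, so the
// sampler's inner loop only does table[cell][snv] reads afterwards.
Table precompute(std::size_t n_cells, std::size_t n_snvs,
                 const std::function<double(std::size_t, std::size_t)>& loglik) {
    assert(static_cast<bool>(loglik));  // callback must be set
    Table table(n_cells, std::vector<double>(n_snvs));
    for (std::size_t c = 0; c < n_cells; ++c) {
        for (std::size_t s = 0; s < n_snvs; ++s) {
            table[c][s] = loglik(c, s);
        }
    }
    return table;
}
```

This only pays off if the per-pair values do not depend on the current tree, so the table can be filled once before sampling starts.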