theislab / cellrank

CellRank: dynamics from multi-view single-cell data
https://cellrank.org
BSD 3-Clause "New" or "Revised" License
347 stars 46 forks source link

Threshold WOT transition matrix #686

Closed WeilerP closed 3 years ago

WeilerP commented 3 years ago

Description

The WOT transition matrix is only block sparse slowing things down e.g. the Schur decomposition in GPCCA. It seems that many transition probabilities are pretty small s.t. we could optionally set all probabilities below a certain threshold to zero.

Marius1311 commented 3 years ago

We should check (and quantify) in one application how much results are affected by this. I.e compare the spectrum and the macrostates.

michalk8 commented 3 years ago

@WeilerP could you please finish this be looking at what Marius suggested? I've added the implementation in #696. Also, when all 0 row happens because of thresholding, the values are set to uniform, i.e.

[[0.3, 0.3]
 [1, 0]]
threshold = 1 ->
[[1/2, 1/2],
 [1, 0]]

Think another alternative is to set these all 0 indices to be self-transition, but imho this seems more "natural"

Marius1311 commented 3 years ago

Do you re-normalize after thresholding?

WeilerP commented 3 years ago

Do you re-normalize after thresholding?

Yes, the transition matrix is renormalized.

I tested @michalk8's implementation on the Schiebinger reprogramming data.

In a nutshell

Results

Computation time

Schur decomposition Computing macrostates Computing absorption probabilities
No threshold 57 sec 23 sec 60 sec
Threshold 1e-4 6 sec 7 sec 6 sec
Threshold 1e-2 1 sec 5 sec 1 sec

Absorption probabilities

No thresholding

absorp_prob_no_threshold

Threshold 1e-4

absorp_prob_threshold_1e-4

Threshold 1e-2

absorp_prob_threshold_1e-2

Simulated random walks

No thresholding

rw_no_threshold

Threshold 1e-4

rw_threshold_1e-4

Threshold 1e-2

rw_threshold_1e-2

Stream embedding

No thresholding

stream_no_threshold

Threshold 1e-4

stream_threshold_1e-4

Threshold 1e-2

stream_threshold_1e-2
Marius1311 commented 3 years ago

Looks good from my point of view. Two things come to mind:

michalk8 commented 3 years ago

thresholding based on percentile? (which percentiles do your threshold correspond to?)

Was wondering, whether percentile thresholding isn't more convenient then just fixed number - might be better to specify e.g. we want to zero-out lowest 5% of the values. If yes, I will make the changes, otherwise the PR is ready to be merged (once the effect has been quantified and looks good).

Marius1311 commented 3 years ago

Hi @michalk8, I was also thinking about this. Let's wait for @WeilerP, we talked about this and he'll look into it.

WeilerP commented 3 years ago

To check the number of non-zero elements, the percentiles and the correlation between the absorption probabilities, I ran the WOT pipeline on the Schiebinger reprogramming data using [0. , 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.01 ] as different thresholds.

Non-zero entries and percentile

The left plot displays the decrease in average non-zero transition probabilities, the right plot the percentile used. We use 1 - percentile data points in average. Both curves increase fast indicating that a lot of non-zero transition probabilities are small. Using threshold=0.01, the mean number of non-zero entries dropped to about 19. No thresholding includes approx 1192 non-zero entries (in average).

nnz_percentile

Correlation

The plots show the correlation between the absorption probabilities towards the four terminal states with and without a threshold. The correlation only drops marginally when increasing the threshold to 0.01, IMO.

abs_prob_correlation

Remarks

I personally believe that we should either specify the number of non-zero transition probabilities or the percentile. Otherwise, the user will have to check manually the ballpark of reasonable thresholds. Specifying the non-zero transition probabilities would mimic setting the number of neighbors for kNN based kernels.

Q1: Should we symmetrize the transition matrix after thresholding similar to transition matrices using kNN based kernels? Q2: When specifying a threshold/number of non-zero entries/percentile, should we readjust the threshold s.t. no zero rows appear? This goes into the direction of @michalk8's question how to handle these cases. ATM, we use a uniform distribution.

michalk8 commented 3 years ago

Q1: Should we symmetrize the transition matrix after thresholding similar to transition matrices using kNN based kernels?

Wouldn't symmetrizing mean that we allow connections to past cells?

Q2: When specifying a threshold/number of non-zero entries/percentile, should we readjust the threshold s.t. no zero rows appear? This goes into the direction of @michalk8's question how to handle these cases. ATM, we use a uniform distribution.

In principle, we could introduce threshold = 'auto' so that no such row appears. In any case, I'd always go for a percentile, since for specifying #nnz you'd have to have intuition/estimate of how many of them are there initially.

WeilerP commented 3 years ago

Q1: Should we symmetrize the transition matrix after thresholding similar to transition matrices using kNN based kernels?

Wouldn't symmetrizing mean that we allow connections to past cells?

Sorry, forget I ever said anything. I meant symmetrizing within a time point, i.e. each block on the diagonal. But it doesn't make any sense. A kNN graph doesn't contain nay directionality but the transition matrix does.

Q2: When specifying a threshold/number of non-zero entries/percentile, should we readjust the threshold s.t. no zero rows appear? This goes into the direction of @michalk8's question how to handle these cases. ATM, we use a uniform distribution.

In principle, we could introduce threshold = 'auto' so that no such row appears. In any case, I'd always go for a percentile, since for specifying #nnz you'd have to have intuition/estimate of how many of them are there initially.

Hm, specifying the number of non-zero transition probabilities would be fine, I think. I thought the user would only provide "reasonable" number, e.g. 30 or so, i.e. much smaller than the number of potential next steps.

michalk8 commented 3 years ago

close via #696