theislab / scib

Benchmarking analysis of data integration tools
MIT License
294 stars 63 forks source link

Refactor metrics #265

Closed mumichae closed 3 years ago

mumichae commented 3 years ago

Clean up LISI and kBET metrics.

LuckyMD commented 3 years ago

Did we really want to split cLISI and iLISI? These are computed in the same function, no?

mbuttner commented 3 years ago

Did we really want to split cLISI and iLISI? These are computed in the same function, no?

I wouldn't split cLISI and iLISI, either. In fact, I would rather merge the computation of both cLISI and iLISI, because the steps are the same, just a different label to compute the metric on - it would save quite some computation time.

mumichae commented 3 years ago

From what I can see, the only time saved is during recomputing the kNN graph. If we move the control of preprocessing outside of the ilisi_graph and clisi_graph functions, that time could be halved. Also, I didn't think that was the most time consuming step (but would have to double-check). In the original implementation all the LISI computations (using the C code) are called separately for iLISI and cLISI anyway, so splitting up the code shouldn't change much.

Additionally, the old lisi_graph function works as before, computing both scores, so we could use that for the metrics function if we wanted. I just thought it would be best to have the same level of control over iLISI an cLISI as we would have over all the other metrics when using the API.

mbuttner commented 3 years ago

From what I can see, the only time saved is during recomputing the kNN graph. If we move the control of preprocessing outside of the ilisi_graph and clisi_graph functions, that time could be halved. Also, I didn't think that was the most time consuming step (but would have to double-check). In the original implementation all the LISI computations (using the C code) are called separately for iLISI and cLISI anyway, so splitting up the code shouldn't change much.

Additionally, the old lisi_graph function works as before, computing both scores, so we could use that for the metrics function if we wanted. I just thought it would be best to have the same level of control over iLISI an cLISI as we would have over all the other metrics when using the API.

The thing with the two LISI functions is that the batch_key/label_key info is only used at the very end of compute_simpson_index_graph:

batch = batch_labels[knn_idx]
B = convertToOneHot(batch, n_batches)
sumP = np.matmul(P, B)  # sum P per batch
simpson[i[0]] = np.dot(sumP, sumP)  # sum squares

So the most efficient way to compute both iLISI and cLISI would be to add both batch and label keys at this piece of code. However, then both functions are invariably intertwined.

mumichae commented 3 years ago

The thing with the two LISI functions is that the batch_key/label_key info is only used at the very end of compute_simpson_index_graph:

batch = batch_labels[knn_idx]
B = convertToOneHot(batch, n_batches)
sumP = np.matmul(P, B)  # sum P per batch
simpson[i[0]] = np.dot(sumP, sumP)  # sum squares

So the most efficient way to compute both iLISI and cLISI would be to add both batch and label keys at this piece of code. However, then both functions are invariably intertwined.

So that means that the old implementation of the score wasn't the most efficient either, because we call the lisi_graph_py and with that the compute_simpson_index_graph twice anyway.

Would it be better if we extracted the compute_simpson_index_graph functionality outside of lisi_graph_py, compute it once and then pass the output for cLISI and iLISI computations? In that case I'd suggest we optimise the lisi_graph function, but still keep separate cLISI and iLISI functions for the API. The latter could have an option to preprocess the simpson index and pass it to those functions. I hope that isn't overkill...

mumichae commented 3 years ago

I think this PR should be ready to merge and we can address LISI computation optimisations in a different PR. Any comments or wishes to the current state?

gokceneraslan commented 3 years ago

This PR also fixes, #266, so it'd be really great to see it merged 😄

mumichae commented 3 years ago

@mbuttner Could you go over the latest changes and approve if the PR is ready to be merged?

LuckyMD commented 3 years ago

Once @mbuttner reviews, you can merge. All good from my side now.