Closed mumichae closed 3 years ago
Did we really want to split cLISI and iLISI? These are computed in the same function, no?
Did we really want to split cLISI and iLISI? These are computed in the same function, no?
I wouldn't split cLISI and iLISI, either. In fact, I would rather merge the computation of both cLISI and iLISI, because the steps are the same, just a different label to compute the metric on - it would save quite some computation time.
From what I can see, the only time saved is during recomputing the kNN graph. If we move the control of preprocessing outside of the ilisi_graph
and clisi_graph
functions, that time could be halved. Also, I didn't think that was the most time consuming step (but would have to double-check). In the original implementation all the LISI computations (using the C code) are called separately for iLISI
and cLISI
anyway, so splitting up the code shouldn't change much.
Additionally, the old lisi_graph
function works as before, computing both scores, so we could use that for the metrics
function if we wanted. I just thought it would be best to have the same level of control over iLISI
an cLISI
as we would have over all the other metrics when using the API.
From what I can see, the only time saved is during recomputing the kNN graph. If we move the control of preprocessing outside of the
ilisi_graph
andclisi_graph
functions, that time could be halved. Also, I didn't think that was the most time consuming step (but would have to double-check). In the original implementation all the LISI computations (using the C code) are called separately foriLISI
andcLISI
anyway, so splitting up the code shouldn't change much.Additionally, the old
lisi_graph
function works as before, computing both scores, so we could use that for themetrics
function if we wanted. I just thought it would be best to have the same level of control overiLISI
ancLISI
as we would have over all the other metrics when using the API.
The thing with the two LISI functions is that the batch_key
/label_key
info is only used at the very end of compute_simpson_index_graph
:
batch = batch_labels[knn_idx]
B = convertToOneHot(batch, n_batches)
sumP = np.matmul(P, B) # sum P per batch
simpson[i[0]] = np.dot(sumP, sumP) # sum squares
So the most efficient way to compute both iLISI
and cLISI
would be to add both batch and label keys at this piece of code. However, then both functions are invariably intertwined.
The thing with the two LISI functions is that the
batch_key
/label_key
info is only used at the very end ofcompute_simpson_index_graph
:batch = batch_labels[knn_idx] B = convertToOneHot(batch, n_batches) sumP = np.matmul(P, B) # sum P per batch simpson[i[0]] = np.dot(sumP, sumP) # sum squares
So the most efficient way to compute both
iLISI
andcLISI
would be to add both batch and label keys at this piece of code. However, then both functions are invariably intertwined.
So that means that the old implementation of the score wasn't the most efficient either, because we call the lisi_graph_py
and with that the compute_simpson_index_graph
twice anyway.
Would it be better if we extracted the compute_simpson_index_graph
functionality outside of lisi_graph_py
, compute it once and then pass the output for cLISI and iLISI computations?
In that case I'd suggest we optimise the lisi_graph
function, but still keep separate cLISI and iLISI functions for the API. The latter could have an option to preprocess the simpson index and pass it to those functions. I hope that isn't overkill...
I think this PR should be ready to merge and we can address LISI computation optimisations in a different PR. Any comments or wishes to the current state?
This PR also fixes, #266, so it'd be really great to see it merged 😄
@mbuttner Could you go over the latest changes and approve if the PR is ready to be merged?
Once @mbuttner reviews, you can merge. All good from my side now.
Clean up LISI and kBET metrics.