theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
https://arxiv.org/abs/2204.13545
MIT License

How to calculate the uncertainty #112

Closed tuln128 closed 1 year ago

tuln128 commented 1 year ago

Dear authors, thank you very much for sharing such a nice tool as chemCPA. In the article, you describe the calculation of the uncertainty as follows:

[image: uncertainty formula from the article]

If possible, could you please explain a little bit more about the definition (and/or calculation) of X, which is mentioned as "the normalised pathway prediction from the neighbours of drug i"?

Thank you very much in advance.
Kind regards,

MxMstrmn commented 1 year ago

Hi @tuln128,

We compute the entropy as follows:

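import numpy as np
import pandas as pd

def entropy(column, base=None):
    # Relative frequency of each distinct label in `column`
    vc = pd.Series(column).value_counts(normalize=True, sort=False)
    base = np.e if base is None else base  # natural log by default
    return -(vc * np.log(vc) / np.log(base)).sum()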

So for a drug i, we take its 10 nearest neighbours in the latent space and use their pathway labels as an indication of embedding quality. If all neighbours come from the same pathway, the entropy will be low and the prediction good. If they come from multiple pathways, we assume that there is some uncertainty about the drug embedding.
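As a minimal sketch of the idea above, the following compares the entropy of a neighbourhood that comes from a single pathway with one split evenly across two pathways. The pathway names and neighbour lists are made up for illustration and are not taken from the chemCPA data; the `entropy` helper is repeated here (with `np.e` as the default base) so that the example runs on its own.

```python
import numpy as np
import pandas as pd


def entropy(column, base=None):
    # Shannon entropy of the label frequencies in `column`
    vc = pd.Series(column).value_counts(normalize=True, sort=False)
    base = np.e if base is None else base
    return -(vc * np.log(vc) / np.log(base)).sum()


# Hypothetical pathway labels of the 10 nearest neighbours of two drugs
pure_neighbours = ["HDAC"] * 10
mixed_neighbours = ["HDAC"] * 5 + ["Tyrosine kinase"] * 5

h_pure = entropy(pure_neighbours, base=2)    # all neighbours agree
h_mixed = entropy(mixed_neighbours, base=2)  # neighbours split 50/50

print(f"single pathway: {abs(h_pure):.3f} bits")
print(f"two pathways:   {h_mixed:.3f} bits")
```

With base 2, a neighbourhood drawn from a single pathway has zero entropy, while an even split across two pathways gives exactly one bit.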

You can also check it here in this notebook: https://github.com/theislab/chemCPA/blob/a4a4ded0c3b949c64ff1ea51033be1b7c301c36b/notebooks/chemCPA_Table_4.py#L324

tuln128 commented 1 year ago

Hi @MxMstrmn,

Thank you very much for the detailed explanation. I was able to figure out how H(X) is calculated from the link you shared.

According to the following reference: https://github.com/theislab/chemCPA/blob/a4a4ded0c3b949c64ff1ea51033be1b7c301c36b/notebooks/chemCPA_Table_4.py#L452

the distances are summed before the log is taken:

adata.obs.loc[adata.obs.drug == adata.obs.drug.iloc[i], "uncertainty"] = (
    1 / np.log(distances[i].sum()) * entropy(pathways, base=2)
)

which seems to be the opposite of the definition mentioned above. Could you please explain this difference, or correct me if I have misunderstood?
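To make the question concrete, here is a small sketch showing that the order of the sum and the log does change the result. The distances and the entropy value are hypothetical numbers chosen for illustration; the first expression mirrors the notebook line quoted above, while the second is only one possible alternative ordering and is not confirmed to be the formula in the article.

```python
import numpy as np

# Hypothetical distances from one drug to its 10 nearest neighbours
distances = np.array([1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 2.6, 3.0, 3.3, 3.5])
h = 1.0  # hypothetical entropy (base 2) of the neighbours' pathway labels

# Ordering used in the quoted notebook line: sum first, then log
log_of_sum = 1 / np.log(distances.sum()) * h

# Alternative ordering, for comparison only: log each distance, then sum
sum_of_logs = 1 / np.log(distances).sum() * h

print(f"1 / log(sum d) * H = {log_of_sum:.4f}")
print(f"1 / sum(log d) * H = {sum_of_logs:.4f}")
```

The two values differ, so the two orderings are not interchangeable. In both cases the log term must be positive for the result to be a positive uncertainty score.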

Thank you very much in advance, and sorry for bothering you so much!
Kind regards,