openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

[spatial decomposition] metrics to be used #261

Open giovp opened 3 years ago

giovp commented 3 years ago

@hiraksarkar @almaan @vitkl here to discuss which metrics to use and the aggregation strategy

almaan commented 3 years ago

I think that the two metrics that @vitkl suggested are nice to include:

We also discussed including JSD (Jensen-Shannon divergence), which treats the cell type proportions more as a distribution across the different spots. It's also preferable to the KLD since it's bounded below by zero and above by one (when using base 2 in the logarithm), which is not true for the KLD. However, thinking more about this, I think the JSD could cause some issues if a cell type has both zero estimated and zero true probability of being in a spot (which would cause a zero division), see link. Would propose to use either of.
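
To make the bounds and the zero-probability concern concrete, here is a minimal sketch of KLD and JSD for discrete proportion vectors. It is only an illustration, not the benchmark's implementation: the function names are hypothetical, and it assumes the common convention that terms with `p_i == 0` contribute nothing (0 * log 0 = 0), which is one way to avoid the zero division when a cell type is absent from both vectors.

```python
import math


def kl_divergence(p, q, base=2.0):
    """KL(p || q), skipping terms where p_i == 0 (convention: 0 * log 0 = 0)."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i == 0.0:
                return math.inf  # KLD has no upper bound
            total += p_i * math.log(p_i / q_i, base)
    return total


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence; lies in [0, 1] when base is 2."""
    # The mixture m is positive wherever p or q is, so no division by zero occurs.
    m = [(p_i + q_i) / 2.0 for p_i, q_i in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)


true_props = [0.5, 0.5, 0.0]
pred_props = [0.0, 0.5, 0.5]

print(js_divergence(true_props, true_props))  # identical vectors -> 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # disjoint support -> 1.0
print(kl_divergence(true_props, pred_props))  # inf
```

With this convention a cell type that is zero in both vectors is simply skipped, so the result stays finite.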

For these two metrics there would potentially be two alternative approaches:

  1. Compute the distance between each pair of true and predicted proportion vectors (one pair per spot) and take the average of these distances
  2. Normalize the proportion values for a cell type across all spots (so they sum to one) and measure the distance over the whole dataset.
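
The two approaches can be sketched as follows. This is a hedged illustration only: it assumes the proportions are stored as a list of rows (one row per spot, one column per cell type), uses the Hellinger distance as a stand-in for whichever distance is chosen, and, for approach 2, assumes the per-cell-type distances are averaged into a single number. All function names are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def normalize(values):
    """Rescale non-negative values to sum to one; all-zero input is returned unchanged."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def per_spot_average(true, pred, dist=hellinger):
    """Approach 1: one distance per spot (row), then the mean over spots."""
    return sum(dist(t, p) for t, p in zip(true, pred)) / len(true)


def per_cell_type_average(true, pred, dist=hellinger):
    """Approach 2: normalize each cell type (column) across spots, then compare."""
    true_cols = [normalize(col) for col in zip(*true)]
    pred_cols = [normalize(col) for col in zip(*pred)]
    return sum(dist(t, p) for t, p in zip(true_cols, pred_cols)) / len(true_cols)


# Toy data: 3 spots x 2 cell types (made-up numbers for illustration).
true = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]
pred = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]

print(per_spot_average(true, pred))
print(per_cell_type_average(true, pred))
```

Approach 1 asks how well each spot's composition is recovered, while approach 2 asks how well each cell type's spatial distribution is recovered.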

vitkl commented 3 years ago

I think it is important for metrics to be easily interpretable by the users.

PR (precision-recall) macro-average across cell types represents the accuracy/sensitivity at detecting cell abundance > 0, with PR curves averaged across cell types. It does not work well for cell types that are expected to be absent in all locations, because the PR curve cannot be computed for them.
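
A rough sketch of this metric, under a few assumptions that are not from the discussion: each PR curve is summarized by its average precision, ground truth is binarized as abundance > 0, tied scores are not treated specially, and cell types absent in all locations are skipped. The function names are hypothetical.

```python
def average_precision(is_present, scores):
    """Step-wise area under the PR curve; None if there are no positives."""
    n_pos = sum(is_present)
    if n_pos == 0:
        return None  # recall is undefined, so the PR curve cannot be computed
    ranked = sorted(zip(scores, is_present), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, present) in enumerate(ranked, start=1):
        if present:
            hits += 1
            total += hits / rank  # precision at the rank of each true positive
    return total / n_pos


def macro_average_precision(true, pred):
    """Mean average precision over cell types (columns), skipping absent ones."""
    aps = []
    for true_col, pred_col in zip(zip(*true), zip(*pred)):
        ap = average_precision([t > 0 for t in true_col], pred_col)
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps) if aps else None


# Toy data: 3 spots x 3 cell types; the third cell type is absent everywhere.
true = [[0.6, 0.4, 0.0], [0.0, 1.0, 0.0], [0.3, 0.7, 0.0]]
pred = [[0.5, 0.4, 0.1], [0.2, 0.8, 0.0], [0.4, 0.5, 0.1]]

print(macro_average_precision(true, pred))
```

Skipping absent cell types keeps the average defined, but it also means false positives for those cell types go unpenalized, which is the limitation noted above.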

R^2 - represents the consistency between estimated and ground-truth cell proportions; also, many people are used to looking at scatterplots and R^2.
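
For reference, a minimal sketch of R^2 as the coefficient of determination, assuming the true and estimated proportions are flattened into two equal-length lists. Whether the benchmark uses this definition or, for example, the squared Pearson correlation is not stated here, and the function name is hypothetical.

```python
def r_squared(true, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the ground truth is constant")
    return 1.0 - ss_res / ss_tot


# Made-up proportions, flattened over spots and cell types.
true_flat = [0.8, 0.2, 0.5, 0.5, 0.1, 0.9]
pred_flat = [0.7, 0.3, 0.5, 0.5, 0.2, 0.8]

print(r_squared(true_flat, pred_flat))
```

Under this definition a perfect prediction scores 1, always predicting the mean scores 0, and worse predictions can go negative.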

@almaan can you explain the metrics you proposed in 'plain English'?

almaan commented 3 years ago

Fully agree,

Bhattacharyya coefficient - measures the amount of overlap or similarity between two distributions, by integration or summation over the probability space. Here the distributions we are looking at would either be how a cell type is distributed across all spots (comparing true vs predicted), or how the cell types are distributed within each spot, then computing the average of these coefficients as per (1). The interpretation of the latter would be the average similarity between true and predicted cell type proportions in each spot.

Hellinger distance - very similar to the Bhattacharyya coefficient but forms a proper distance metric; it also measures the distance between distributions.
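
The relationship between the two can be written out in a few lines. This is a sketch for discrete proportion vectors that each sum to one; the function names are hypothetical.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Overlap between two discrete distributions: sum_i sqrt(p_i * q_i), in [0, 1]."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def hellinger_distance(p, q):
    """Hellinger distance, sqrt(1 - BC); clamped at 0 to absorb rounding error."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))


# Made-up cell type proportions for a single spot.
true_spot = [0.8, 0.2]
pred_spot = [0.7, 0.3]

print(bhattacharyya_coefficient(true_spot, pred_spot))  # close to 1: high overlap
print(hellinger_distance(true_spot, pred_spot))         # close to 0: small distance
```

A coefficient of 1 (distance 0) means identical proportions, and a coefficient of 0 (distance 1) means no cell type is shared. Zero entries need no special handling.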

Still, to me R^2 is a given, which is why I implemented it in the latest PR.