Closed rcannood closed 1 year ago
I'm not really sure. It might be something from the original paper or possibly someone decided that "random" should be the floor for the score.
If random should be the floor, it would still need a rescaling. As task lead, what would you prefer, @lazappi?
Imo it should be clipped to 0. If it performs worse than random, then the downstream rescaling will take care of it in whichever way we decide rescaling should be done :)
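To make the proposal concrete, here is a minimal sketch of clip-then-rescale. This is illustration only, not the openproblems code: the `clip_and_rescale` helper and the min-max rescaling against `worst`/`best` baselines are assumptions about how the downstream step might look.

```python
import numpy as np

def clip_and_rescale(raw, worst, best):
    """Hypothetical scoring step: floor worse-than-random values at 0,
    then apply a (assumed) min-max rescale against baseline scores."""
    clipped = np.maximum(raw, 0.0)  # worse-than-random metrics floor at 0
    return (clipped - worst) / (best - worst)

# A metric of -0.2 (worse than random) maps to 0 rather than a negative score.
scores = clip_and_rescale(np.array([-0.2, 0.3, 0.9]), worst=0.0, best=1.0)
# → array([0. , 0.3, 0.9])
```

With this shape, the clipping and the rescaling stay independent, so the rescaling convention can change later without touching the metric itself.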
I agree with this in principle, but I think we should check the original publication to see if there is a reason they defined it that way.
The original code is:
import numpy as np

def coranking_matrix_metrics(Q):
    # Q[0, 0] is always m: the 0-th nearest neighbor is always the point itself, so exclude it
    Q = Q[1:, 1:]
    m = len(Q)
    T = np.zeros(m - 1)  # trustworthiness
    C = np.zeros(m - 1)  # continuity
    QNN = np.zeros(m)    # co-k-nearest-neighbor size
    LCMC = np.zeros(m)   # local continuity meta criterion
    for k in range(m - 1):
        Qs = Q[k:, :k]
        # a column vector of weights; weight = rank error = actual_rank - k
        W = np.arange(Qs.shape[0]).reshape(-1, 1)
        # 1 - normalized hard-k-intrusions (lower-left region), weighted by rank error (rank - k)
        T[k] = 1 - np.sum(Qs * W) / (k + 1) / m / (m - 1 - k)
        Qs = Q[:k, k:]
        # a row vector of weights; weight = rank error = actual_rank - k
        W = np.arange(Qs.shape[1]).reshape(1, -1)
        # 1 - normalized hard-k-extrusions (upper-right region)
        C[k] = 1 - np.sum(Qs * W) / (k + 1) / m / (m - 1 - k)
    for k in range(m):
        QNN[k] = np.sum(Q[:k + 1, :k + 1]) / ((k + 1) * m)
        LCMC[k] = QNN[k] - (k + 1) / (m - 1)
    kmax = np.argmax(LCMC)
    Qlocal = np.sum(QNN[:kmax + 1]) / (kmax + 1)
    # skip the last entry: the (m-1)-nearest neighbor includes all samples
    Qglobal = np.sum(QNN[kmax:-1]) / (m - kmax - 1)
    AUC = np.mean(QNN)
    return T, C, QNN, AUC, LCMC, kmax, Qlocal, Qglobal
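For context, the co-ranking matrix `Q` that this function expects can be built from pairwise distance ranks in the original space and the embedding. The `coranking` helper below is my own sketch (not from the repo), shown with a sanity check: an embedding identical to the input gives a diagonal `Q`, i.e. every pairwise rank is preserved exactly.

```python
import numpy as np

def coranking(X, Y):
    """Sketch of a co-ranking matrix: Q[k, l] counts point pairs ranked
    k-th nearest in the original space X and l-th nearest in the embedding Y."""
    def ranks(D):
        # ranks[i, j] = position of j in i's distance ordering (0 = the point itself)
        order = np.argsort(D, axis=1)
        r = np.empty_like(order)
        r[np.arange(len(D))[:, None], order] = np.arange(len(D))[None, :]
        return r

    DX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    DY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    n = len(X)
    Q = np.zeros((n, n), dtype=int)
    # count co-occurrences of (rank in X, rank in Y) over all point pairs
    np.add.at(Q, (ranks(DX).ravel(), ranks(DY).ravel()), 1)
    return Q

# Sanity check: a perfect "embedding" (identical coordinates) yields Q = n * I,
# since every pair keeps exactly the same rank in both spaces.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
Q = coranking(X, X.copy())
assert np.array_equal(Q, len(X) * np.eye(len(X), dtype=int))
```

Feeding such a `Q` into `coranking_matrix_metrics` would then give the trustworthiness/continuity curves and the QNN-based AUC discussed above.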
In v2, I spent some time giving each of the metrics a better name and tracking down the actual reference for each one: see here
I was refactoring the Python code accordingly, but then remembered that there's an R package that computes exactly what we need, so I went with that implementation instead: see here
https://github.com/openproblems-bio/openproblems/blob/14d70b330cae09527a6d4c4e552db240601e31cf/openproblems/tasks/dimensionality_reduction/metrics/nn_ranking.py#L192-L195
When worse than random, an AUC can be below 0.5, no?
Thanks for clarifying!