openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics

Why is qnn_auc clipped between 0.5 and 1.0? #781

Closed rcannood closed 1 year ago

rcannood commented 1 year ago

https://github.com/openproblems-bio/openproblems/blob/14d70b330cae09527a6d4c4e552db240601e31cf/openproblems/tasks/dimensionality_reduction/metrics/nn_ranking.py#L192-L195

When an embedding performs worse than random, the AUC can drop below 0.5, no?

Thanks for clarifying!
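
For context, the linked lines floor qnn_auc at the value expected from a random embedding (per the issue title, a clip to [0.5, 1.0]). A minimal illustration of the behavior being questioned, as a plain np.clip sketch rather than the exact code at that link:

import numpy as np

raw_auc = 0.42                     # hypothetical embedding that scrambles neighborhoods
print(np.clip(raw_auc, 0.5, 1.0))  # 0.5 -- indistinguishable from a random embedding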

lazappi commented 1 year ago

I'm not really sure. It might be something from the original paper or possibly someone decided that "random" should be the floor for the score.

LuckyMD commented 1 year ago

If random should be the floor, it would still need rescaling. As task lead, what would you prefer, @lazappi?

rcannood commented 1 year ago

Imo it should be clipped to 0. If it performs worse than random, then the downstream rescaling will take care of it in whichever way we decide rescaling should be done :)
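
A rough numerical sketch of the two options (the baseline values are made up; the actual downstream rescaling is decided elsewhere in openproblems):

import numpy as np

raw = np.array([0.30, 0.42, 0.80])  # two worse-than-random embeddings and one decent one
print(np.clip(raw, 0.5, 1.0))       # current floor at 0.5: [0.5, 0.5, 0.8] -- the two bad embeddings collapse to the same score
print(np.clip(raw, 0.0, 1.0))       # floor at 0: [0.3, 0.42, 0.8] -- ordering preserved, rescaling handled downstream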

lazappi commented 1 year ago

I agree with this in principle, but I think we should check the original publication to see if there is a reason they defined it that way.

rcannood commented 1 year ago

The original code is:

import numpy as np

def coranking_matrix_metrics(Q):
    Q = Q[1:, 1:]
    m = len(Q)

    T = np.zeros(m - 1)   # trustworthiness
    C = np.zeros(m - 1)   # continuity
    QNN = np.zeros(m)     # co-k-nearest neighbor size
    LCMC = np.zeros(m)    # local continuity meta criterion

    for k in range(m - 1):
        Qs = Q[k:, :k]
        W = np.arange(Qs.shape[0]).reshape(-1, 1)  # a column vector of weights; weight = rank error = actual_rank - k
        T[k] = 1 - np.sum(Qs * W) / (k + 1) / m / (m - 1 - k)  # 1 - normalized hard-k-intrusions; lower-left region, weighted by rank error (rank - k)
        Qs = Q[:k, k:]
        W = np.arange(Qs.shape[1]).reshape(1, -1)  # a row vector of weights; weight = rank error = actual_rank - k
        C[k] = 1 - np.sum(Qs * W) / (k + 1) / m / (m - 1 - k)  # 1 - normalized hard-k-extrusions; upper-right region

    for k in range(m):
        QNN[k] = np.sum(Q[:k + 1, :k + 1]) / ((k + 1) * m)  # the 0-th nearest neighbor (the point itself) was excluded by dropping the first row/column above
        LCMC[k] = QNN[k] - (k + 1) / (m - 1)

    kmax = np.argmax(LCMC)
    Qlocal = np.sum(QNN[:kmax + 1]) / (kmax + 1)
    Qglobal = np.sum(QNN[kmax:-1]) / (m - kmax - 1)  # skip the last entry: the (m-1)-nearest neighborhood includes all samples
    AUC = np.mean(QNN)

    return T, C, QNN, AUC, LCMC, kmax, Qlocal, Qglobal
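
For anyone who wants to run the snippet above, here is a usage sketch. It assumes the standard coranking-matrix construction (ranks of pairwise distances, with rank 0 being the point itself); the helper names ranking_matrix and coranking_matrix are mine, not from the codebase:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def ranking_matrix(X):
    # rank of every sample with respect to every other sample (rank 0 = the point itself)
    D = squareform(pdist(X))
    return np.argsort(np.argsort(D, axis=1), axis=1)

def coranking_matrix(X_high, X_low):
    # Q[k, l] counts pairs ranked k-th in the original space and l-th in the embedding
    n = X_high.shape[0]
    Q = np.zeros((n, n), dtype=int)
    np.add.at(Q, (ranking_matrix(X_high).ravel(), ranking_matrix(X_low).ravel()), 1)
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # "high-dimensional" data
Y = X[:, :2]                    # a crude embedding: keep the first two coordinates
T, C, QNN, AUC, LCMC, kmax, Qlocal, Qglobal = coranking_matrix_metrics(coranking_matrix(X, Y))
print(AUC)                      # mean of QNN; the value the linked metric clips to [0.5, 1.0]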

In v2, I spent some time giving each metric a better name and tracking down the actual reference for each one: see here

I was refactoring the Python code accordingly, but remembered that there's an R package that computes exactly what we need, so I went with that implementation instead: see here