github / CodeSearchNet

Datasets, tools, and benchmarks for representation learning of code.
https://arxiv.org/abs/1909.09436
MIT License

question: calculating mrr and loss #202

Closed · sedflix closed this issue 4 years ago

sedflix commented 4 years ago

I'm building a model for this task using my own codebase, and I wanted to confirm whether this way of calculating the loss and MRR is correct.

import tensorflow as tf
from tensorflow.keras import backend as K

def softmax_loss(y_true, y_pred):
    # q, c: query and code embeddings, each of shape (batch_size, vector_dimension)
    q, c = y_pred
    # Pairwise similarities: row i scores query i against every code snippet in the batch.
    similarity_scores = tf.matmul(q, K.transpose(c))
    # The matching snippet for query i sits at column i, so the label for row i is i.
    per_sample_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=similarity_scores,
        labels=tf.range(tf.shape(q)[0])
    )
    # Mean over the batch (equivalent to sum / batch_size).
    return tf.reduce_mean(per_sample_loss)

def mrr(y_true, y_pred):
    q, c = y_pred
    similarity_scores = tf.matmul(q, K.transpose(c))
    # Score of the correct (query i, code i) pair for each row.
    correct_scores = tf.linalg.diag_part(similarity_scores)
    # Rank of the correct snippet = number of candidates scoring at least as high;
    # the mean of the reciprocal ranks over the batch is the MRR.
    compared_scores = similarity_scores >= tf.expand_dims(correct_scores, axis=-1)
    compared_scores = tf.cast(compared_scores, tf.float16)
    return K.mean(tf.math.reciprocal(tf.reduce_sum(compared_scores, axis=1)))

Here, q and c are the query and code feature vectors, each of shape (batch_size, vector_dimension). I'm a bit hesitant because both the MRR and the loss seem to depend on the batch_size and on the kind of examples in the batch: if the examples are closely related the MRR might be low, and if they are far apart the MRR can be high.

EDIT: I've looked through some of the existing issues. Is my understanding correct that during testing we use a batch_size of 1000 and we don't shuffle the data?
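
For context, this is the kind of quick sanity check I run, with random tensors standing in for real encoder outputs (the batch size of 1000 only mirrors the evaluation setup I'm asking about; the numbers themselves are meaningless):

import tensorflow as tf

# Random embeddings standing in for real query/code encoder outputs (illustrative only).
batch_size, dim = 1000, 128
q = tf.random.normal((batch_size, dim))
c = tf.random.normal((batch_size, dim))

loss_value = softmax_loss(None, (q, c))  # y_true is unused by both functions
mrr_value = mrr(None, (q, c))
print(float(loss_value), float(mrr_value))
# With unrelated random embeddings the MRR should sit near chance level
# (roughly 0.007 for 1000 candidates); a trained model should score much higher.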

mallamanis commented 4 years ago

Hi @sedflix
a) There are multiple possible losses that one could use. The one you have here seems correct. We also have a few more loss functions here.
b) Your computation of MRR also seems correct (and closely matches the one we have here).
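
For instance, one common alternative is a max-margin (hinge) loss computed over the same in-batch similarity matrix. The sketch below is only illustrative and is not copied from the repository; the function name and the margin of 1.0 are arbitrary choices:

import tensorflow as tf
from tensorflow.keras import backend as K

def max_margin_loss(y_true, y_pred, margin=1.0):
    # Illustrative sketch only; margin=1.0 is an arbitrary choice, not a repo default.
    q, c = y_pred
    similarity_scores = tf.matmul(q, K.transpose(c))         # (batch, batch)
    correct_scores = tf.linalg.diag_part(similarity_scores)  # score of each true pair
    # Hinge on every negative pair: penalise negatives that come within `margin`
    # of the correct pair's score. The diagonal contributes a constant `margin`
    # per row, which is removed by subtracting it afterwards.
    per_pair = tf.nn.relu(margin - tf.expand_dims(correct_scores, -1) + similarity_scores)
    per_sample = tf.reduce_sum(per_pair, axis=1) - margin    # drop the diagonal term
    return tf.reduce_mean(per_sample)

The intuition is the same as for your softmax loss: the correct pair on the diagonal should score higher than every other pair in the batch, here by at least the margin.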

However, there are two types of test sets:
a) One is the public test set (the portion of the dataset that should be used to evaluate intrinsic performance). We have been using MRR for this. However, this is not the test data/metric used for the leaderboard.
b) There is a second, hidden test set "behind" the leaderboard. It consists of natural language queries with relevance judgements from human annotators. The target score there is NDCG (computed using this code). Submissions that achieve a higher NDCG rank higher.
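
For concreteness, NDCG here is the standard normalised discounted cumulative gain over the annotators' graded relevance judgements; a generic sketch of the metric (not the leaderboard's actual evaluation script) looks like this:

import numpy as np

def dcg(relevances):
    # Discounted cumulative gain: graded relevance discounted by log2 of the rank.
    relevances = np.asarray(relevances, dtype=float)
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum(relevances / np.log2(ranks + 1))

def ndcg(relevances_in_ranked_order):
    # Normalise by the DCG of the ideal (relevance-sorted) ordering of the same judgements.
    ideal_dcg = dcg(sorted(relevances_in_ranked_order, reverse=True))
    return dcg(relevances_in_ranked_order) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: relevance judgements of the top-5 results returned for one query.
print(ndcg([3, 2, 3, 0, 1]))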

Hopefully this helps! If so, please close this issue :)