nyu-dl / dl4marco-bert

BSD 3-Clause "New" or "Revised" License
476 stars 87 forks

Estimator Scores #23

Closed anshoomehra closed 5 years ago

anshoomehra commented 5 years ago

Hi There:

I was trying to understand the output from the estimator, i.e. the below line of code:

result = estimator.predict(input_fn=eval_input_fn, yield_single_examples=True)

The outcome is as follows for a single record:

(array([-0.1572365, -1.9275929], dtype=float32), 1)

The first element is an array of probabilities and the second is a label.

I was expecting only one probability score, so why are there two? I see that they sum to 1; perhaps one is a confidence score and the other is (1 - confidence)? If that is correct, which of the two indices should be used as the prediction confidence?

I am guessing index [0] is the confidence score, but looking at the results I am a bit confused. Also, the code uses index [1] for evaluation, which adds even more ambiguity, so I thought I should verify.

Thanks! Anshoo

rodrigonogueira4 commented 5 years ago

The two numbers are the output logits of the neural network. They are the scores of the document being non-relevant and relevant to the query, respectively. In your example, because the first number (-0.1572) is higher than the second (-1.9275), the document was predicted as non-relevant to the query.

If you apply a softmax normalization to this array (tf.nn.softmax in TensorFlow, or https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.softmax.html for numpy), you will end up with the probabilities of being non-relevant and relevant, respectively, which sum to 1.
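For example, here is a minimal numpy sketch of that normalization applied to the logit pair from the question (the stable-softmax helper is just an illustration, mirroring what scipy.special.softmax computes):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

logits = np.array([-0.1572365, -1.9275929], dtype=np.float32)
probs = softmax(logits)
# probs[0] is P(non-relevant), probs[1] is P(relevant); they sum to 1.
# Here probs[0] > probs[1], so the prediction is non-relevant.
```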

anshoomehra commented 5 years ago

Thank you so much for the explanation @rodrigonogueira4 , follow up Qs:

The outputs are still a bit of a mystery given the above explanation. Is there some thresholding in play? Below are a few more examples, in case they help with the analysis:

[(array([-0.1572365, -1.9275929], dtype=float32), 1),
 (array([-0.28537616, -1.3932444 ], dtype=float32), 0),
 (array([-0.1607603, -1.9071442], dtype=float32), 1),
 (array([-0.15765752, -1.9251235 ], dtype=float32), 0),
 (array([-0.15768543, -1.92496 ], dtype=float32), 1),
 (array([-1.5580059 , -0.23642576], dtype=float32), 0),
 (array([-0.15837449, -1.9209356 ], dtype=float32), 1),
 (array([-0.19148138, -1.7471783 ], dtype=float32), 0),
 (array([-0.15796915, -1.9233006 ], dtype=float32), 1),
 (array([-0.1828588, -1.7890775], dtype=float32), 0)]

rodrigonogueira4 commented 5 years ago

Does the first logit pertain to the non-relevant score? Yes.

If the above is true, shouldn't the predicted label be output as '0'? No. The labels are not the predicted labels; they are the ground-truth labels, used to compute the metrics:

gt = set(list(np.where(labels > 0)[0]))

all_metrics += metrics.metrics(
    gt=gt, pred=pred_docs, metrics_map=METRICS_MAP)

To get the predicted labels, you could do:

predicted_labels = log_probs.argmax(1)

predicted_labels will be an array of size batch_size containing zeros and ones.
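As an illustration, applying argmax to two of the logit pairs from the examples above (one predicted non-relevant, one predicted relevant):

```python
import numpy as np

# Each row is a (non-relevant, relevant) logit pair from one example.
log_probs = np.array([
    [-0.1572365, -1.9275929],   # first logit larger -> predicted non-relevant
    [-1.5580059, -0.23642576],  # second logit larger -> predicted relevant
])

predicted_labels = log_probs.argmax(1)
print(predicted_labels)  # -> [0 1]
```

Note that argmax over the raw logits gives the same prediction as argmax over the softmax probabilities, since softmax is monotonic.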

anshoomehra commented 5 years ago

@rodrigonogueira4, you rock! Thank you so much! This solves all the mystery...

rodrigonogueira4 commented 5 years ago

Great! Please let me know if you have any other questions.