When evaluating models before release, we should add measurements for the "expected nearby" label: Recall (is it being applied when it should be?), Precision (is it being applied only when it should be?), and F1 (is there a good balance of the two?). These could be measured as:
Recall: If the ground truth is in the set of Expected Nearby suggestions, record recall for that observation as 1; otherwise record it as 0.
Precision: If the ground truth is not in the set of Expected Nearby suggestions, record precision as 0; otherwise record it as 1 divided by the total number of Expected Nearby suggestions returned.
F1: If precision and recall are both 0, record F1 as 0 (this avoids dividing by zero). Otherwise calculate F1 as the harmonic mean of precision and recall: F1 = (2 × precision × recall) / (precision + recall)
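The per-observation rules above can be sketched as a small helper. This is a minimal illustration, not existing code; the function and argument names are hypothetical.

```python
def nearby_metrics(ground_truth, suggestions):
    """Per-observation recall, precision, and F1 for the "expected nearby" label.

    ground_truth: the single correct item for this observation.
    suggestions: the list of Expected Nearby suggestions returned.
    (Names here are illustrative, not from any existing API.)
    """
    hit = ground_truth in suggestions
    # Recall is 1 if the ground truth appears among the suggestions, else 0.
    recall = 1.0 if hit else 0.0
    # Precision is 1 / (number of suggestions) on a hit, else 0.
    precision = 1.0 / len(suggestions) if hit else 0.0
    # Harmonic mean, guarding the divide-by-zero case where both are 0.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = (2 * precision * recall) / (precision + recall)
    return recall, precision, f1
```

Dataset-level numbers would then be the averages of these per-observation values across the evaluation set.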