Right now, we compute Hit@K alongside a triplet accuracy (the % of eval queries that score their positive document higher than the negative from their triplet).
Hit@K requires computing similarity scores for all query-document pairs in the eval dataset, which is already very expensive for dense models with a large eval set, and is prohibitively expensive with MaxSim scoring.
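For reference, a minimal sketch of the two metrics, assuming dense embeddings stored as NumPy arrays (the array names, shapes, and helper functions are illustrative assumptions, not the actual eval code):

```python
import numpy as np

def triplet_accuracy(query_emb, pos_emb, neg_emb):
    """Fraction of queries scoring their positive above their negative.

    Only needs one score per (query, positive) and (query, negative) pair,
    so it stays O(N) in the number of eval triplets.
    """
    pos_scores = np.einsum("nd,nd->n", query_emb, pos_emb)
    neg_scores = np.einsum("nd,nd->n", query_emb, neg_emb)
    return float(np.mean(pos_scores > neg_scores))

def hit_at_k(query_emb, doc_embs, positive_idx, k=5):
    """Fraction of queries whose positive document ranks in the top-k over all docs.

    Needs the full N x D score matrix, which is what makes it expensive for
    large eval sets and prohibitive when each score is a MaxSim over token
    embeddings instead of a single dot product.
    """
    scores = query_emb @ doc_embs.T                              # [N, D]
    topk = np.argpartition(-scores, kth=k - 1, axis=1)[:, :k]    # indices of k best docs
    hits = (topk == positive_idx[:, None]).any(axis=1)
    return float(np.mean(hits))
```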
We need metrics for monitoring training that are both more informative and tractable to compute.