molecule-one / megan

Code for "Molecule Edit Graph Attention Network: Modeling Chemical Reactions as Sequences of Graph Edits"
MIT License

Question about the evaluation metrics #10

Closed · YanjingLiLi closed this issue 10 months ago

YanjingLiLi commented 1 year ago

Hi, your work is impressive. I have a little question about your evaluation in https://github.com/molecule-one/megan/blob/master/bin/eval.py.

Based on my understanding, a model normally predicts m reactants for each product. The top-k is then calculated as the fraction of samples whose top-k predicted reactants contain at least one ground-truth reactant.

In your evaluation, it seems that for each product A you predict several complete reactant sets, denoted A1, A2, A3, ranked by a set score. The top-k is then calculated as the fraction of samples for which at least one of the top-k predicted sets completely matches the ground-truth reactant set. I am wondering whether this is stricter than the usual top-k evaluation, since all reactants have to be predicted correctly (see the sketch below). In other words, is it fair to compare this "top-k" with other works' "top-k"?
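To make my question concrete, here is a small sketch of the two definitions of a top-k "hit" I have in mind (the helper names and SMILES are made up for illustration; this is not your code):

```python
# Hypothetical helpers (not from eval.py) contrasting two notions of a top-k "hit".
def topk_any_reactant(gt_reactants, ranked_predictions, k):
    """Hit if any of the top-k predictions shares at least one reactant with the ground truth."""
    gt = set(gt_reactants)
    return any(gt & set(pred) for pred in ranked_predictions[:k])

def topk_exact_set(gt_reactants, ranked_predictions, k):
    """Hit only if some top-k prediction matches the full ground-truth reactant set."""
    gt = set(gt_reactants)
    return any(set(pred) == gt for pred in ranked_predictions[:k])

# Toy example with made-up SMILES; predictions are ranked by model score.
gt = ["CCO", "CC(=O)O"]
preds = [["CCO"], ["CCO", "CC(=O)O"]]
print(topk_any_reactant(gt, preds, k=1))  # True  -- partial overlap counts
print(topk_exact_set(gt, preds, k=1))     # False -- the whole set must match
```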

Please correct me if I have misunderstood anything. Thanks.

mikolajsacha commented 1 year ago

Hi, let me summarize how the Top K evaluation works, so we are on the same page. For each product, the model predicts M reactions, and each predicted reaction is represented by its set of reactants (the product is already known). This set of reactants can be represented as a single SMILES string, canonicalized with RdKit, with the reactants separated by the . character.
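For illustration, a minimal sketch of this representation with RdKit (this is not the code from our repository; sorting the reactants is just one way to make the string independent of their order):

```python
from rdkit import Chem

def reactant_set_smiles(reactant_smiles):
    """Canonicalize each reactant with RDKit and join them into one dot-separated string."""
    canonical = sorted(Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in reactant_smiles)
    return ".".join(canonical)

print(reactant_set_smiles(["OCC", "CC(O)=O"]))  # -> 'CC(=O)O.CCO'
```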

The top K metric measures what fraction of ground-truth reactions is found within the first K out of M predicted reactions, on average. To measure this, we compare the reactions predicted by the model to the ground-truth reactions. In most other works, this is done by comparing the canonical SMILES of the reactants of the predicted reaction to the canonical SMILES of the reactants of the ground-truth reaction. In our code, the function prediction_is_correct does an equivalent thing, but instead of comparing canonical SMILES we create Counters of the SMILES of each molecule within the reactants. We claim that this function gives exactly the same result as comparing the canonical SMILES; it was simply more convenient for us to write it this way.
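To illustrate the equivalence claim, here is a hedged sketch of the idea (not the exact prediction_is_correct implementation): both checks reduce to comparing the multiset of per-molecule canonical SMILES.

```python
from collections import Counter
from rdkit import Chem

def canon(smiles):
    """Canonical SMILES of a single molecule."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))

def same_reactants_by_counter(pred, gt):
    """Compare the multisets (Counters) of per-molecule canonical SMILES."""
    return Counter(canon(s) for s in pred.split(".")) == Counter(canon(s) for s in gt.split("."))

def same_reactants_by_string(pred, gt):
    """Compare sorted, dot-joined canonical SMILES of the whole reactant set."""
    join = lambda smi: ".".join(sorted(canon(s) for s in smi.split(".")))
    return join(pred) == join(gt)

pred, gt = "OCC.CC(O)=O", "CC(=O)O.CCO"
print(same_reactants_by_counter(pred, gt), same_reactants_by_string(pred, gt))  # True True
```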

Our evaluation code measures the same thing as the other works we compare ourselves to, so, to the best of our knowledge, it is a fair comparison.

I am not sure which set you are referring to in the code exactly. We have a set final_smis, but it is only used to gather already-predicted reactions in order to deduplicate predictions.
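If it helps, the deduplication it performs is conceptually something like this (a sketch, not the repository's code):

```python
def deduplicate_ranked(predicted_smiles):
    """Keep the first occurrence of each predicted reaction SMILES, preserving rank order."""
    seen, unique = set(), []
    for smi in predicted_smiles:
        if smi not in seen:
            seen.add(smi)
            unique.append(smi)
    return unique

print(deduplicate_ranked(["CCO.CC(=O)O", "CCO", "CCO.CC(=O)O"]))
# ['CCO.CC(=O)O', 'CCO']
```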