Hi, thanks for your great work. I have read the paper and reviewed the code, and I see that F1 is used as the metric for the common_concept dataset. However, the evaluation function `utility.get_multi_answer_f1` seems to have a bug in this case. In the current implementation, it receives a pair of single words, e.g. [Yes, Yes] (prediction, ground truth), when computing F1, rather than a pair of word lists such as [[Yes, Yes, No], [No, Yes, No]]. Because each test sample is scored separately, I don't think the returned F1 value is correct. Could you please clarify? Thanks!
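To make the concern concrete, here is a minimal sketch of a per-sample, token-level F1. This is only my guess at what `utility.get_multi_answer_f1` roughly does; the helper name `word_f1` and the toy data below are hypothetical, not taken from your code.

```python
from collections import Counter

def word_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between one prediction string and one ground-truth string."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

# When prediction and ground truth are each a single word, the per-sample F1
# is either 0 or 1, so averaging over samples reduces to exact-match accuracy
# rather than an F1 computed over the full answer word lists.
preds = ["Yes", "Yes"]   # hypothetical per-sample predictions
golds = ["Yes", "No"]    # hypothetical per-sample ground truths
print(sum(word_f1(p, g) for p, g in zip(preds, golds)) / len(preds))  # 0.5
```

If this matches what the evaluation does, then the reported "F1" for common_concept is effectively exact-match accuracy, which is what prompted my question.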
In addition, Figure 4 appears to use EA as the metric for the common_concept dataset, which is not consistent with the code (see the following snapshots).