jump-cellpainting / 2024_Chandrasekaran_NatureMethods


Update summary for the benchmark? #75

Closed: hellowangqian closed this issue 5 months ago

hellowangqian commented 1 year ago

Dear author, many thanks for releasing the data and code for such a great project. I've been following the repository for a while and am trying to reproduce the results for the "benchmark". I just realised the code and results have been updated recently. I'm wondering if you could provide a summary of what has changed; that would be a big help for people like me to better understand the evaluation metrics and beyond. In particular, I'd like to know how to interpret the negative mmAP values in these results? Best wishes

niranjchandrasekaran commented 1 year ago

Hi @hellowangqian, I am pasting a paragraph from our updated manuscript, which should be up on bioRxiv in about two weeks. I hope that answers your questions. If you need further clarification, please do let me know.

Average Precision (AP), mean Average Precision (mAP) and fraction positive (fp)

Average precision (AP) is the weighted mean of precision values at each recall threshold (at each rank k where recall changes in a ranked list of similarity scores) for a given class. We use AP as a measure of replicability (how distinguishable replicates of a perturbation are from their reference, the negative control, i.e., whether they are retrieved towards the top of a list of samples ranked by similarity to the query perturbation) and of biological relevance (how distinguishable true compound pairs or true compound-gene pairs are from their reference, false pairs, i.e., whether the correct match is retrieved towards the top of such a ranked list). The definition of a class varies: each perturbation is a class when computing AP for replicability, and each gene targeted by a compound or genetic perturbation is a class when computing AP for biological relevance. We measure similarity between perturbations using cosine similarity.

Once computed, we adjust the AP by subtracting the 95th percentile of AP values for 10,000 shuffled ranked lists of size K+P (the random baseline), where K is the number of positive matches and P is the number of reference samples. Finally, we average this adjusted AP across the members of each class and report it per class, termed mean Average Precision (mAP). While measuring biological relevance, we filter out perturbations that are not replicable (mAP < 0) and remove classes with only a single member. We then summarize the mAP values of a task by calculating the fraction of values that are greater than 0 (the 95th percentile of the random baseline), termed fraction positive (fp).
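For concreteness, here is a minimal sketch of this adjustment in Python. It is not the repository's actual implementation; the function names and the toy query below are illustrative.

```python
import numpy as np

def average_precision(ranked_labels):
    """AP of a ranked list of binary labels (1 = positive match, 0 = reference)."""
    ranked_labels = np.asarray(ranked_labels)
    hit_ranks = np.flatnonzero(ranked_labels) + 1               # 1-indexed ranks of positives
    precisions = np.arange(1, len(hit_ranks) + 1) / hit_ranks   # precision at each recall change
    return precisions.mean()

def adjusted_ap(similarities, labels, n_shuffles=10_000, q=95, seed=0):
    """Observed AP minus the 95th percentile of AP over shuffled ranked lists."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    order = np.argsort(-np.asarray(similarities))    # rank by cosine similarity, descending
    observed = average_precision(labels[order])
    null = np.array([average_precision(rng.permutation(labels))
                     for _ in range(n_shuffles)])
    return observed - np.percentile(null, q)

# Toy query: K = 3 replicates among P = 20 reference samples, with
# cosine similarities to the query perturbation drawn at random.
rng = np.random.default_rng(1)
labels = np.array([1] * 3 + [0] * 20)
similarities = np.concatenate([rng.uniform(0.5, 0.9, 3), rng.uniform(0.0, 0.6, 20)])
print(adjusted_ap(similarities, labels))
```

mAP would then be the mean of this adjusted AP over the members of a class, and fp the fraction of a task's mAP values that exceed 0.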

hellowangqian commented 1 year ago

Hi @niranjchandrasekaran, many thanks for your prompt reply. This clarifies most of my confusion, except the question I added in the last sentence of my original post: why are the mAP/mmAP values negative, and how should these results be interpreted? By definition, mAP should always be positive, and even after subtracting the baseline it should be close to 0 rather than "very negative" (e.g., < -0.1). Is there anything I'm missing here? BTW, looking forward to your updated manuscript. :o)

niranjchandrasekaran commented 1 year ago

Hi @hellowangqian, because we define our baseline as the 95th percentile of AP values over 10k randomly ranked lists, the baseline can sometimes be larger than the AP of a treatment, which results in negative adjusted AP values. The way we think about them is that any treatment with an AP value less than or equal to zero is indistinguishable from the reference (either the negative control or other treatments in the experiment). We are working on an alternate approach that avoids these negative AP values and is more interpretable, but it was not used in this manuscript/repo.
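To see how adjusted values can become quite negative, consider a query with only a few positives: the null distribution of AP is then coarse, and its 95th percentile can be large. A small illustration, reusing the hypothetical `average_precision` helper sketched above:

```python
import numpy as np

rng = np.random.default_rng(0)
K, P = 2, 8                                   # 2 positive matches, 8 reference samples
labels = np.array([1] * K + [0] * P)
null = np.array([average_precision(rng.permutation(labels))
                 for _ in range(10_000)])
baseline = np.percentile(null, 95)            # large when K is small: roughly 0.7-0.8 here
print(f"adjusted AP for a treatment with AP = 0.60: {0.60 - baseline:+.2f}")
```

A treatment with a respectable raw AP of 0.60 then ends up with an adjusted value well below -0.1.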

hellowangqian commented 1 year ago

Great explanation. Everything is clear now. Thank you again.

badtom commented 1 year ago

Hi, perhaps you could simply subtract the performance of a random classifier (the percentage of positive points in the query set) from the AP values, similarly to what is done for areas under the precision-recall curve (AUPRC) in https://openreview.net/pdf?id=701FtuyLlAd. That way you would need no shuffled ranked lists, and by averaging over classes you could see whether a method performs better than a random baseline on average (i.e., the average is > 0).
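In code, this suggested correction would be a one-liner against the hypothetical `average_precision` helper sketched earlier (the prevalence is K / (K + P)):

```python
import numpy as np

def prevalence_adjusted_ap(similarities, labels):
    """AP minus the fraction of positives, approximately the AP of a random ranking."""
    labels = np.asarray(labels)
    order = np.argsort(-np.asarray(similarities))
    return average_precision(labels[order]) - labels.mean()
```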

niranjchandrasekaran commented 1 year ago

> perhaps you could simply subtract the performance of a random classifier (the percentage of positive points in the query set) from the AP values, similarly to what is done for areas under the precision-recall curve (AUPRC)

Hi @badtom, we initially used that approach to correct the mAP values, but we found that the percentage of positive points falls near the median of the random classifier's distribution of AP values. That distribution has a long right tail, and using its median as the baseline resulted in too many false positives. Hence, we switched to the 95th percentile of the distribution.
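A quick simulation (again reusing the sketched `average_precision` helper) illustrates the point: the prevalence sits near the middle of the null AP distribution, while the 95th percentile sits far out on its long right tail.

```python
import numpy as np

rng = np.random.default_rng(0)
K, P = 3, 30                                  # 3 positive matches, 30 reference samples
labels = np.array([1] * K + [0] * P)
null = np.array([average_precision(rng.permutation(labels))
                 for _ in range(10_000)])
print(f"prevalence K/(K+P) : {K / (K + P):.3f}")
print(f"null median        : {np.median(null):.3f}")          # near the prevalence
print(f"null 95th pctile   : {np.percentile(null, 95):.3f}")  # much larger
```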

badtom commented 1 year ago

Thank you for the information. I did some searching on the topic and found that, as opposed to AUPRC (https://itnext.io/the-baseline-for-precision-recall-curve-a-bayesian-approach-1611c690607), for AP the percentage of positive points is only an approximation of the random baseline's performance, though corrections can be computed as discussed in https://ufal.mff.cuni.cz/pbml/103/art-bestgen.pdf. Perhaps this is the reason for the discrepancies you observed? Also, it is not clear to me what you mean by false positives in the above comment; I hope to understand it better from the updated manuscript.

niranjchandrasekaran commented 1 year ago

Hi @badtom, thank you for sharing those links. The manuscript associated with this repo will likely not include all the details about the correction, but another postdoc in the group is currently working on a manuscript that focuses on mAP and the correction. I will make a note to share it with you once it is up on bioRxiv.

niranjchandrasekaran commented 5 months ago

Hi @badtom, the manuscript that I previously mentioned is now up on bioRxiv: https://www.biorxiv.org/content/10.1101/2024.04.01.587631v1. Since my last comment, we have updated our approach to calculating mAP. I hope this manuscript answers some of your questions.