neurogenomics / EpiCompare

Comparison, benchmarking & QC of epigenetic datasets
https://doi.org/doi:10.18129/B9.bioc.EpiCompare

Add precision/recall plots #80

Closed: bschilder closed this issue 2 years ago

bschilder commented 2 years ago

For peak files where the necessary info is included, we can make precision/recall plots like this, using the reference as the "ground truth":

[image: example precision/recall plots, using the reference as the "ground truth"]

@leylaabbasova could you point us to where you wrote the code for this?

bschilder commented 2 years ago

@NathanSkene Got it! The key was going back to the papers and understanding how each peak caller filters their peaks.

I've written a function (plot_precision_recall, and its subfunction precision_recall) which can generate precision-recall plots (and F1 plots as a bonus) for any combination of peakfiles and reference files.

It works with peaks called with any of the following: MACS2, MACS3, HOMER, and SEACR

In short, it works by iterating through increasingly stringent thresholds (the exact thresholds can be adjusted by users via several function args), computing percentiles for the relevant metric in each GRanges object, and then computing precision-recall between each peakfile and each reference file (all combinations). I also parallelized some steps to speed things up when running many comparisons.
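
For illustration, here is a minimal sketch of that idea for a single query/reference pair; this is not the actual plot_precision_recall implementation, and the score column name is an assumption:

## Minimal sketch only (not the actual EpiCompare code): precision/recall for
## one query GRanges vs. one reference GRanges across score quantiles.
## The score column name ("total_signal") is assumed for illustration.
library(GenomicRanges)

precision_recall_sketch <- function(query, reference,
                                    score_col = "total_signal",
                                    thresholds = seq(0, 1, by = 0.05)) {
  scores <- mcols(query)[[score_col]]
  do.call(rbind, lapply(thresholds, function(q) {
    ## keep only query peaks at or above the q-th quantile of the score
    keep <- query[scores >= quantile(scores, probs = q, na.rm = TRUE)]
    data.frame(
      threshold = q,
      ## precision: fraction of retained query peaks overlapping the reference
      precision = mean(overlapsAny(keep, reference)),
      ## recall: fraction of reference peaks recovered by the retained peaks
      recall    = mean(overlapsAny(reference, keep))
    )
  }))
}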

NOTE: It cannot compute precision-recall for thresholds that are less stringent than the one used to call the peaks originally. This is because peaks that didn't reach the peak caller's threshold criterion were dropped from the files entirely (so we don't have access to them in EpiCompare). Of course, anyone can always go back, rerun their peak calling with the most lenient thresholds, and then rerun plot_precision_recall across the full range of thresholds.

By default, SEACR uses a threshold of 0.5 when you don't supply an IgG control file. Basically, it just keeps the strongest 50% of peaks:

...a numeric threshold n between 0 and 1 returns the top n fraction of peaks based on total signal within peaks.

In light of this, it makes a lot of sense that our SEACR peaks have a max recall of almost exactly 50%. I suspect that if we used a less stringent threshold when calling peaks (or perhaps used an IgG control) we would see greater recall, perhaps at the cost of some precision, much like the plots in the initial post of this Issue indicate.
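
To make the quoted behaviour concrete, here is a toy illustration of keeping the top n fraction of peaks by total signal (SEACR does this filtering itself before the peaks ever reach EpiCompare; the column name is an assumption):

## Toy illustration of the quoted SEACR behaviour: keep the top `n` fraction of
## peaks ranked by total signal. The "total_signal" column name is assumed.
top_fraction <- function(peaks, n = 0.5, signal_col = "total_signal") {
  sig <- mcols(peaks)[[signal_col]]
  peaks[sig >= quantile(sig, probs = 1 - n, na.rm = TRUE)]
}
## With n = 0.5 only the strongest half of the called peaks survives, which is
## consistent with the ~50% maximum recall we see for the SEACR peaks.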

Example

data("CnR_H3K27ac")
data("CnT_H3K27ac")
data("encode_H3K27ac")
peakfiles <- list(CnR_H3K27ac=CnR_H3K27ac, CnT_H3K27ac=CnT_H3K27ac)
reference <- list("encode_H3K27ac" = encode_H3K27ac)

pr_out <- plot_precision_recall(peakfiles = peakfiles,
                                reference = reference)

[image: precision-recall and F1 plots produced by plot_precision_recall]

In addition to returning both plots in a list, it also returns the processed data used to make the plots:

[screenshot: the processed precision/recall data returned alongside the plots]
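
The returned list can be inspected with base R to see exactly what is included (no element names are assumed here):

## Inspect what plot_precision_recall() returns (plots plus processed data);
## element names are deliberately not hard-coded here.
names(pr_out)
str(pr_out, max.level = 1)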

bschilder commented 2 years ago

@serachoi1230 I've pushed these new functions to the master branch (with unit tests), but I'll leave it to you to decide where within the EpiCompare pipeline you'd like to integrate it. Let me know if there are any issues or if I need to modify the functions further.

NathanSkene commented 2 years ago

Would this not kinda invalidate our CUT&TAG preprint? E.g. the reason that we detect at most 50% of ENCODE peaks is that we're only detecting the 50% most significant peaks?

bschilder commented 2 years ago

Would this not kinda invalidate our CUT&TAG preprint? E.g. the reason that we detect at most 50% of ENCODE peaks is that we're only detecting the 50% most significant peaks?

Following up on our Zoom call earlier, I don't believe this invalidates our preprint results because Leyla ran SEACR with a very low threshold (0.01) and went up from there.

@NathanSkene @SarahMarzi also pointed out a bug where I had accidentally swapped precision/recall. I've gone back and checked, and you were indeed right; thanks for catching this. It should be correct now.

Ok, I keep going back and forth on this, and now I'm thinking I was correct the first time: as the threshold becomes less stringent, more peaks are called, recall (# reference peaks in sample peaks / total reference peaks) goes up, and precision (# sample peaks in reference peaks / total sample peaks) goes down, which is what the plots above are showing. Does that sound right to you, @SarahMarzi?
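
A toy numeric example (made-up counts) of that directionality:

## Made-up counts purely to sanity-check the directionality described above,
## assuming 1000 reference peaks throughout.
## Stringent threshold: few sample peaks, most of which land in the reference.
precision_strict <- 450 / 500    # sample peaks in reference / total sample peaks = 0.90
recall_strict    <- 430 / 1000   # reference peaks in sample peaks / total reference = 0.43

## Lenient threshold: many more sample peaks, so more reference peaks are
## recovered (recall up), but a smaller fraction of sample peaks sit in the
## reference (precision down).
precision_lenient <- 2500 / 5000 # = 0.50
recall_lenient    <- 900 / 1000  # = 0.90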

Perhaps the one part of the plots that is confusing is the far left of the PR curve, where both precision and recall drop to 0. I think this is happening because, at that ultra-high threshold, only 1 peak is included. When that 1 peak happens not to be in ENCODE, precision plummets to 0%; if it does happen to be in ENCODE, precision is 100%. So perhaps I need to adjust the thresholds to not include 1.

@serachoi1230 I've pushed these new functions to the master branch (with unit tests), but I'll leave it to you to decide where within the EpiCompare pipeline you'd like to integrate it. Let me know if there are any issues or if I need to modify the functions further.

I've taken care of integrating the PR curves into EpiCompare's report and added the argument precision_recall_plot (default=FALSE).

bschilder commented 2 years ago

Removing the threshold of 1 (i.e. only including the single strongest peak) from the list of thresholds to test seems to resolve the issue:

[image: adjusted precision/recall plots]
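
With the earlier sketch, the equivalent fix is simply to stop the quantile grid short of 1 so the single-strongest-peak edge case is never evaluated (the argument names belong to the sketch, not necessarily to plot_precision_recall, and the score column name is still an assumption):

## Same fix applied to the earlier sketch: exclude the most extreme quantile so
## a one-peak threshold is never tested.
res <- precision_recall_sketch(query      = CnT_H3K27ac,
                               reference  = encode_H3K27ac,
                               score_col  = "total_signal",          # assumed column name
                               thresholds = seq(0, 0.95, by = 0.05))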