nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
130 stars 7 forks source link

modkit pileup and summary difference #259

Closed SimonChen1997 closed 5 hours ago

SimonChen1997 commented 1 week ago

Hi,

It's my first time doing DNA methylation, I'm a little bit confused about the results from the pileup and summary.

  1. Are the pileup results from the pass_frac in the summary? how do the results from pileup and summary talk to each other?
  2. are pileup and summary using the same default settings to filter failed modified frac, like "--filter-threshold" and "--mod-thresholds"? (I know the default "--filter-percentile" for both are the same.)

Cheers, Ziming

ArtRand commented 2 days ago

Hello @SimonChen1997,

  1. Are the pileup results from the pass_frac in the summary? how do the results from pileup and summary talk to each other?

Modkit summary works at the read-level providing counts of the number of modified bases and the estimated primary base thresholds. The threshold estimation algorithm is equivalent to what is used in pileup (and sample-probs). A few key differences are:

  1. modkit summary command doesn't require mapped reads.
  2. modkit summary is only really designed to work on a sampling (i.e. a subset) of the reads in the modBAM.

If you want an complete count of the fraction of base modification calls that pass the filter threshold, I would generate the pileup table and perform an awk or similar aggregation. I could imagine implementing a summary-from-pileup command but that's currently not available.

  1. are pileup and summary using the same default settings to filter failed modified frac, like "--filter-threshold" and "--mod-thresholds"? (I know the default "--filter-percentile" for both are the same.)

Yes, all of the defaults are the same throughout modkit with the exception of point (1) above. To force modkit summary to use mapped bases, use --only-mapped.

SimonChen1997 commented 5 hours ago

Hello @SimonChen1997,

  1. Are the pileup results from the pass_frac in the summary? how do the results from pileup and summary talk to each other?

Modkit summary works at the read-level providing counts of the number of modified bases and the estimated primary base thresholds. The threshold estimation algorithm is equivalent to what is used in pileup (and sample-probs). A few key differences are:

  1. modkit summary command doesn't require mapped reads.
  2. modkit summary is only really designed to work on a sampling (i.e. a subset) of the reads in the modBAM.

If you want an complete count of the fraction of base modification calls that pass the filter threshold, I would generate the pileup table and perform an awk or similar aggregation. I could imagine implementing a summary-from-pileup command but that's currently not available.

  1. are pileup and summary using the same default settings to filter failed modified frac, like "--filter-threshold" and "--mod-thresholds"? (I know the default "--filter-percentile" for both are the same.)

Yes, all of the defaults are the same throughout modkit with the exception of point (1) above. To force modkit summary to use mapped bases, use --only-mapped.

Thanks! That makes more sense to me now 😄!