gstricker / GenoGAM

http://bioconductor.org/packages/devel/bioc/html/GenoGAM.html
7 stars 4 forks source link

Does GenoGAM pay attention to the SAM flags field for marking duplicates or should duplicates be completely removed from the input? #3

Open RichardJActon opened 4 years ago

RichardJActon commented 4 years ago

Hi I'm just preparing some data for analysis with GenoGAM and i'm removing PCR duplicates from my data I read the documentation and didn't see anywhere where it stated if GenoGAM respects the duplicate flag in SAM/BAM files or if it should be supplied with BAMs with duplicate reads filtered out entirely. Could you clarify this?

I can always take removing duplicates option to be safe but I prefer not drop information from my BAMs if I don't have to - saves time if I need it later.

Thanks for developing GenoGAM!

gstricker commented 4 years ago

Hi Richard,

thank you for using GenoGAM and sorry for the late reply, I'm busy all week in my current occupation.

GenoGAM uses Rsamtools ScanBamParam function to read in the BAM files. By default, it reads in the 'pos' and 'qwidth' field and takes otherwise the default setting of this function which uses isDuplicate=NA that resolves to "any (NA) reads should be returned".

However, you can use the GenoGAMSettings object to supply your own ScanBamParam object with the flag activated. Looking at the ScanBamParam function documentation this might be something like:

settings <- GenoGAMSettings(bamParams = Rsamtools::ScanBamParam(what = c("pos", "qwidth"), flag = Rsamtools::scanBamFlag(isDuplicate = TRUE)))

I haven't tried it out, so please let me know if that works for you.

RichardJActon commented 4 years ago

Thanks @gstricker I'll give that a try if I get the chance and let you know the outcome - for now I already put deduplication of my BAMs in my pipeline before GenoGAM. Thanks for letting me know that is possible.