heathsc / gemBS

gemBS is a bioinformatics pipeline designed for high throughput analysis of DNA methylation from Whole Genome Bisulfite Sequencing data (WGBS).
GNU General Public License v3.0
32 stars 21 forks source link

Adapter trimming #12

Open berguner opened 6 years ago

berguner commented 6 years ago

Hello,

Is adapter trimming required or recommended for WGBS analysis using gemBS?

heathsc commented 6 years ago

Not normally, unless you are expecting to see a large fraction of DNA fragments that are less than 1 read length. In general the GEM3 mapper is quite good at mapping even with quite high levels of adaptor contamination.

Simon

On 28 Feb 2018, at 13:22, berguner notifications@github.com wrote:

Hello,

Is adapter trimming required or recommended for WGBS analysis using gemBS?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/12, or mute the thread https://github.com/notifications/unsubscribe-auth/ADHPd0BBnwU2zavaGR_cS7WWFyGgoSo_ks5tZUURgaJpZM4SWkOy.

berguner commented 6 years ago

Hello Simon,

We have a library which was prepared for 75PE sequencing so the insert size was short. We also sequenced this library with 150PE on NovaSeq to test the instrument. As expected, there is a lot of adapter contamination in the 150PE data. Although gem3-mapper aligns most of the reads regardless of the adapter contamination, the rate of unique reads decreases considerably. For the same sample the average unique rate increases from 44% to 54% after adapter+3'quality trimming whereas it is 74% in the 75PE data. The rate for 75PE is much higher because GEM3 mapper gives 0 mapping quality to 100% overlapping pairs. Moreover, the adapters are aligned with gaps and mismatches rather than soft clipping which might -IMHO- increase the FDR of variant/bisulfite calling. Please have a look at the IGV images below where you can see the difference between 75PE, adapter trimmed 150PE and non-trimmed 150PE data (from top to bottom respectively). https://drive.google.com/file/d/1oG7Y1wRxKEjarDN5HC1qg_MkuIxpqztQ/view?usp=sharing https://drive.google.com/file/d/1yrXTMHkK1xZRz1BWcBe03Qit7BeBr-Fa/view?usp=sharing

Best, Bekir

heathsc commented 6 years ago

Hi Bekir,

Thanks for the report. The libraries that we see mostly have many pairs where the two reads overlap, but very few where the insert size is less than a single read-length. In this case GEM3 will not set bit 2 in the SAM flags (indicating ‘proper’ alignment), and the read pair will be discarded by bs_call. I could add a flag to the caller to not discard such read pairs and that should (hopefully) resolve many of the problems that you are seeing.

Simon

On 6 Mar 2018, at 10:17, berguner notifications@github.com wrote:

Hello Simon,

We have a library which was prepared for 75PE sequencing so the insert size was short. We also sequenced this library with 150PE on NovaSeq to test the instrument. As expected, there is a lot of adapter contamination in the 150PE data. Although gem3-mapper aligns most of the reads regardless of the adapter contamination, the rate of unique reads decreases considerably. For the same sample the average unique rate increases from 44% to 54% after adapter+3'quality trimming whereas it is 74% in the 75PE data. The rate for 75PE is much higher because GEM3 mapper gives 0 mapping quality to 100% overlapping pairs. Moreover, the adapters are aligned with gaps and mismatches rather than soft clipping which might -IMHO- increase the FDR of variant/bisulfite calling. Please have a look at the IGV images below where you can see the difference between 75PE, adapter trimmed 150PE and non-trimmed 150PE data (from top to bottom respectively). https://drive.google.com/file/d/1oG7Y1wRxKEjarDN5HC1qg_MkuIxpqztQ/view?usp=sharing https://drive.google.com/file/d/1oG7Y1wRxKEjarDN5HC1qg_MkuIxpqztQ/view?usp=sharing https://drive.google.com/file/d/1yrXTMHkK1xZRz1BWcBe03Qit7BeBr-Fa/view?usp=sharing https://drive.google.com/file/d/1yrXTMHkK1xZRz1BWcBe03Qit7BeBr-Fa/view?usp=sharing Best, Bekir

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/12#issuecomment-370715180, or mute the thread https://github.com/notifications/unsubscribe-auth/ADHPd4pH5Mj_ynULyTAXJV9ClxxtDIFaks5tblQXgaJpZM4SWkOy.

berguner commented 6 years ago

Thank you Simon, it would be great if you can implement this soon.

Best, Bekir

berguner commented 6 years ago

Hello Simon,

I could add a flag to the caller to not discard such read pairs (100% overlapping)

Did you have a chance to implement this feature?

Best, Bekir

heathsc commented 6 years ago

Hi Bekir,

In principle the new version of gemBS (3.0.0) should have this feature. The --keep-unmatched option to gemBS call will not discard read pairs that are not pairing correctly.

Simon