heathsc / gemBS

gemBS is a bioinformatics pipeline designed for high throughput analysis of DNA methylation from Whole Genome Bisulfite Sequencing data (WGBS).
GNU General Public License v3.0
32 stars 21 forks source link

Alignment mode for RRBS #40

Open berguner opened 5 years ago

berguner commented 5 years ago

Hello,

RRBS protocol relies on digestion based enrichment using certain restriction enzymes such as Taq1 and MspI. These enzymes recognize and cut at certain DNA patterns (T-CGA, C-CGG) so we expect most of the reads to be mapped on corresponding genomic regions. Is it possible to implement a feature in the gem-mapper to prioritize these regions? I think it is crucial for gemBS to have such feature to support the claim that it is supporting RRBS analysis.

Best, Bekir

heathsc commented 5 years ago

Hi Bekir,

We would be quite happy to add this, but I need a bit more information. How would you propose that the prioritization should work - through resolving multimaps ? With RRBS do we expect each read fragment to start strictly at a restriction site?

Cheers, Simon

On 28 Nov 2018, at 09:50, berguner notifications@github.com wrote:

Hello,

RRBS protocol relies on digestion based enrichment using certain restriction enzymes such as Taq1 and MspI. These enzymes recognize and cut at certain DNA patterns (T-CGA, C-CGG) so we expect most of the reads to be mapped on corresponding genomic regions. Is it possible to implement a feature in the gem-mapper to prioritize these regions? I think it is crucial for gemBS to have such feature to support the claim that it is supporting RRBS analysis.

Best, Bekir

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/40, or mute the thread https://github.com/notifications/unsubscribe-auth/ADHPd7HjwzAVAIPwgf3yExkkuX7vq07oks5uzk5lgaJpZM4Y3H-q.

berguner commented 5 years ago

Hi Simon,

I am not really an expert on this topic but I will try to explain a bit more. We have been using BSMAP which basically restricts the genome into the regions starting with restriction patterns. Therefore it does not map outside these regions and assigns the reads which were not enriched as unmapped. This is not really the ideal solution because it does not report mappings outside of restrictions sites and also does not report whether reads are multimaps or not. However it has better performance because it only searches a fraction of the genome and we can evaluate how well had the enrichment worked based on the amount of reads aligned onto the restriction sites.

In my opinion a good solution should be able to: 1- Differentiate reads starting with the restriction patterns and map them accordingly, 2- Also map un-enriched (non-restricted) reads, 3- Correctly score/resolve multimaps, 4- Report enrichment rates based on the amount of reads conforming to enrichment criteria (starting with restriction pattern and also mapping onto the restriction sites of the genome).

We can imagine the reads starting with the restriction pattern (MspI:CGG/TGG, Taq1:CGA/TGA) are carrying an extra base at the beginning (MspI:C-CGG/C-TGG, Taq1:T-CGA/T-TGA) and map them together with this imaginary base. These patterns should be given as program arguments. I believe adding this imaginary base should help prioritize the restriction sites as correct mapping regions and also should not disrupt the resolving of multimaps.

In our facility we primarily use MspI so most of the reads carry CGG/TGG pattern at their start position. In a good sample where the source DNA was not degraded more than %95 of the reads carry this pattern, so we can assume that they come from digested DNA fragments. You can see this on the FastQC per-base sequence content plot: image

I hope this helps a little.

Best, Bekir

heathsc commented 5 years ago

Hi Bekir,

OK, I have some ideas on this and I will try them out with a few RRBS datasets. We can clearly add mapping statistics counting the number of reads mappings to a restriction site and those mapping elsewhere. We can also modify the MAPQ calculation so that if a read map equally well to two locations, one of which starts with the restriction site, the location with the restriction site would be output as the primary mapping with a MAPQ of somewhere around 10-15. This could be justified if we say that we expect ~95% of reads to be in the enriched fraction.

It’s not clear exactly what BSMAP is doing. The documentation is sparse and from the code their way of handling the RRBS appears to be very adhoc so I wouldn’t want to follow their example. When you say that BSMAP has better performance do you mean that it finds more uniquely mapping reads?

Simon

On 28 Nov 2018, at 11:34, berguner notifications@github.com wrote:

Hi Simon,

I am not really an expert on this topic but I will try to explain a bit more. We have been using BSMAP which basically restricts the genome into the regions starting with restriction patterns. Therefore it does not map outside these regions and assigns the reads which were not enriched as unmapped. This is not really the ideal solution because it does not report mappings outside of restrictions sites and also does not report whether reads are multimaps or not. However it has better performance because it only searches a fraction of the genome and we can evaluate how well did the enrichment worked based on the amount of reads aligned onto the restriction sites.

In my opinion a good solution should be able to: 1- Differentiate reads starting with the restriction patterns and map them accordingly, 2- Also map un-enriched (non-restricted) reads, 3- Correctly score multimaps, 4- Report enrichment rates based on the amount of reads conforming to enrichment criteria (starts with restriction pattern and also maps onto the genome the restriction pattern).

We can imagine the reads starting with the restriction pattern (MspI:CGG/TGG, Taq1:CGA/TGA) are carrying an extra base at the beginning (MspI:C-CGG/C-TGG, Taq1:T-CGA/T-TGA) and map them together with this imaginary base. These patterns should be given as program arguments. I believe adding this imaginary base should help prioritize the restriction sites as correct mapping regions and also should not disrupt the resolving of multimaps.

In our facility we primarily use MspI so most of the reads carry CGG/TGG pattern at their start position. In a good sample where the source DNA was not degraded more than %95 of the reads carry this pattern, so we can assume that they come from digested DNA fragments. You can see this on the FastQC per-base sequence content plot: https://user-images.githubusercontent.com/23213155/49142710-bc9d3d00-f2f9-11e8-8b93-d8aa49a64b09.png I hope this helps a little.

Best, Bekir

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/40#issuecomment-442399426, or mute the thread https://github.com/notifications/unsubscribe-auth/ADHPdxnQFx3YpdGFlPQnNI58XeaoUlFHks5uzmaagaJpZM4Y3H-q.

berguner commented 5 years ago

Hello Simon,

Thank you for looking into this, I think your solution will solve most of our problems.

BSMAP requires less CPU time and less memory when it is run in digestion/restriction mode in comparison to running in whole genome mode. In my experience, gem3-mapper was faster than BSMAP even if it was running in digestion mode. gem3-mapper requires more memory for searching whole genome, however this is not really a problem in our computing infrastructure. I would much prefer whole genome mapping capability rather than smaller memory footprint. So, I am not really concerned about the performance of gemBS.

Best, Bekir

berguner commented 5 years ago

Hello Simon,

Did you get the chance to implement this feature?

heathsc commented 5 years ago

It is one of the projects I am working on at the moment. I will reply to this thread when I have something to test.

Simon

On 27 Mar 2019, at 10:02, berguner notifications@github.com wrote:

Hello Simon,

Did you get the chance to implement this feature?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/40#issuecomment-477048114, or mute the thread https://github.com/notifications/unsubscribe-auth/ADHPd6QCFW01Jib3rme2l4R4X2r9DTy5ks5vazOjgaJpZM4Y3H-q.

berguner commented 5 years ago

That's great! If you need more data for testing I can always provide.

Best, Bekir

heathsc commented 5 years ago

If you have some testing datasets that would be great. It need not be an entire dataset - just 10 millions reads or so - and ideally one single end and one paired end.

Simon

On 27 Mar 2019, at 10:07, berguner notifications@github.com wrote:

That's great! If you need more data for testing I can always provide.

Best, Bekir

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/40#issuecomment-477050096, or mute the thread https://github.com/notifications/unsubscribe-auth/ADHPd3ZeTAJMN3imaP8C-QQnlVEPnfrZks5vazTLgaJpZM4Y3H-q.

berguner commented 5 years ago

You can use the data here: test_data Paired reads are 75bp and the single end reads are 50bp. The digestion enzyme is MspI so the digestion pattern is C-CGG.

Best, Bekir

berguner commented 4 years ago

Hi,

Is there any progress for this feature request? There are some large RRBS projects coming up and I would love to ditch our old pipeline and switch to gemBS for good.

Thanks again for looking into this, Cheers, Bekir