heathsc / gemBS

gemBS is a bioinformatics pipeline designed for high throughput analysis of DNA methylation from Whole Genome Bisulfite Sequencing data (WGBS).
GNU General Public License v3.0

Alignment Modes #91

Closed HarryZhang1224 closed 2 years ago

HarryZhang1224 commented 2 years ago

Hello,

Could you please elaborate on the differences between the alignment modes (fast, sensitive, customed)?

Thanks, Harry

JakeLehle commented 2 years ago

This should be listed in the paper and the manual. Give me a few hours and I'll pull it up after dinner.

While I look, are you familiar with 3-letter aligners and the differences between a DNA methylation aligner like gemBS and bwa-meth?

This will save me from typing a bunch of stuff depending on how familiar you are.

HarryZhang1224 commented 2 years ago

Yes I am familiar with those. Thank you so much!

heathsc commented 2 years ago

In principle the 3-letter aligners should have better sensitivity (i.e., they could find a unique mapping where gemBS or bwa-meth find two mappings with equal score). In practice, however, this is rare with reads of 75 bp or more, and it is not clear that this behaviour of the 3-letter aligners is an advantage, because it introduces a methylation-specific bias: the probability that a read is uniquely aligned can depend on the methylation status of the CpGs in the read. (We could add this behaviour to gem3 by filtering the mappings using the original sequence and reference, but so far we have decided not to.)

Simon
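To make the bias argument concrete, here is a minimal toy sketch in Python. It is not how gemBS, gem3, or any other aligner is implemented. It assumes two hypothetical reference loci that differ at a single C/T site, and it compares plain in-silico conversion (read and reference both reduced to three letters) with an extra filter on the original read and reference bases, which is one reading of the filtering step mentioned above. All names and sequences are made up for illustration.

```python
# Toy model only: two hypothetical reference loci that differ at one C/T site.
LOCI = {
    "locus_A": "ACGTTACGGA",  # has a CpG starting at position 1
    "locus_B": "ATGTTACGGA",  # same sequence, but T instead of C at position 1
}

def convert_c_to_t(seq: str) -> str:
    """In-silico bisulfite conversion of the forward strand: every C becomes T."""
    return seq.replace("C", "T")

def hits_converted(read: str) -> list:
    """Compare the fully converted read against the fully converted loci."""
    target = convert_c_to_t(read)
    return [name for name, seq in LOCI.items() if convert_c_to_t(seq) == target]

def hits_using_original(read: str) -> list:
    """Filter on the original bases: a read T may sit on a reference C or T,
    but a read C (methylated, so unconverted) only fits a reference C."""
    def compatible(r: str, g: str) -> bool:
        return r == g or (r == "T" and g == "C")
    return [
        name for name, seq in LOCI.items()
        if len(seq) == len(read) and all(compatible(r, g) for r, g in zip(read, seq))
    ]

# Both reads come from locus_A; they differ only in methylation at position 1.
METHYLATED_READ = "ACGTTATGGA"    # C at position 1 protected from conversion
UNMETHYLATED_READ = "ATGTTATGGA"  # C at position 1 converted to T

for label, read in [("methylated", METHYLATED_READ), ("unmethylated", UNMETHYLATED_READ)]:
    print(label, hits_converted(read), hits_using_original(read))
```

In this toy case the converted comparison places both reads at both loci, while the filter on original bases places only the methylated read uniquely. Whether a read ends up uniquely mapped then depends on its methylation state, which is the bias being described.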


JakeLehle commented 2 years ago

Thanks @heathsc. So by changing GEM3 to sensitive there will be more gapped alignment and dynamic programming. Will this lead to an increase in the number of reads with low quality scores when looking at the BAM files with samtools flagstat?

I've always liked that low-quality reads are filtered out of the BAMs, compared to bwa-meth. However, I'm curious whether running the aligner on a setting other than fast would have drawbacks for the overall runtime and for the quality of the reads in the output BAM files.

HarryZhang1224 commented 2 years ago

Hi Jake,

Thank you for the detailed explanation! This helps a lot. I am seeing these settings under gem-mapper --help, which listed these 3 modes under the single-end alignment mode.

Harry

On Tue, May 3, 2022 at 7:18 AM Jake Lehle wrote:

Hmmmm, okay. I dug around last night and I see your frustration, so I'm glad you are asking this; it makes the info easier for people to find. I was having trouble pulling up a clear explanation of those settings as well. The setting belongs to the GEM3 mapper, managed by @smarco (https://github.com/smarco), who can probably go into more detail if we ask him.

As a short answer: as I'm sure you already know, the speed of the bwa and gem3 mappers comes from quickly finding seeds with the Burrows-Wheeler transform (BWT) and then extending from those seeds to try to find EXACT matches. Why exact? Because aligning unique exact matches is easy and can be done without breaking a sweat (i.e., without using a bunch of RAM). But what happens when there isn't an exact match and you have multiple possible hits on the reference genome, or you are doing paired alignment and the reads align to different chromosomes, or you have indels, so there isn't a good match and you have to do gapped alignment?

Well, fear not. GEM3 and gemBS have some serious improvements over BWA in that regard. First off, the complete searches through the index and the sorting of matches into strata help a lot in getting exact matches that are properly paired and in filtering out low-quality reads, which addresses the first two concerns. This is something BWA struggles with, and it often leads to a higher number of reads having low quality scores or aligning to different chromosomes if you look at samtools flagstat on the BAM files.

Okay, what about gapped alignment? That's where these settings would make a difference. BWA has a default of allowing only 2 mismatches max on a seed before throwing it out, and because it only uses 2 GB of RAM, it does everything in its power to avoid dynamic programming, which would slow processing down even more than it already is. However, gemBS is way faster than bwa-meth, so if you change the setting to sensitive it will probably allow more mismatches and more gapped alignment in order to find more matches. How much more? I honestly couldn't find out last night. But it will probably slow things down a lot, so keep that in mind and make sure to increase RAM above the 38 GB needed for the complete searches through the indexes.

Because I had more trouble with my search than I expected, I have a question for you. I pulled up gemBS and went through the help output. Where are you seeing these settings? I didn't see them under this command, which is where you mainly use the GEM3 mapper:

gemBS map -h
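The seed-and-extend idea in the reply above can be sketched as a small toy in Python. This is only an illustration under stated assumptions: it swaps in a plain k-mer dictionary for the BWT/FM-index that bwa and GEM3 really use, it only counts mismatches (no gapped alignment), and every name, sequence, and parameter is hypothetical.

```python
from collections import defaultdict

def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer to the positions where it occurs (stand-in for an FM-index)."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_and_extend(read: str, reference: str, index: dict, k: int, max_mismatches: int) -> list:
    """Look up exact seeds, then check the whole read at each candidate position.

    Returns a sorted list of (reference_start, mismatches) pairs.
    """
    hits = {}
    seen = set()
    for offset in range(0, len(read) - k + 1, k):  # non-overlapping seeds
        seed = read[offset:offset + k]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference) or start in seen:
                continue
            seen.add(start)
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())

REFERENCE = "TTGACCGTAGGCTAACGTTAGC"
K = 4
INDEX = build_kmer_index(REFERENCE, K)

exact_read = "CGTAGGCT"     # copied from REFERENCE[5:13]
mismatch_read = "CGTAGGAT"  # same read with one substituted base

print(seed_and_extend(exact_read, REFERENCE, INDEX, K, 0))
print(seed_and_extend(mismatch_read, REFERENCE, INDEX, K, 0))
print(seed_and_extend(mismatch_read, REFERENCE, INDEX, K, 1))
```

The read with a substitution is found only once the mismatch budget is raised, which is the kind of trade-off between speed and sensitivity the settings are about.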


JakeLehle commented 2 years ago


Yeah, okay, that makes sense. I originally deleted my response comment since @heathsc got back to you on this. I saw that setting yesterday on the GEM3 manual page.

[image: https://user-images.githubusercontent.com/84940857/166520162-d542d944-8787-4542-9833-b26fa479ef36.png]

But you are right, that is vague. Frankly, I would leave the setting as fast, just to avoid introducing bias and slowing down the pipeline. Is there a particular reason you wanted to play around with the alignment?

HarryZhang1224 commented 2 years ago

Not really, I am an undergraduate student currently trying to summarize some commonly used aligners in BS contexts, so I just wanted to understand the difference! Thanks for the help!

Harry


smarco commented 2 years ago

Hi,

Fast is the most practical approach and works pretty well. Sensitive and customed are more exotic and for research purposes (i.e., my thesis).

I hope this helps.

HarryZhang1224 commented 2 years ago

Hi,

  • fast: the default mapping mode. It gives accuracy very similar to bwa-mem and is usually faster. It performs a single, simple mapping pass, extracting as many seeds as possible from the input sequence and reporting the results it finds. In most cases this mode is more than enough, but it does not guarantee that 100% of the sequences will be mapped or that their uniqueness will be established (i.e., we can find a match, but there may be another match with one more mismatch/indel).
  • sensitive: significantly slower than "fast", as it resorts to a costly algorithm to find mappings (or assess uniqueness) for those sequences for which "fast" mode was not sufficient.
  • customed: lets you set the maximum number of errors to explore when searching for mappings (i.e., how deep to search).
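The uniqueness caveat in the list above can be shown with a small toy in Python. It is a brute-force sketch using Hamming distance on a made-up reference, not GEM3's actual search, and the `max_errors` parameter is a hypothetical stand-in for the "maximum number of errors" idea, not a real gem-mapper option.

```python
from collections import defaultdict

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def matches_by_stratum(read: str, reference: str, max_errors: int) -> dict:
    """Group every match position by its error count, up to max_errors."""
    strata = defaultdict(list)
    for pos in range(len(reference) - len(read) + 1):
        errors = hamming(read, reference[pos:pos + len(read)])
        if errors <= max_errors:
            strata[errors].append(pos)
    return dict(sorted(strata.items()))

REFERENCE = "ACGTACGTTTGACGTACCTTT"
READ = "ACGTACGT"

# Searching only for exact matches, the read looks uniquely placed.
print(matches_by_stratum(READ, REFERENCE, max_errors=0))
# Searching one error deeper reveals a second candidate location.
print(matches_by_stratum(READ, REFERENCE, max_errors=1))
```

A shallow search reports a single exact hit, while a search one error deeper finds a near-identical second location, so the first answer did not establish uniqueness. Searching deeper costs more comparisons, which matches the note that the deeper modes are slower.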

Fast is the most practical approach and works pretty well. Sensitive and customed are more exotic and for research purposes (i.e., my thesis).

I hope this helps.

Thank you for your detailed explanation!