huishenlab / biscuit

BISulfite-seq CUI Toolkit

BAM file as the input #35

Closed bounlu closed 1 year ago

bounlu commented 1 year ago

Hello,

Does BISCUIT accept BAM files from other aligners as input, such as Bismark, which uses Bowtie 2?

Thanks.

jamorrison commented 1 year ago

Hi,

Yes, biscuit can accept SAM/BAM-compliant files from other aligners.

Cheers, Jacob


bounlu commented 1 year ago

Thanks for the prompt reply.

Related to that, may I also ask which version of the BAM file should be provided to BISCUIT?

1. *.bam
2. *.deduplicated.bam
3. *.deduplicated.sorted.bam

jamorrison commented 1 year ago

You would want to use option 3, *.deduplicated.sorted.bam, as input. You'll also want to index your sorted BAM before running biscuit.

Note that if you marked duplicates with a tool other than Bismark (samblaster, picard, etc.), you can use that BAM as input to biscuit as long as it is sorted and indexed. BISCUIT ignores duplicate-marked reads by default.
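For reference, the sort-and-index step could look roughly like the sketch below. It assumes samtools is on the PATH; the helper names `run` and `prepare_bam`, the `DRY_RUN` switch, and the file names are made up for illustration and are not part of BISCUIT or Bismark.

```shell
#!/usr/bin/env bash
# Sketch: coordinate-sort and index a deduplicated BAM before giving it to biscuit.
# Assumes samtools is installed; helper and file names are hypothetical.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# prepare_bam IN.bam OUT.bam
prepare_bam() {
    local in_bam="$1" out_bam="$2"
    run samtools sort -o "$out_bam" "$in_bam"   # position (coordinate) sort
    run samtools index "$out_bam"               # writes the index next to OUT.bam
}

# Example call:
#   prepare_bam sample.deduplicated.bam sample.deduplicated.sorted.bam
```

Setting `DRY_RUN=1` only prints the two samtools commands, which can be handy for checking file names before starting a long sort.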

bounlu commented 1 year ago

I guessed as much; however, I have two concerns:

  1. Bismark does not mark duplicates but actually removes them (unlike picard MarkDuplicates), so I was wondering if this would affect biscuit in any way. For example, QC.sh dup_report will always display "Number of duplicate reads" as 0.

  2. By default, the Bismark deduplicated BAM is not position-sorted, as the subsequent bismark_methylation_extractor requires a name-sorted BAM. Therefore, it has to be position-sorted in a separate step before it can be fed into biscuit, and that sort is a heavy step.

Indexing is the easy part.
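Both concerns can be checked on a given BAM before deciding whether the extra sort is needed. The sketch below assumes the BAM header carries an @HD line with an SO: tag (as described in the SAM specification) and that the tag is accurate; the function name `sort_order_from_header` and the file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: inspect what a BAM header declares about its sort order.
# Assumes the header has an @HD line with an SO: tag; prints nothing otherwise.

# Read SAM header text on stdin and print the value of the SO: tag on the @HD line.
sort_order_from_header() {
    awk -F'\t' '$1 == "@HD" {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SO:/) { print substr($i, 4); exit }
        }
    }'
}

# With samtools installed, the declared sort order of a BAM:
#   samtools view -H sample.deduplicated.bam | sort_order_from_header
#
# Number of reads carrying the duplicate flag (FLAG bit 0x400 = 1024):
#   samtools view -c -f 1024 sample.deduplicated.bam
```

A value of `coordinate` suggests the file is already position-sorted; anything else (or no output) means it most likely still needs `samtools sort`. A duplicate count of 0 is expected when the duplicates were removed rather than marked.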

jamorrison commented 1 year ago

Hopefully the following addresses your concerns:

  1. For biscuit pileup -> biscuit vcf2bed to extract methylation or SNPs, it's okay that duplicate marked reads have been removed. The default behavior is to skip these reads, so even if you marked duplicates with picard MarkDuplicates, they wouldn't be included in that case either. For the specific case of QC.sh, since the duplicates have been removed, the script can't register any duplicates and you'll get the "correct" answer of 0. If you need to find the duplicate rate in your data, you'll have to retain those duplicates either with Bismark (if it allows it) or by marking duplicates with a different tool.
  2. While Bismark may only need name-sorted inputs to extract methylation, BISCUIT needs a position-sorted BAM. This is in order to make use of the random access provided by the BAM index, which allows for both parallel processing of the BAM and accessing specific regions of the genome (e.g., the -g option in biscuit pileup). I'm not sure what your specific use case is, but the biscuitBlaster pipeline (https://huishenlab.github.io/biscuit/biscuitblaster/#version-1) will do alignment, duplicate marking, and coordinate-sorting in a one-liner so you get a BAM that's ready for input to biscuit pileup (assuming you do the quick process of samtools index after the one-liner).
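As a rough illustration of the biscuit pileup -> biscuit vcf2bed route on a sorted, indexed BAM, a sketch is given below. It assumes biscuit, bgzip, and tabix are on the PATH; the helper names `run` and `extract_cpg`, the `DRY_RUN` switch, and all file names are hypothetical, and the optional fourth argument is passed to the -g option mentioned above.

```shell
#!/usr/bin/env bash
# Sketch: extract CpG methylation from a coordinate-sorted, indexed BAM.
# Assumes biscuit, bgzip, and tabix are installed; names below are hypothetical.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# extract_cpg REF.fa IN.bam PREFIX [REGION]
extract_cpg() {
    local ref="$1" bam="$2" prefix="$3" region="${4:-}"

    # Pileup over the whole BAM, or only over REGION (e.g. chr1:1-1000) via -g.
    if [ -n "$region" ]; then
        run biscuit pileup -g "$region" -o "${prefix}.vcf" "$ref" "$bam"
    else
        run biscuit pileup -o "${prefix}.vcf" "$ref" "$bam"
    fi

    # Compress and index the VCF.
    run bgzip "${prefix}.vcf"
    run tabix -p vcf "${prefix}.vcf.gz"

    # Convert the VCF to a BED file of CpG methylation.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "biscuit vcf2bed -t cpg ${prefix}.vcf.gz > ${prefix}_cpg.bed"
    else
        biscuit vcf2bed -t cpg "${prefix}.vcf.gz" > "${prefix}_cpg.bed"
    fi
}

# Example call:
#   extract_cpg reference.fa sample.deduplicated.sorted.bam sample chr1:1-1000
```

The region form relies on the BAM index, which is why the input has to be position-sorted and indexed first.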