mbassalbioinformatics / CnRAP

Analytical pipeline developed to anlayze Cut and Run data. Inspired by both Henikoff (SEACR) and Orkin (Cut&RunTools) lab pipelines.
GNU General Public License v3.0
18 stars 7 forks source link

How to normalize the hg38 bam based on the sacCer bam and then get normalized bam instead of bedGraph #1

Closed pastaxingxin closed 1 year ago

pastaxingxin commented 3 years ago

Thanks for your tool, and it really helps! I see your tool converts the bam files to bedgraph files while normalizing them based on the reads to the sacCer genome. I'm wondering how to converts the bam files to normalized bam instead of bedgraph. Please advise if possible. Thanks.

mbassalbioinformatics commented 3 years ago

hey as far as im aware there is no way to normalize the bam files, they are what they are.

What you can do though is convert the normalized bedgraphs to bam files (https://bedtools.readthedocs.io/en/latest/content/tools/bedtobam.html) but doing so you loose all the CIGAR information from the original bams as that's not carried over to the bedgraph conversion so its not an ideal solution if were honest.

If you want to view the normalized begraphs, you can convert them to coverage bigwig files using (https://www.encodeproject.org/software/bedgraphtobigwig/).

The best solution will come down to what is your final goal really?

pastaxingxin commented 3 years ago

Thanks for your detailed reply. I see your script "Calling SEACR on Nuclear files" and "Calling SEACR on Regular files", I am not sure the meanings of nuclear files and regular files. I suppose nuclear files are normalized target bedgraph like human H3K4me1, and regular files are input bedgraph like igG, right?

Yes, after using your script, we could get peak bed files based on SEACR. However, I do have difficulty doing identifying differential peaks. ①Diffbind requires me to provide peak files (in my case, H3K4me3) and two bam files (in my case, H3K4me3 bam and IgG bam) for one entry. After your scripts, I have four bam files (hg38 H3K4me3, saccer3 H3K4me3, hg38 IgG and saccer3 IgG), so I don't know how to integrate these four bams into one H3K4me3 and one IgG as input in Diffbind.

Another way is identifying differential peaks based on bedgraph files, and SICER2(https://zanglab.github.io/SICER2/) may work in this scenario. For example, I wanted to identify differential H3K4me3 between HEK293 and K562. After your scripts, I got four normalized bed files after scaling by saccer3, including H3K4me3 Hek293 bed, IgG Hek293 bed, H3K4me3 K562 bed, IgG K562 bed. However, I realized sicer_df in SICER2 only needs two groups of bed files as input ("sicer_df -t treatment1.bed treatment2.bed -c control1.bed control2.bed -s hg38"). However, I have four groups of bed files if I take IgG into account. So, I am stuck in how to identify differential H3K4me peaks between cell lines. Could you provide any solutions? Thank you very much!

mbassalbioinformatics commented 3 years ago

Thanks for your detailed reply. I see your script "Calling SEACR on Nuclear files" and "Calling SEACR on Regular files", I am not sure the meanings of nuclear files and regular files. I suppose nuclear files are normalized target bedgraph like human H3K4me1, and regular files are input bedgraph like igG, right?

I see the predicament here and thats my bad for not naming them properly. The 'regular' files here are actually whole cell extract (WCE) rather than just nuclear extracts. Other than that, the files are the same. In our original investigations we tried both WCE and nuclear extracts to see which will work better.

Yes, after using your script, we could get peak bed files based on SEACR. However, I do have difficulty doing identifying differential peaks. ①Diffbind requires me to provide peak files (in my case, H3K4me3) and two bam files (in my case, H3K4me3 bam and IgG bam) for one entry. After your scripts, I have four bam files (hg38 H3K4me3, saccer3 H3K4me3, hg38 IgG and saccer3 IgG), so I don't know how to integrate these four bams into one H3K4me3 and one IgG as input in Diffbind.

Another way is identifying differential peaks based on bedgraph files, and SICER2(https://zanglab.github.io/SICER2/) may work in this scenario. For example, I wanted to identify differential H3K4me3 between HEK293 and K562. After your scripts, I got four normalized bed files after scaling by saccer3, including H3K4me3 Hek293 bed, IgG Hek293 bed, H3K4me3 K562 bed, IgG K562 bed. However, I realized sicer_df in SICER2 only needs two groups of bed files as input ("sicer_df -t treatment1.bed treatment2.bed -c control1.bed control2.bed -s hg38"). However, I have four groups of bed files if I take IgG into account. So, I am stuck in how to identify differential H3K4me peaks between cell lines. Could you provide any solutions? Thank you very much!

For differential peak binding, I wrote some custom scripts for that but at the time I wasn't all that familiar with HOMER. Now what I'd suggest is HOMER's mergePeaks (http://homer.ucsd.edu/homer/ngs/mergePeaks.html) which would be MUCH easier to use with the output of CnRAP than DiffBind of SICER2 also with much easier options to control the merge. With mergePeaks you'd have to reformat the output peak files (very easy to be done) but then HOMER will take any number of files and can give you all the output combinations of files. You can also control the merge window, so either exact or within a certain range.

So for what you want, I'd suggest mergePeaks and go from there. I think you will find this is hte easier way to get the result you want.

I will say though, we are currently working on a completely revamped C&R pipeline which will be much better than CnRAP. Its in the final stages of development and hopefully will be ready for public use in the next couple of months.

pastaxingxin commented 3 years ago

Thanks for your suggestions. I will try it soon and let you know :).