hippo-yf / bsgenova

An accurate, robust, and fast genotype caller for bisulfite-sequencing data
GNU General Public License v3.0
1 stars 2 forks source link

Any Software recommendation for mapping/processing? #1

Closed desmodus1984 closed 1 month ago

desmodus1984 commented 3 months ago

Hi,

I am analyzing methylation data- finding differentially methylated sites/regions, and I wanted to inquire about suggestions whether my input is good or if I should preprocess with better software. My collaborator enzyme-converted DNA with the NEBNext® Enzymatic Methyl-seq Kit](https://www.neb.com/en-us/products/e7120-nebnext-enzymatic-methyl-seq-kit, and sequenced with Nova-seq. I removed adapters with trimgalore!, and then mapped the reads with bwa-meth, bam converted and sorted, and then used Picard to eliminate duplicate reads.

According to two papers, the bayesian model of cgmaptools is the suggested tool for non-model organisms. Hence I used the sorted-duplicate-removed bam files and generated the ATCGmap files and did the variant calling. My surprise was that the vcf file had degenerated bases.

Thus, I wanted first to ask if the sorted-deduplicated bam files could be used as input or if I should use the ATCGmap generated from them.

Thanks;

hippo-yf commented 3 months ago

1, using the sorted-deduplicated bam files as input to generate the ATCGmap files is ok 2, use ATCGmap files in cgmaptools/bsgenova for variant calling 3, "My surprise was that the vcf file had degenerated bases.", you mean base wildcards in the vcf file generated by cgmaptools ? 4, if you are interested in differentially methylated sites/regions, variant calling is optional 5, according to my experience, model organism or not do not affact too much to choose DMR software, 6, https://pubmed.ncbi.nlm.nih.gov/36147684/ for an evaluation of mapper

desmodus1984 commented 3 months ago

Hi What I meant in degenerated bases is that when I base-called the ATCGmap file with cgmaptools, I got in all cases degenerate bases in the alternative allele called, which I was shocked, and totally unexpected.

For instance: V00001

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001

NW_022882922.1 28895 . C T 0 PASS NS=1:DP=52 GT:GQ:DP 0/1:0:52 NW_022882922.1 36586 . C T,Y 0 PASS NS=1:DP=23:GU=T/C GT:GQ:DP 1/2:0:23 NW_022882922.1 36640 . G A 0 PASS NS=1:DP=40 GT:GQ:DP 1/1:0:40 NW_022882922.1 39071 . A G 0 PASS NS=1:DP=43 GT:GQ:DP 1/1:0:43 NW_022882922.1 39625 . G C 0 PASS NS=1:DP=43 GT:GQ:DP 1/1:0:43 NW_022882922.1 40063 . G A 0 PASS NS=1:DP=31 GT:GQ:DP 0/1:0:31 NW_022882922.1 40069 . G C 0 PASS NS=1:DP=29 GT:GQ:DP 0/1:0:29 NW_022882922.1 43671 . A C 0 PASS NS=1:DP=9 GT:GQ:DP 0/1:0:9 NW_022882922.1 46989 . G A 0 PASS NS=1:DP=18 GT:GQ:DP 0/1:0:18 NW_022882922.1 51219 . T C 0 PASS NS=1:DP=47 GT:GQ:DP 0/1:0:47 NW_022882922.1 54317 . A T 0 PASS NS=1:DP=46 GT:GQ:DP 0/1:0:46 NW_022882922.1 57937 . T C 0 PASS NS=1:DP=24 GT:GQ:DP 0/1:0:24 NW_022882922.1 58214 . C T,Y 0 PASS NS=1:DP=15:GU=T/C GT:GQ:DP 1/2:0:15 NW_022882922.1 61672 . C T,Y 0 PASS NS=1:DP=15:GU=T/C GT:GQ:DP 1/2:0:15 NW_022882922.1 64048 . G A,R 0 PASS NS=1:DP=25:GU=A/G GT:GQ:DP 1/2:0:25 NW_022882922.1 64162 . A G,R 0 PASS NS=1:DP=24:GU=A/G GT:GQ:DP 1/2:0:24 NW_022882922.1 64325 . C T 0 PASS NS=1:DP=27 GT:GQ:DP 0/1:0:27 NW_022882922.1 65311 . G A 0 PASS NS=1:DP=29 GT:GQ:DP 0/1:0:29 NW_022882922.1 65957 . T G,R 0 PASS NS=1:DP=16:GU=A/G GT:GQ:DP 1/2:0:16

V00021

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001

NW_022882922.1 25160 . G Y 0 PASS NS=1:DP=34:GU=T/C GT:GQ:DP 0/1:0:34 NW_022882922.1 25676 . T C 0 PASS NS=1:DP=41 GT:GQ:DP 0/1:0:41 NW_022882922.1 28342 . G A,R 0 PASS NS=1:DP=35:GU=A/G GT:GQ:DP 1/2:0:35 NW_022882922.1 29887 . C A 0 PASS NS=1:DP=48 GT:GQ:DP 0/1:0:48 NW_022882922.1 36640 . G A 0 PASS NS=1:DP=39 GT:GQ:DP 0/1:0:39 NW_022882922.1 37240 . T A 0 PASS NS=1:DP=27 GT:GQ:DP 0/1:0:27 NW_022882922.1 37707 . G A 0 PASS NS=1:DP=43 GT:GQ:DP 0/1:0:43 NW_022882922.1 39625 . G C 0 PASS NS=1:DP=42 GT:GQ:DP 0/1:0:42 NW_022882922.1 40063 . G A 0 PASS NS=1:DP=41 GT:GQ:DP 0/1:0:41 NW_022882922.1 47037 . A Y 2 PASS NS=1:DP=20:GU=T/C GT:GQ:DP 0/1:4:20 NW_022882922.1 47046 . T G,Y 0 PASS NS=1:DP=21:GU=T/C GT:GQ:DP 1/2:0:21 NW_022882922.1 49125 . G A 0 PASS NS=1:DP=56 GT:GQ:DP 0/1:0:56 NW_022882922.1 55379 . C A,Y 0 PASS NS=1:DP=11:GU=T/C GT:GQ:DP 1/2:0:11 NW_022882922.1 57937 . T C 0 PASS NS=1:DP=28 GT:GQ:DP 0/1:0:28 NW_022882922.1 58214 . C T 0 PASS NS=1:DP=13 GT:GQ:DP 0/1:0:13 NW_022882922.1 61672 . C T,Y 0 PASS NS=1:DP=16:GU=T/C GT:GQ:DP 1/2:0:16

Thanks;

Juan

hippo-yf commented 2 months ago

as i known, the Y, R's are wildcards of bases, when the genotype can not be toally determined.
see Figure 4.3 in https://cgmaptools.github.io/cgmaptools_documentation/snv-calling.html image