heathsc / gemBS

gemBS is a bioinformatics pipeline designed for high throughput analysis of DNA methylation from Whole Genome Bisulfite Sequencing data (WGBS).
GNU General Public License v3.0
32 stars 21 forks source link

Can it be used for amplification based targeted methylation analysis? #39

Open vrbacky opened 5 years ago

vrbacky commented 5 years ago

Hi,

gemBS seems to be a very interesting toolkit. My wet lab colleagues use targeted approach - they amplify specific regions of bisulfite converted DNA and analyze it using amplicon sequening on Illumina MiSEQ. I came to their data rather by an accident because they were unable to use their methods.

I've tried to analyze their data using gemBS. I can map the data without any problem - paired end data are mapped to a sequence used as a reference. Almost all reads are mapped and are reported as correct pairs in gemBS mapping report. Samtools flagstat reports them as properly paired too. But I get a lot of warning and errors during methylation and variant calling. All reads are reported as follows in calling err file:

[W::vcf_parse] Contig 'Warning not found: M03647:172:000000000-D4RY5:1:1102:4243:8025 4858 4863 +' is not defined in the header. (Quick workaround: index the file with tabix.)
...
[E::bcf_write] Broken VCF record, the number of columns at Warning not found: M03647:172:000000000-D4RY5:1:1102:4243:8025 4858 4863 +:1 does not match the number of samples (0 vs 1)
...

I can avoid the messages by changing keep_improper_pairs option to False but I get all reads as PairNotFound in calling json log.

Is there any way how to analyze such data?

Thank you very much.

My config is:

# Required section
#
# Note that the index and contig_sizes files are generated from the
# reference file if they do not already exist
#

reference = reference/ctnnd2.fa

#
# This is for the control sequences.  The contigs here will
# be used for mapping, but will not be passed to the caller
#
# extra_references = reference/conversion_control.fa.gz

index_dir = indexes

#
# The variables below define the directory structure for the results files
# This structure should not be changed after the analysis has started
#

base = .
sequence_dir = ${base}/fastq
bam_dir = ${base}/mapping/@BARCODE
bcf_dir = ${base}/calls/@BARCODE
extract_dir = ${base}/extract/@BARCODE
report_dir = ${base}/report

#
# End of required section
#

# The following are optional

project = gembs_test_ctnnd2
species = human

threads = 20
jobs = 6

[index]

sampling_rate = 4

[mapping]

non_stranded = False
remove_individual_bams = True

#underconversion_sequence = NC_001416.1 
#overconversion_sequence = NC_001604.1

[calling]

mapq_threshold = 10
qual_threshold = 13
reference_bias = 2
left_trim = 5
right_trim = 0
keep_improper_pairs = True
keep_duplicates = True
haploid = False
conversion = auto
remove_individual_bcfs = True

# Contigs smaller than contig_pool_limit will be called together
contig_pool_limit = 25000000

[extract]

strand_specific = True
phred_threshold = 10
make_cpg = True
make_non_cpg = True
make_bedmethyl = True
make_bigwig = True
heathsc commented 5 years ago

Thanks for the bug report. The original error was caused by a warning message that was finding its way into the vcf output causing problems downstream. I have fixed this, but the problem remains that for some reason bscall is having problems finding both members of a pair. Is there any chance that you could share a part of your input bam file so that I can see what is happening?

heathsc commented 5 years ago

You can send the file directly to my email address to avoid sharing it with everyone... (simon.heath@gmail.com)

vrbacky commented 5 years ago

Sure. Thank you very much. More by email.