waterlandlab / CluBCpG

Cluster-based analysis of CpG methylation
https://clubcpg.readthedocs.io/
MIT License
10 stars 6 forks source link

Bug in clustering with 150-bp bins #17

Closed mohamadysn closed 2 years ago

mohamadysn commented 2 years ago

We are using BAM files that were query name sorted, deduplicated using Picard, and then coordinate sorted and indexed using samtools. Since our data were generated using 150bp paired-end sequencing, we’re interested in using CluBCpG to analyze bins of 150bp, possibly allowing for more complex patterns. However, comparisons of results using bins of 100bp versus 150bp bins yields surprising results for 150bp. Our data have an average coverage of just below 10x, and we filter & keep bins with >= 5 reads and >= 2 cpgs per bin. In the csv generated by the clubcpg-cluster function:

Results clustering 100bp: bin,input_label,methylation,class_label,read_number,cpg_number,cpg_pattern,class_split chr19_3079000,EMseq_FM1_1M_reads_chr19.bam,1.0,0,7,2,1;1,EMseq_FM1_1M_reads_chr19.bam=7 chr19_3079600,EMseq_FM1_1M_reads_chr19.bam,1.0,0,7,2,1;1,EMseq_FM1_1M_reads_chr19.bam=7 chr19_3079700,EMseq_FM1_1M_reads_chr19.bam,1.0,0,5,5,1;1;1;1;1,EMseq_FM1_1M_reads_chr19.bam=5 chr19_3079800,EMseq_FM1_1M_reads_chr19.bam,0.5,0,2,2,1;0,EMseq_FM1_1M_reads_chr19.bam=2 chr19_3079800,EMseq_FM1_1M_reads_chr19.bam,1.0,1,3,2,1;1,EMseq_FM1_1M_reads_chr19.bam=3 chr19_3080000,EMseq_FM1_1M_reads_chr19.bam,1.0,0,6,8,1;1;1;1;1;1;1;1,EMseq_FM1_1M_reads_chr19.bam=6 chr19_3083700,EMseq_FM1_1M_reads_chr19.bam,1.0,0,4,8,1;1;1;1;1;1;1;1,EMseq_FM1_1M_reads_chr19.bam=4 chr19_3084300,EMseq_FM1_1M_reads_chr19.bam,0.0,0,4,2,0;0,EMseq_FM1_1M_reads_chr19.bam=4 chr19_3084300,EMseq_FM1_1M_reads_chr19.bam,1.0,1,2,2,1;1,EMseq_FM1_1M_reads_chr19.bam=2

Results clustering 150bp: bin,input_label,methylation,class_label,read_number,cpg_number,cpg_pattern,class_split chr19_3094650,EMseq_FM1_1M_reads_chr19.bam,1.0,1,6,2,1;1,EMseq_FM1_1M_reads_chr19.bam=6 chr19_3096000,EMseq_FM1_1M_reads_chr19.bam,1.0,0,7,1,1,EMseq_FM1_1M_reads_chr19.bam=7 chr19_3096000,EMseq_FM1_1M_reads_chr19.bam,0.0,1,4,1,0,EMseq_FM1_1M_reads_chr19.bam=4 chr19_3096750,EMseq_FM1_1M_reads_chr19.bam,1.0,0,6,2,1;1,EMseq_FM1_1M_reads_chr19.bam=6 chr19_3099150,EMseq_FM1_1M_reads_chr19.bam,1.0,0,8,2,1;1,EMseq_FM1_1M_reads_chr19.bam=8 chr19_3100500,EMseq_FM1_1M_reads_chr19.bam,1.0,0,7,1,1,EMseq_FM1_1M_reads_chr19.bam=7 chr19_3102450,EMseq_FM1_1M_reads_chr19.bam,1.0,0,4,3,1;1;1,EMseq_FM1_1M_reads_chr19.bam=4 chr19_3106200,EMseq_FM1_1M_reads_chr19.bam,1.0,0,4,2,1;1,EMseq_FM1_1M_reads_chr19.bam=4 chr19_3111150,Emseq_FM1_1M_reads_chr19.bam,1.0,0,7,1,1,Emseq_FM1_1M_reads_chr19.bam=7

You can find in the directory with the link below, the bam and the index files, the csv files files obtained after using clubcpg if you want to see more examples of this bug with 150-bp, and a short file named "To_reproduce.txt" contains the script that I used to run clubcpg. With the bam file in the directory, it takes less than 10 min to reproduce it with clubcpg (coverage and cluster).

Bug_clubCpG

canthonyscott commented 2 years ago

Thanks for the bug report. I spot checked the first bin in your provided example files.

chr19_3079050: This region represents genomic region chr19:3078900-3079050. I went and examined this region of the mm10 genome and did indeed find that there should be 2 CpGs in this region. I will look into this issue as I can and post any updates here.

Thanks!

canthonyscott commented 2 years ago

I have identified a bug where the desired bin size was not being correctly passed to the cluster command. It was running on a bin size of 100 regardless of what we being passed via the command line. I am implementing a fix now

canthonyscott commented 2 years ago

A fix has been released with PR #18 . @mohamadysn please update your install of clubcpg to v0.2.5. You can use the following commands.

pip install --upgrade clugcpg or pip install clubcpg==0.2.5

Please open a new issue if any additional bugs are identified. Thanks again for catching this!

mohamadysn commented 2 years ago

Thanks for the fast response!

I can confirm that the bug is solved with this new version.

Thank you!