waterlandlab / CluBCpG

Cluster-based analysis of CpG methylation
https://clubcpg.readthedocs.io/
MIT License

Not able to reproduce results for SampleData #12

Status: Closed (pelutz closed 3 years ago)

pelutz commented 3 years ago

Describe the bug

I installed CluBCpG in a Conda environment on a Linux server and was able to run test_Module.py successfully:

[image: screenshot of the test run]

But when I apply the clubcpg-coverage command to the A_test.chr19.bam file, I get a different output (with 222 lines) from the one available on GitHub (at: https://github.com/waterlandlab/CluBCpG/tree/master/SampleData/COVERAGE/CompleteBins.A_test.chr19.bam.chr19.csv; this file has 562 lines), with some missing bins and different numbers of reads or even CpGs for some bins:

chr19_3079700,2,3
chr19_3079800,13,2
chr19_3080000,2,8
chr19_3080100,16,1
chr19_3080200,5,8
chr19_3080300,16,1
chr19_3080400,24,1
chr19_3080500,5,1
chr19_3080800,4,1
chr19_3081300,12,1

I see 2 possible explanations:

1) CluBCpG does not interact properly with samtools in my installation. Does the test_Module evaluate this interaction?
2) The SampleData and COVERAGE files on GitHub do not match?

Thanks in advance for your help, PE

To Reproduce

clubcpg-coverage -a /b/home/path/CluBCpG/SampleData/A_test.chr19.bam -o /b/home/path/tests/ --bin_size 100 -chr chr19 --read1_5 0 --read1_3 0 --read2_5 0 --read2_3 0
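For anyone wanting to pin down this kind of discrepancy, the comparison described above (missing bins, differing counts per bin) can be sketched in a few lines of Python. This is only an illustration, not part of CluBCpG: the helper names `load_bins` and `compare_bins` are hypothetical, and it assumes each CSV row starts with the bin name followed by its counts, as in the rows quoted above. The demo rows at the bottom are made up for illustration and are not real output.

```python
import csv
import io


def load_bins(handle):
    """Read a coverage CSV and map each bin name to its remaining columns."""
    bins = {}
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        bins[row[0]] = tuple(row[1:])
    return bins


def compare_bins(mine, reference):
    """Return (missing, extra, changed) between two bin dictionaries.

    missing: bins only in the reference file
    extra:   bins only in the locally generated file
    changed: bins present in both whose counts differ,
             mapped to (local_counts, reference_counts)
    """
    missing = sorted(set(reference) - set(mine))
    extra = sorted(set(mine) - set(reference))
    changed = {
        name: (mine[name], reference[name])
        for name in sorted(set(mine) & set(reference))
        if mine[name] != reference[name]
    }
    return missing, extra, changed


if __name__ == "__main__":
    # Made-up rows standing in for two CSV files; with real files,
    # pass open(path, newline="") instead of io.StringIO(...).
    local = load_bins(io.StringIO("binA,2,3\nbinB,13,2\n"))
    repo = load_bins(io.StringIO("binA,2,3\nbinB,12,2\nbinC,5,1\n"))
    missing, extra, changed = compare_bins(local, repo)
    print("missing from local output:", missing)
    print("only in local output:", extra)
    print("different counts:", changed)
```

If both files carry the same header row, it is simply treated as one more identical entry and does not show up in the report.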

canthonyscott commented 3 years ago

Thank you! This is a good catch. On March 26th, 2020 (about 11 months ago now) we released a new version that contained a bug fix. This fix improved CpG calling in bins with certain types of reads. The sample data in the repo was last updated 15 months ago, so it was generated before this fix and will contain differences.

Thank you for pointing this out. I will update this sample data and add it to the repo. The results you are getting when you run the data should be correct. The sample data is just old.
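Since the mismatch came down to which release produced the files, it may help to record the installed version next to any generated output. A minimal sketch using only the Python standard library (3.8+); the distribution name `clubcpg` is an assumption here and may differ in your environment, and `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "clubcpg" is an assumed distribution name; adjust if needed.
    print("clubcpg version:", installed_version("clubcpg"))
```

A result of `None` means no distribution with that name is installed in the active environment.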

canthonyscott commented 3 years ago

Sample data has been updated on the master branch. If you find another discrepancy please feel free to comment here or open a new issue.

pelutz commented 3 years ago

Great, thanks for the fast response! I can confirm that we now have exactly identical results, for both the coverage and clustering functions.