smithlabcode / dnmtools

Tools for analyzing DNA methylation data
https://dnmtools.readthedocs.io
GNU General Public License v3.0

roi -M returns different values for "number of CpGs in the region" for different count files #147

Closed by moqri 1 year ago

moqri commented 1 year ago

Describe the bug: roi -M returns different values for "number of CpGs in the region" when given different count files.

To Reproduce: Run roi -M on the same HMR file (containing a single HMR region) using two different count files.

Expected behavior: The same value in column 10 for both runs, since the number of CpGs in a region should depend only on the HMR region (if I understand correctly).

Desktop: Linux 7

Additional context: dnmtools/1.2.2

moqri commented 1 year ago

Digging deeper, I think I found the source of the issue:

It seems that roi counts the CpGs in each region as they are reported in the count file, not as they occur in the reference genome. If a CpG is missing from the count file, roi does not include it in column 10. Maybe a clarification in the docs would be helpful?
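To make the behavior described above concrete, here is a minimal Python sketch. It does not use dnmtools itself; the site lists, the region, and the helper `sites_in_region` are hypothetical toy values chosen for illustration. It only shows why counting sites from the count file, rather than from the reference, can give different totals for the same region.

```python
def sites_in_region(sites, chrom, start, end):
    """Count (chrom, pos) sites that fall in the half-open interval [start, end)."""
    return sum(1 for c, p in sites if c == chrom and start <= p < end)


# Hypothetical CpG positions in the reference for one region.
reference_cpgs = [("chr1", 100), ("chr1", 120), ("chr1", 150), ("chr1", 180)]

# Count file A reports every reference CpG, including uncovered ones.
counts_a = list(reference_cpgs)

# Count file B omits the CpGs that had no data.
counts_b = [("chr1", 100), ("chr1", 150)]

# A single hypothetical HMR region.
region = ("chr1", 90, 200)

n_a = sites_in_region(counts_a, *region)
n_b = sites_in_region(counts_b, *region)

# Same region, different totals, because the totals come from the count files.
print(n_a, n_b)
```

In this toy case the region contains four reference CpGs, but the second count file yields a total of two because two sites were never reported.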

moqri commented 1 year ago

Solution for others who might encounter the same issue when using count data from other sources:

If you are creating your count data with wgbs_tools beta2bed, use "--keep_na" for consistency with dnmtools.
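Before comparing column 10 across samples, it may help to check whether two count files report the same set of sites. The sketch below is only an illustration: it assumes whitespace-delimited lines with the chromosome in the first field and the position in the second, and the helper `read_sites` and the example lines are hypothetical.

```python
def read_sites(lines):
    """Collect (chrom, pos) pairs from count-file lines, skipping blanks and comments."""
    sites = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Assumption: field 0 is the chromosome, field 1 is the position.
        sites.add((fields[0], int(fields[1])))
    return sites


# Hypothetical example lines; in practice these would be read from the two files.
file_a = [
    "chr1\t100\t+\tCpG\t0.8\t10",
    "chr1\t120\t+\tCpG\t0\t0",
    "chr1\t150\t+\tCpG\t0.5\t4",
]
file_b = [
    "chr1\t100\t+\tCpG\t0.7\t12",
    "chr1\t150\t+\tCpG\t0.4\t5",
]

# Sites reported in A but absent from B.
missing_from_b = sorted(read_sites(file_a) - read_sites(file_b))
print(missing_from_b)
```

If the list is non-empty, the two files do not cover the same sites, and per-region CpG totals derived from them are not expected to match.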

andrewdavidsmith commented 1 year ago

Thanks for this, @moqri. It is indeed something we would want to clarify. We will also try to put safeguards in place for such cases.