smithlabcode / dnmtools

Tools for analyzing DNA methylation data
https://dnmtools.readthedocs.io
GNU General Public License v3.0
25 stars 8 forks source link

hmr requires symmetric data, fails on lifted over .meth files #43

Open faulk-lab opened 1 year ago

faulk-lab commented 1 year ago

I have perhaps an unusual use case. I am using whole genome methylation data from multiple closely related vertebrates. In order to compare apples-to-apples, I need a common coordinate system so I lifted them all over to human hg38. Now I want to run them through hmr (and dmr and several other of your wonderful tools!). However hmr is complaining that error: input is not symmetric-CpGs: .

My data is inherently symmetric because I specified merged symmetric coverage when I generated the bed file (that I converted to .meth with an awk script). My original data is from nanopore, and counts generated with modkit. It was direct methylation calling, not bisulfite, but once it has been converted to .meth format, I figured it didn't matter anymore.

Anyway, dnmtools hmr monkey.meth works fine, but after I lift it over to hg38, and filtered the lines with dnmtools liftfilter I tried dnmtools hmr monkey.hg38.meth and it fails with the symmetric problem. I wrote a script to remove any lines that "looked symmetric" and still no joy. Ideas?

faulk-lab commented 1 year ago

Hmm I think my filtering script is not doing a good job of actually removing "symmetric looking" sites. Trying to figure out a way to deal with them. Why is it necessary for dnmtools hmr to require non-symmetric looking input anyway? I imagine it's statistically superior to merge the symmetric sites, but in my case, that's probably not biologically meaningful, in lifted-over data for example.

andrewdavidsmith commented 1 year ago

The sites need to be CpG sites, or the concept of HMRs wouldn't apply. But you are correct that in theory they don't need to be "symmetric" in the file anymore. There's now only a historical artifactual reason that check exists. But I can't just disable it completely right now. If you can build the tools from source, I can probably make a branch relatively quickly that disables that particular check. Also, if you could point me to any public data sets that are sufficiently similar to your own, I could test it myself. Either way, we can find out if it even makes sense to do that analysis on your data.

faulk-lab commented 1 year ago

Thanks, I can probably build from source if you make a test branch. The sites are definitely CpG only. That's the only ones I called from the original sequencing run. Here's a partial dataset I am working with, just chromosome 1 of a macaque that has been lifted over to hg38. https://drive.google.com/file/d/1VvaaKMgvO1RULYvvYSI5Sv9WmwegcwSu/view?usp=drive_link

andrewdavidsmith commented 1 year ago

@faulk-lab I just made a branch hmr_warn_nonsymmetric_cpgs (poor choice of branch name) that should do what you are asking for. It includes a flag that will allow the program to continue running even if it sees 2 CpG sites at a distance of 1 from each other. I tested it on a file that had a single such site, and it worked. But it's not really fully tested. If the CpGs are symmetric, it should do the same thing as before. If the CpGs include some at a distance of 1, then it should continue just fine, but I don't know what it will do if they are all like that. It's possible that the statistical model will be impacted. I don't have time right now to fully verify that.

andrewdavidsmith commented 1 year ago

@faulk-lab @moqri I'm adding information here from another issue opened on the same topic:

Describe the bug We are using a data CpG count data which is converted from another pipeline (so missing some CpGS) but overall symmetric. running hmr gives an error but hmr-rep works fine

To Reproduce source: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5663172 "beta" file is converted to bed file and lifted to hg38. the file is reformatted and sorted to match meth-count. then running hmr on this data will result in "error: input is not symmetric-CpGs"

However, running hmr-rep with two identical copies of the processed data works fine!

Expected behavior not clear if all (symmetric) CpGs are required for hmr to work or if there is another issue (e.g. two consecutive CpG coordinate?). not clear why hmr-rep works but hmr doesn't

Thanks for the great work!