sartorlab / methylSig

R package for DNA methylation analysis

clarification on local window size #45

Closed shawpa closed 4 years ago

shawpa commented 4 years ago

I am a bit confused as to the definition of "local window size" in the diff_methylsig function. In the manual it states:

"an integer indicating the size of the window for use in determining local information to improve mean and dispersion parameter estimations. In addition to the distance constraint, a maximum of 5 loci upstream and downstream of the locus are used. The default is 0, indicating no local information is used."

Does this mean that it can only process values 0-5? In an older version of the program it was based on base pairs and the default was 200. Can you provide any additional information on the advantages of changing the default from 0? If you have loci that are very far apart, is there a maximum distance within which it will consider "nearest loci"?

I ask because I am getting very strange mean case/control methylation values: swings of far more than a couple of percentage points (50+), while my percent methylation matrix is very consistent across close loci.

Thank you,

Annie

rcavalcante commented 4 years ago

Hi,

> Does this mean that it can only process values 0-5?

No, it accepts any integer (signifying the bp size of the window), but the additional constraint limits the number of loci to 5 upstream and 5 downstream, regardless of the window size.
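To make the two constraints concrete, here is a minimal Python sketch of how neighbors could be selected under them. It is not methylSig's code: the function name `local_neighbors`, the `max_per_side` parameter, and the choice to treat the window size as the maximum distance on each side of the tested locus are all assumptions made for illustration.

```python
def local_neighbors(positions, index, window_size, max_per_side=5):
    """Return indices of loci near positions[index].

    Assumes `positions` is sorted and comes from a single chromosome.
    A neighbor must lie within `window_size` bp of the tested locus
    (assumed here to mean distance on each side), and at most
    `max_per_side` neighbors are taken upstream and downstream.
    """
    if window_size <= 0:
        # A window of 0 means no local information is used.
        return []

    center = positions[index]

    # Walk upstream (toward smaller coordinates) until a constraint stops us.
    upstream = []
    i = index - 1
    while i >= 0 and len(upstream) < max_per_side and center - positions[i] <= window_size:
        upstream.append(i)
        i -= 1

    # Walk downstream (toward larger coordinates) in the same way.
    downstream = []
    j = index + 1
    while j < len(positions) and len(downstream) < max_per_side and positions[j] - center <= window_size:
        downstream.append(j)
        j += 1

    return sorted(upstream) + downstream


# Hypothetical CpG coordinates: a dense cluster plus one isolated locus.
positions = [100, 120, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 5000]

print(local_neighbors(positions, 6, 200))   # dense region: the 5-per-side cap binds
print(local_neighbors(positions, 6, 25))    # small window: the distance constraint binds
print(local_neighbors(positions, 12, 200))  # isolated locus: no neighbors at all
```

So any integer is accepted as the window size, but in a dense region the per-side cap decides how many loci are used, not the window.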

This constraint has been in the package since at least v0.4.1 (see here). Note, however, that as implemented by the original author, it did not actually allow 5 loci upstream and 5 downstream, but 5 in total.

> Can you provide any additional information on the advantages of changing the default from 0?

The advantage of using something other than 0 is the possibility of better estimating the dispersion of the methylation at a locus. However, this is based on the assumption that nearby loci have correlated methylation rates. It has been observed that methylation is often correlated locally, on the scale of hundreds of base pairs.

> If you have loci that are very far apart, is there a maximum distance within which it will consider "nearest loci"?

If a very large window size is selected, say 10kb, then for most loci you will hit the limit of 5 loci upstream/downstream before you reach 10kb. This varies from locus to locus, though, because CpGs are unevenly distributed across the genome.
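As a rough illustration of how the effective distance varies by locus, the Python sketch below reports how far upstream and downstream the selection actually reaches once both constraints are applied. The function `effective_reach`, its parameters, and the coordinates are hypothetical, and it assumes the window size is a per-side distance; it is not taken from methylSig.

```python
import bisect


def effective_reach(positions, index, window_size, max_per_side=5):
    """Return (upstream_bp, downstream_bp) actually covered for one locus.

    Assumes `positions` is sorted, unique, and from a single chromosome.
    """
    center = positions[index]

    # Index range allowed by the distance constraint alone.
    lo = bisect.bisect_left(positions, center - window_size)
    hi = bisect.bisect_right(positions, center + window_size)

    # Tighten that range with the per-side cap on the number of loci.
    first = max(lo, index - max_per_side)
    last = min(hi - 1, index + max_per_side)

    return center - positions[first], positions[last] - center


# Hypothetical data: 20 CpGs spaced 10 bp apart, then two distant loci.
positions = list(range(1000, 1200, 10)) + [9000, 15000]

print(effective_reach(positions, 10, 10_000))  # inside the dense cluster
print(effective_reach(positions, 20, 10_000))  # the sparse locus at 9000
```

With these made-up coordinates, a 10kb window reaches only 50 bp on each side inside the dense cluster, because the cap of 5 loci is hit first, while the sparse locus pulls in loci thousands of base pairs away.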

> I ask because I am getting very strange mean case/control methylation values: swings of far more than a couple of percentage points (50+), while my percent methylation matrix is very consistent across close loci.

When local_window_size is nonzero, the loci satisfying the conditions above are used to estimate not only the dispersion but also the methylation rate. This aspect of the implementation has tended to cause confusion.

As an example, say I'm testing locus A for differential methylation and it has 8 loci within the local window I've specified (4 upstream and 4 downstream). The methylation rate reported for locus A will actually be the combined methylation of all 9 loci, subject to the weight function with the tested locus at the center. So the count data for locus A alone will not match what methylSig reports. And if it turns out that the local loci are not correlated with locus A, you may see very strange methylation rates compared to the count data.
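The effect can be sketched with a simple kernel-weighted pooled proportion in Python. This is only an illustration: the triweight-style kernel, the function names, and the counts are assumptions, the cap of 5 loci per side is left out, and methylSig's actual estimate comes from its own likelihood-based model rather than this formula.

```python
def triweight(u):
    """Kernel weight that is 1 at the tested locus and falls to 0 at |u| = 1."""
    return (1 - u ** 2) ** 3 if abs(u) < 1 else 0.0


def local_methylation(positions, meth_counts, total_counts, index, window_size):
    """Weighted pooled methylation rate centered on positions[index].

    Assumes window_size > 0. Each locus contributes its methylated and
    total read counts, down-weighted by its distance from the tested locus.
    """
    center = positions[index]
    weighted_meth = 0.0
    weighted_total = 0.0
    for pos, meth, total in zip(positions, meth_counts, total_counts):
        w = triweight((pos - center) / window_size)
        weighted_meth += w * meth
        weighted_total += w * total
    return weighted_meth / weighted_total


# Hypothetical counts: locus A (index 2) is lowly methylated,
# while its four neighbors are almost fully methylated.
positions = [100, 150, 200, 250, 300]
meth_counts = [18, 19, 2, 20, 19]
total_counts = [20, 20, 20, 20, 20]

raw_rate = meth_counts[2] / total_counts[2]
pooled_rate = local_methylation(positions, meth_counts, total_counts, 2, 200)

print(f"raw rate at locus A:    {raw_rate:.2f}")
print(f"pooled rate at locus A: {pooled_rate:.2f}")
```

In this made-up case the raw rate at locus A is 10%, but the pooled value lands near 70%, which is the kind of 50+ point swing that appears when the neighbors are not correlated with the tested locus.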

I've discussed this with the authors of the paper, and in the next Bioconductor release we will have additional flags indicating whether to use local information just for the dispersion calculation or for both dispersion and methylation. The documentation will explain the possibility of mismatches with the count data.

Hope that helps,

Raymond