perishky / meffil

Efficient algorithms for analyzing DNA methylation data.
Artistic License 2.0

Normalization function creates many NAs #51

Open rfael0cm opened 1 year ago

rfael0cm commented 1 year ago

Dear Matthew, The meffil.normalize.samples function is producing many NAs in more than one of my data sets. If I keep only the CpGs that have normalized values for all samples, I end up with only 477,376 probes. Is there a way to solve that? I do not remember running into this problem when I used the package previously. Thanks a lot! Rafael

perishky commented 1 year ago

Hi Rafael,

Sorry to hear that. NA values are due to probe values being identified as unreliable during the QC step, so the QC report would give some idea why they are being considered unreliable (e.g. low probe intensity, low bead count). The first step to understanding the change from an older version of the package to the most recent version would be to compare the most recent QC report for a dataset to a previous QC report for the same dataset. I would be happy to look at them if you could send them to me. It would also be useful to see the code used to perform QC and to generate the report.

Best, Matt

rfael0cm commented 1 year ago

Hey Matt, thank you for your answer. Aren't these bad CpGs removed by specifying them in the cpglist.remove variable in meffil.normalize.samples? According to the QC summary, I have only 996 bad probes (detection p-value and low bead number). Sorry for the misunderstanding; what I meant is that when I used an older version on a completely different data set, I did not see this problem. I wondered if it might be due to the newest versions or some changes. There is also the possibility that the two data sets I am using now are of terrible quality. If nobody else has complained about it, then I will check my data sets further. Please let me know if anything comes to mind that might explain this or fix it. Thanks a lot! Rafael

rfael0cm commented 1 year ago

Hey Matt, I just realized that when I previously used meffil.normalize.samples I had remove.poor.signal = F. My bad. I was wondering, how advisable is it to set the parameter to TRUE? It seems that many rows are lost if I set it to TRUE. Thanks a lot! Rafael

perishky commented 1 year ago

Hi Rafael,

I can't see any changes in the last couple of years that would affect the identification of 'bad' probes. It seems most likely that the quality of the new dataset is slightly lower than before. It's possible that you had low levels of DNA or highly fragmented DNA for some samples.

remove.poor.signal=T will set any 'bad' probe values to NA (bad = poor detection or low bead count). I've left it as a user option because some algorithms require matrices with no missing values (e.g. PCA). In these cases, you might want remove.poor.signal=F. However, if you will be running an EWAS, then it would probably make sense to set remove.poor.signal=T so that associations aren't influenced by low-quality methylation measurements. The meffil EWAS function can handle outlier values, so it's possible that outlier handling could be used to remove some low-quality measurements. However, it's probably best to remove probe values that are known to be low quality.

Matt
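
To make the trade-off above concrete, here is a minimal base R sketch (not code from this thread) of two ways to handle the NAs in a beta matrix. The matrix `beta` is a simulated stand-in for the output of meffil.normalize.samples, and the 10% missingness threshold `max.missing` is a hypothetical value chosen only for illustration.

```r
# Toy stand-in for a normalized beta matrix (rows = CpGs, columns = samples).
# In practice this would be the matrix returned by meffil.normalize.samples();
# the values and dimensions here are simulated for illustration only.
set.seed(1)
beta <- matrix(runif(10 * 20), nrow = 10, ncol = 20,
               dimnames = list(paste0("cg", 1:10), paste0("s", 1:20)))

# Mimic poor-signal measurements that have been set to NA.
beta["cg1", "s1"] <- NA    # one missing value (5% of samples)
beta["cg2", 1:10] <- NA    # half of the samples missing

# Fraction of missing values per probe and per sample.
probe.missing  <- rowMeans(is.na(beta))
sample.missing <- colMeans(is.na(beta))

# Option 1: keep only probes with a value in every sample
# (needed for methods that require a complete matrix, e.g. PCA).
beta.complete <- beta[probe.missing == 0, , drop = FALSE]

# Option 2: keep probes whose missingness is at or below a threshold.
# The 10% cut-off is a hypothetical choice, not a meffil default.
max.missing <- 0.10
beta.filtered <- beta[probe.missing <= max.missing, , drop = FALSE]

cat("probes with no missing values:", nrow(beta.complete), "\n")
cat("probes within the threshold:  ", nrow(beta.filtered), "\n")
```

Requiring a value in every sample discards a probe as soon as a single measurement is NA, which is why so many rows disappear. A per-probe threshold keeps more probes, but it only helps if the downstream analysis tolerates missing values. Inspecting `sample.missing` can also show whether a few low-quality samples account for most of the NAs.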