qzhang314 / DNAm-based-age-predictor

A chronological age predictor based on DNA methylation

Error in change of NAs to mean, and BLUP method #1

Closed. Katterinne closed this issue 4 years ago.

Katterinne commented 4 years ago

Dear Zhang,

Thank you for the tool (it works and it's user-friendly, unlike DNAmAge).

This is to report what I think may be a bug in the transformation of missing data to the mean value when feeding the tool with an input with NAs in it.

Here is what happened: my methylation data includes several CpGs with NAs, and when I run DNAm-based-age-predictor I get the EN prediction back, but the "blupred" column contains only NAs. Of course, I may be making a mistake here; I'm not sure. I hope you can help me. Let me walk you through what I did and share my input data.

First, I created the required input for DNAm-based-age-predictor (RDS file) from my table containing the DNA methylation values at all CpG sites in all my samples.

In R:

# read DNA methylation data
data <- as.matrix(t(read.table(file = "Nexs_methyl.tsv", 
                               header = TRUE, sep = "\t", row.names = 1, as.is=TRUE)))
# export R object
saveRDS(data, file = "Nexs.rds")

Once the RDS file was ready, I ran the tool:

In the terminal:

$ Rscript pred.R -i Nexs.rds -o Nexs_age.pred -a Nexs.age 
[1] "1. Data loading and QC"
[1] "1.1 Reading the data"
[1] "1.2 Replacing missing values with mean value"
[1] "1.3 Standardizing"
[1] "2. Loading predictors"
[1] "3. Checking misssing probes"
[1] "0 probe(s) in Elastic Net predictor is(are) not in the data"
[1] "0 probe(s) in BLUP predictor is(are) not in the data"
[1] "BLUP can perform better if the number of missing probes is too large!"
[1] "4. Predicting"
[1] "Completed!!!"
$ cat Nexs_age.pred
ID age enpred blupred
Nex10 0 23.5517451927356 NA
Nex12 0 27.16853446086 NA
Nex18 0 26.9359837106917 NA
Nex6 0 36.3717094129017 NA
Nex8 0 22.8164441953937 NA

Using this Dropbox link you can download a zip file containing all the files mentioned above: Nexs_methyl.tsv, Nexs.rds, Nexs.age, and Nexs_age.pred.

Thank you in advance!

Regards, Katterinne

qzhang314 commented 4 years ago

Hi Katterinne,

As you may have noticed, I do not check GitHub regularly. Sorry for the late reply!

As for your problem: since there are only 5 samples in your data file, some of the probes are NA across all samples. In that case the "replace NA" step does not work, because it is designed to replace each NA with the average DNA methylation value of that probe across samples, and that average is undefined when every sample is NA. The NA values left in the data then make the matrix multiplication fail, so the prediction comes out as NA.
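The failure described above can be reproduced on a toy matrix. This is only a minimal sketch, not the code in pred.R: the probe names, values, and weights are invented, and the imputation loop is an assumed stand-in for the tool's "replace NA" step.

```r
# Toy methylation matrix: 3 samples (rows) x 3 probes (columns).
# Probe names and values are invented for illustration only.
meth <- matrix(c(0.20, NA,   NA,
                 0.40, 0.50, NA,
                 0.60, 0.70, NA),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("S1", "S2", "S3"),
                               c("cg_a", "cg_b", "cg_c")))

# Per-probe mean over the samples that have a value.
# For cg_c there is no observed value, so its mean is NaN.
probe_mean <- colMeans(meth, na.rm = TRUE)

# Replace each NA with the mean of its own probe (column).
imputed <- meth
for (j in seq_len(ncol(imputed))) {
  miss <- is.na(imputed[, j])
  imputed[miss, j] <- probe_mean[j]
}

# Hypothetical predictor weights, one per probe (assumed values).
weights <- c(cg_a = 1.0, cg_b = 2.0, cg_c = 0.5)

# A linear predictor is a matrix product; one missing column
# is enough to make every sample's prediction missing.
pred <- drop(imputed %*% weights)

print(probe_mean)
print(pred)
```

The partially missing probe (cg_b) is filled in as intended, while the probe with no observed values (cg_c) stays missing after imputation and carries through to every prediction.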

I have now updated the script to detect such probes and remove them. Since probes like this are rare (especially when the sample size is large), I think removing them will have little effect on the chronological age prediction.
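A rough sketch of this kind of filter is shown below. It is not the code in the updated script: the helper names drop_all_na_probes and impute_probe_mean are hypothetical, and the toy matrix is invented.

```r
# Hypothetical helper: drop probes (columns) that are NA in every sample.
drop_all_na_probes <- function(meth) {
  all_na <- colSums(!is.na(meth)) == 0
  if (any(all_na)) {
    message(sum(all_na), " probe(s) removed: NA in all samples")
  }
  meth[, !all_na, drop = FALSE]
}

# Hypothetical helper: replace the remaining NAs with the per-probe mean.
impute_probe_mean <- function(meth) {
  probe_mean <- colMeans(meth, na.rm = TRUE)
  idx <- which(is.na(meth), arr.ind = TRUE)  # (row, col) of each NA
  meth[idx] <- probe_mean[idx[, "col"]]
  meth
}

# Toy data: samples in rows, probes in columns (invented values).
meth <- matrix(c(0.20, NA,   NA,
                 0.40, 0.50, NA,
                 0.60, 0.70, NA),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("S1", "S2", "S3"),
                               c("cg_a", "cg_b", "cg_c")))

clean <- impute_probe_mean(drop_all_na_probes(meth))
print(clean)
```

After filtering, any predictor weights would also need to be restricted to the probes that remain, so that the matrix product still lines up column by column.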

Cheers, Qian

Katterinne commented 4 years ago

Hi Qian,

No problem at all. On the contrary, thank you very much for answering! And I'm sorry for my own late reply; I was away for a couple of weeks.

Well, that makes a lot of sense. Shame on me for not noticing it myself! Thanks a lot for the script update; it works without problems now :D

Cheers, Katterinne