rsinghlab / AltumAge

BSD 3-Clause "New" or "Revised" License
30 stars 9 forks source link

Questions about missing CpGs #5

Closed liguowang closed 11 months ago

liguowang commented 1 year ago

Hi, I can reproduce your results using the example dataset. But have some problems to run AltumAge on our own dataset. I have two basic questions as below:

Q1: Does AltumAge need exactly 20318 CpGs to run the prediction? Q2: The author recommends using "BMIQcalibration" normalized beta values as input. However, it seems BMIQcalibration reduces "beta-matrix to the 18747 CpGs", how can we get the 20318 CpGs after BMIQcalibration ?

Thanks Liguo

lcamillo commented 1 year ago

Dear Liguo,

The model was trained with 20318 CpGs, so the best performance would require all of them. Nevertheless, given the nature of the network and the training, it is expected that even if not all CpG sites are there, AltumAge should still perform decently. Illumina's EPIC arrays (for some reason) have slightly different numbers of CpGs, and of course not all pass QC. These are probably the reasons why there are only 18747 after BMIQ. The recommendation for missing CpG values is to just zero them out after scaling with the RobustScaler.

Hope this helps.

liguowang commented 1 year ago

Thank you very much for your quick response. It's good to know that AltumAge still works decently with some missing CpGs.

Some update, even though I provided all the CpGs from the epic 850 array (without any filtering, a total of 865859 CpGs) to AltumAge, I still see there are 12 CpGs missing. Not sure why, maybe Illumina removed these probes?

Also, I cannot run RobustScaler if there are missing CpGs. Give me this error: ValueError: X has 20306 features, but RobustScaler is expecting 20318 features as input. Is it OK to insert all missing CpGs (with beta values as zeros) into the data frame before running RobustScaler?

12 CpGs missed from Epic Array but required by AltumAge.

{'cg24497877', 'cg22295573', 'cg13326338', 'cg06007645', 'cg21968169', 'cg19167673', 'cg14329157', 'cg11654620', 'cg22680204', 'cg23032316', 'cg15565533', 'cg15869022'}

Best,

Liguo

lcamillo commented 1 year ago

This is talking from experience, but it seems that for the EPIC v1 array, Illumina produced very slightly different versions with different subsets of CpGs (I'm not sure if this is also the case with EPIC v2). If you look at different papers that used EPIC v1, they usually report different numbers of probes. The difference is usually less than 0.1% of CpG sites. If only 12 are missing, zeroing them out after scaling should have virtually no impact on the prediction.

liguowang commented 1 year ago

Hi, Want to share more information after I tried AltumAge. We have 20 samples, the epic array data were processed using the minfi package which applied Quantile normalization to the beta values. We also correct batch effects using the combat function from the sva package. We did NOT use the BMIQ calibrated data because it reduced the beta value matrix to ~18,000 CpG (missing values exceeds 20% for Hovath clocks).

In short, I found the 12 CpGs mentioned earlier significantly impact the predicted age. Since these 12 CpGs were removed from our Epic array, I manually inserted them back with two kinds of "pseudo methylation values": 1) all zeros. The predicted age is marked as "AltumAge_0" in the table below. 2) the mean values of all available probes for each sample. (this might not make perfect sense, but here we want to evaluate how the 12 CpGs affect the final prediction). The predicted age is marked as "AltumAge_mean" in the table below.

Horvath_2013 and Horvath_2018 were also calculated as references.

Obervations about AltumAge: 1) the correlations are decent (r = 0.84 and r = 0.87). But the Residuals (the difference between the actual and predicted values) are relatively large. 2) the 12 CpGs (0.06% of all the CpGs used by AltumAge) greatly impact the prediction.

Did I miss anything or do anything wrong?

Sample_ID Age Horvath_2013 Horvath_2018 AltumAge_mean AltumAge_0 s64N 49 58.47419252 59.8670955 50.631897 48.678112 s72N 56 64.39910567 59.22809727 54.357853 52.101345 s62N 57 61.99877625 67.66077588 53.983173 51.640491 s63N 57 61.71646176 65.09349075 52.7098 49.81702 s73N 57 63.76538251 67.11313949 56.21239 53.745541 s36N 60 65.91956017 69.18338229 52.03167 50.21751 s68N 60 64.2655496 68.1667626 54.61778 52.389278 s69N 60 67.02423968 67.68008996 59.08518 56.165531 s45N 61 67.67669846 67.23958129 61.632748 58.480209 s60N 62 62.54077402 67.74347415 52.199352 49.809044 s16N 65 67.51525021 72.08996462 57.76328 54.432926 s58N 65 68.22414886 73.16267883 52.68145 50.850163 s46N 66 69.34162376 66.63600637 56.71151 55.122692 s70N 67 66.64620845 69.27134423 57.38026 54.752045 s55N.2 68 68.98898036 71.53541058 60.01783 57.210129 s76N 74 73.8343614 70.51197631 61.251125 59.419117 s71N 75 70.48450387 75.09955635 62.44051 59.430168 s75N 75 77.90403049 76.30641086 61.66109 60.217045 s77N 80 80.16928515 79.72629142 67.687256 65.953186 s74N 82 80.21085409 81.68738063 64.76565 63.121395

lcamillo commented 1 year ago

Hi,

In terms of substituting values in CpG sites, could I just double check you are doing it after scaling? The reason for this is that if, let's say, the beta of CpG X has a value of 0.95 and a standard deviation of 0.05, then setting it to 0 (or even the mean per sample, which tends to be around 0.2-0.4) is going to result in a massive value after processing with the RobustScaler (several standard deviations off the median). Even if the CpG is not very important, this would likely impact the value.

When it comes to the residuals, Horvath 2013 and AltumAge indeed sometimes display different behaviors. For instance, in an OSKM rejuvenation time course, initially, the Horvath age goes up and AltumAge goes down, and the difference in absolute predicted age between the two reaches about 20 years.

Lastly, when it comes to normalization, I would check this paper for general recommendations: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02793-w.

Hope this helps and thanks for sharing your results!