hansenlab / minfi

Devel repository for minfi
60 stars 70 forks source link

wrong probe types used for EPIC samples in `preprocessNoob` due to b2-based manifest and b4-based annotation #243

Open maximilian-leitheiser opened 1 year ago

maximilian-leitheiser commented 1 year ago

Thanks a lot for making minfi available to the scientific community! I have used it extensively in my research over the last few years. However, I think I might have come across a mistake in preprocessNoob that occurs for EPIC (v1) samples:

After reading such an EPIC sample from GEO

> library(GEOquery)
> library(minfi)
> library(stringr)

> # get IDAT files from GEO
> GEO_files <- getGEOSuppFiles("GSM5239506")
> sapply(rownames(GEO_files), gunzip, overwrite = TRUE)

> # read IDATs
> IDAT_basename <- rownames(GEO_files)[[1]]
> IDAT_basename <- paste(head(unlist(str_split(IDAT_basename, "_")), n = -1), collapse = "_")
> rgSet <- minfi::read.metharray(basenames = IDAT_basename_2)
> dim(rgSet)
[1] 1051815       1
> annotation(rgSet)
                         array                     annotation 
"IlluminaHumanMethylationEPIC"                 "ilm10b4.hg19"

and executing this first code chunk from preprocessNoob

> oob <- getOOB(rgSet)
> GreenOOB <- oob[["Grn"]]
> RedOOB <- oob[["Red"]]
> MSet <- preprocessRaw(rgSet)
> probe.type <- getProbeType(MSet, withColor = TRUE)
> Green_probes <- which(probe.type == "IGrn")
> Red_probes <- which(probe.type == "IRed")
> d2.probes <- which(probe.type == "II")
> Meth <- getMeth(MSet)
> Unmeth <- getUnmeth(MSet)

the number of CpGs in the Meth/Unmeth matrices and probe.type are not the same:

> nrow(Meth)
[1] 866091
> length(probe.type)
[1] 865859

Hence, probe.type does not correspond to the actual probe types of Meth and Unmeth and the following subsequent indexing of them in .preprocessNoob with Green_probes and so on will produce incorrect results:

# NormExp estimates for Green and Red
        dat <- list(
            Green = list(
                M =  Meth[Green_probes, , drop = FALSE],
                U =  Unmeth[Green_probes, , drop = FALSE],
                D2 = Meth[d2.probes, , drop = FALSE]),
            Red = list(
                M =  Meth[Red_probes, , drop = FALSE],
                U =  Unmeth[Red_probes, , drop = FALSE],
                D2 = Unmeth[d2.probes, , drop = FALSE]))

The original discrepency in CpG sites occurs because

and these two versions of the manifest file have a differing number of probes.

For our EPIC sample from above, making a quick-and-dirty b2-based version of probe.info leads to the following results:

> # make probe.info version based on b2 manifest
> type_1 <- getManifest(rgSet)@data$TypeI
> type_1 <- type_1[, c("Name", "AddressA", "AddressB", "Color")]
> type_1$Type <- "I"

> type_2 <- getManifest(rgSet)@data$TypeII
> type_2 <- type_2[c("Name", "AddressA")]
> type_2$Color <- ""
> type_2$AddressB <- ""
> type_2$Type <- "II"

> type_df <- rbind(type_1, type_2)
> type_df <- type_df[match(rownames(MSet), type_df$Name), ]
> probe.type_b2 <- paste0(type_df $Type, type_df_fitted$Color)
> Green_probes_b2 <- which(probe.type_b2 == "IGrn")
> Red_probes_b2 <- which(probe.type_b2 == "IRed")
> d2.probes_b2 <- which(probe.type_b2 == "II")

> # check probes the original method is actually indexing
> table(probe.type_b2[Green_probes])

 IGrn  IRed 
17984 31955 
> table(probe.type_b2[Red_probes])

 IGrn  IRed 
31956 60242 
> table(probe.type_b2[d2.probes])

  IGrn     II   IRed 
    14 723678     30 

I'd be curious to hear your thoughts! Am I missing something, or did I setup something incorrectly?

Kind regards, Max