hansenlab / minfi

Devel repository for minfi
58 stars 70 forks source link

estimateCellType w/CordBlood #167

Open kasperdanielhansen opened 6 years ago

kasperdanielhansen commented 6 years ago

From Epigenomics Forum, posted by Vinícius Daguano Gastaldi vinidg@gmail.com

Dear all,

I'm trying to use the estimateCellCounts() function with the FlowSorted.CordBlood.450k package. However, I keep having the following error:

Error in as(rv, class(v)) : 
  no method or default for coercing “logical” to “factor”

Can anyone help me with it? I've tried multiple versions of the FlowSorted.CordBlood.450k package, but none of them seem to work. The FlowSorted.Blood.450k works without any issues.

I'm currently running it with the following parameters:

cellCounts <- estimateCellCounts(RGSet,compositeCellType = "CordBlood",
                                 processMethod = "preprocessFunnorm", probeSelect = "auto",
                                 cellTypes = c("Bcell", "CD4T", "CD8T", "Gran", "Mono", "NK","nRBC"),
                                 referencePlatform = "IlluminaHumanMethylation450k",
                                 returnAll = TRUE, meanPlot = TRUE, verbose = TRUE, sex = targets$Sex, bgCorr = TRUE,
                                 dyeCorr = TRUE, keepCN = TRUE, ratioConvert = TRUE)

Thanks!

royfrancis commented 5 years ago

I have tried FlowSorted.Blood.450k,FlowSorted.Blood.EPIC and FlowSorted.CordBlood.450k using functions estimateCellCounts(), estimateCellCounts2() and estimateCellCounts.wlmn() (from wateRmelon package) and none of them work with my EPIC rg data.

I will discuss the FlowSorted.CordBlood.450k case below. These are my input and reference objects:

> rgf
class: RGChannelSet 
dim: 971070 477 
metadata(0):
assays(2): Green Red
rownames(971070): 1600101 1600111 ... 99810978 99810992
rowData names(0):
colnames(477): 202884930002_R01C01 202884930002_R02C01 ...
  203045470096_R07C01 203045470096_R08C01
colData names(8): Sample_Name Sample_Well ... Sentrix_Position
  Sentrix_ID
Annotation
  array: IlluminaHumanMethylationEPIC
  annotation: ilm10b2.hg19

> FlowSorted.CordBlood.450k
class: RGChannelSet 
dim: 622399 104 
metadata(0):
assays(2): Green Red
rownames(622399): 10600313 10600322 ... 74810490 74810492
rowData names(0):
colnames(104): 3999984027_R01C01 3999984027_R02C01 ...
  3999492054_R05C02 3999492054_R06C02
colData names(13): X Plate_ID ... predictedSex CellType
Annotation
  array: IlluminaHumanMethylation450k
  annotation: ilmn12.hg19

I keep getting this error:

cc <- estimateCellCounts(rgf,compositeCellType="CordBlood",
                         cellTypes=unique(pData(FlowSorted.CordBlood.450k)$CellType))

Loading required package: IlluminaHumanMethylation450kmanifest
Loading required package: IlluminaHumanMethylationEPICmanifest
[estimateCellCounts] Combining user data with reference (flow sorted) data.
Error in as(rv, class(v)) : 
  no method or default for coercing "logical" to "factor"

I tried removing pData with NAs and converted all variables from factors to characters and still don't work. Then perhaps it's factors or NA columns in the pData of the reference object. So, I created a modified version of estimateCellCounts() to pass in a modified FlowSorted.CordBlood.450k.

Now I get this error:

cc <- ecc(rgf,ref=FlowSorted.CordBlood.450k,compositeCellType="CordBlood",
          cellTypes=unique(pData(FlowSorted.CordBlood.450k)$CellType),
          returnAll=TRUE)

[estimateCellCounts] Combining user data with reference (flow sorted) data.
[estimateCellCounts] Processing user and reference data together.
[estimateCellCounts] Picking probes for composition estimation.
Error in p[trainingProbes, ] : subscript out of bounds

To me, this looked like it could be mismatch between probes as they are differing in number between my EPIC object and the 450k ref. Below I match the number of rows in input and reference.

rn <- intersect(rownames(FlowSorted.CordBlood.450k),rownames(rgf))
rgf1 <- rgf[rownames(rgf) %in% rn,]
fscb <- FlowSorted.CordBlood.450k[rownames(FlowSorted.CordBlood.450k) %in% rn,]

# same number of rows
> rgf1
class: RGChannelSet 
dim: 495 477 
metadata(0):
assays(2): Green Red
rownames(495): 10627500 10714330 ... 74748483 74787491
rowData names(0):
colnames(477): 202884930002_R01C01 202884930002_R02C01 ...
  203045470096_R07C01 203045470096_R08C01
colData names(8): Sample_Name Sample_Well ... Sentrix_Position
  Sentrix_ID
Annotation
  array: IlluminaHumanMethylationEPIC
  annotation: ilm10b2.hg19

> fscb
class: RGChannelSet 
dim: 495 104 
metadata(0):
assays(2): Green Red
rownames(495): 10627500 10714330 ... 74748483 74787491
rowData names(0):
colnames(104): 3999984027_R01C01 3999984027_R02C01 ...
  3999492054_R05C02 3999492054_R06C02
colData names(13): X Plate_ID ... predictedSex CellType
Annotation
  array: IlluminaHumanMethylation450k
  annotation: ilmn12.hg19

Now the error is the same, but slightly different text.

cc <- ecc(rgf1,ref=fscb,compositeCellType="CordBlood",
          cellTypes=unique(pData(fscb)$CellType),
          returnAll=TRUE)

[estimateCellCounts] Combining user data with reference (flow sorted) data.
[estimateCellCounts] Processing user and reference data together.
 Error in Meth[Green_probes, , drop = FALSE] : subscript out of bounds 

I also see that my input and the reference has different versions of the annotation. I am not sure what effect that has.

R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
minfi_1.26.2
rcavalcante commented 5 years ago

Hi,

I encountered this problem as well and when I stepped through the code manually, I got to a point where the probeList is:

List of 8
 $ Bcell     : chr [1:100] "cg17768768" "cg01598421" "cg25131632" "cg27106643" ...
 $ CD4T      : chr [1:100] "cg26852712" "cg16452866" "cg07015803" "cg07679948" ...
 $ CD8T      : chr [1:100] "cg08289375" "cg23752651" "cg20042612" "cg08691577" ...
 $ Gran      : chr [1:100] "cg05398700" "cg00660167" "cg25605731" "cg13785123" ...
 $ Mono      : chr [1:100] "cg26872907" "cg23244761" "cg12655112" "cg18990407" ...
 $ NK        : chr [1:100] "cg13617280" "cg05875239" "cg15956469" "cg27582527" ...
 $ nRBC      : chr [1:100] "cg25105522" "cg06768361" "cg05012676" "cg17052552" ...
 $ WholeBlood: chr [1:100] "cg04012618" "cg04315923" "NA" "NA.1" ...

Notice the NAs in the WholeBlood type. When I don't include WholeBlood in the cellTypes in the call to estimateCellCounts I don't get the Error in p[trainingProbes, ] : subscript out of bounds error.

So if you can go without WholeBlood, maybe you can move past this problem.

KevinMenden commented 5 years ago

If it's worth anything, I've encountered the same error as the original post here in this thread. I'm trying to deconvolve using the DLPFC reference, with pretty much basic parameters.

Did anybody found a solution to this error yet?

pryabinin commented 4 years ago

I think I found what the problem is. What rcavalcane noticed is correct, since some of the training probes are given the name NA, this means that the command: p[trainingProbes, ]

would cause an error. The reason why probes are called NA is on line 174 of pickCompProbes: c(rownames(yUp)[seq_len(numProbes)], rownames(yDown)[seq_len(numProbes)])

Here, the top 50 (since numProbes = 50) up regulated probes and top 50 down regulated probes are selected for each cell type. However, before this, the probes were filtered to have a p-value of < 1e-08 on line 170 of pickCompProbes: y <- x[x[, "p.value"] < 1e-08, ]

This means that if there are less than 50 down regulated probes or less than 50 up regulated probes with p-values < 1e-08 then line 174 will add NAs to the list of training probes and cause the error we see.

I think this error will occur for any input data set, because this error occurs with the example code provided with estimateCellCounts if you add in neutrophils to the list of cell types to be predicted (in additional to the default predicted cell types):

library(minfi)
library(FlowSorted.Blood.450k)
wh.WBC <- which(FlowSorted.Blood.450k$CellType == "WBC")
wh.PBMC <- which(FlowSorted.Blood.450k$CellType == "PBMC")
RGset <- FlowSorted.Blood.450k[, c(wh.WBC, wh.PBMC)]
## The following line is purely to work around an issue with repeated
## sampleNames and Biobase::combine()
sampleNames(RGset) <- paste(RGset$CellType,
                            c(seq(along = wh.WBC), seq(along = wh.PBMC)), sep = "_")
counts <- estimateCellCounts(RGset, meanPlot = FALSE,cellTypes = c("CD8T","CD4T", "NK","Bcell","Mono","Gran","Neu"))
round(counts, 2)
[estimateCellCounts] Combining user data with reference (flow sorted) data.

[estimateCellCounts] Processing user and reference data together.

[preprocessQuantile] Mapping to genome.
Loading required package: IlluminaHumanMethylation450kmanifest
Loading required package: IlluminaHumanMethylation450kanno.ilmn12.hg19
[preprocessQuantile] Fixing outliers.
[preprocessQuantile] Quantile normalizing.
[estimateCellCounts] Picking probes for composition estimation.

Error in p[trainingProbes, ] : subscript out of bounds
In addition: Warning messages:
1: In DataFrame(sampleNames = c(colnames(rgSet), colnames(referenceRGset)),  :
  'stringsAsFactors' is ignored
2: In .getSex(CN = CN, xIndex = xIndex, yIndex = yIndex, cutoff = cutoff) :
  An inconsistency was encountered while determining sex. One possibility is that only one sex is present. We recommend further checks, for example with the plotSex function.

Basically, there are not enough probes significantly associated with a cell type, and therefore NA is being inserted in place of proper probe names. The data is then being subset by the probe names and if a probe name is NA then that causes an error. This error rarely occurs when using estimateCellTypes with default parameters. NOTE: in the above example, neutrophils were added to the list of default cell types, however it ended up that granulocytes did not have enough significant probes, even though granulocytes do have enough significant probes if you exclude neutrophils

There are several "solutions" to this error but they all make assumptions. The most straightforward fix is to say that estimateCellCounts is not valid for any cell type set for which this error occurs. Another solution is to pick AT MOST 50 up- and down-regulated significant probes for each cell type (simply remove the NA values), however I do not know if the method is valid under these parameters.

hhx037 commented 4 years ago

So, I had a similar issue with estimateCellCounts2 in FlowSorted.Blood.EPIC, and I finally managed to sort it out.

The problem, errors like

no method or default for coercing "logical" to "factor"

arises when it attempts to match the data type between the user-provided RGset and the reference set, more specifically this line from estimateCellCounts2.R:

colData(referenceRGset)[commoncolumn] <- mapply(FUN = as, colData(referenceRGset)[commoncolumn], vapply(colData(rgSet)[commoncolumn], class, FUN.VALUE=character(1)), SIMPLIFY = FALSE)

So I tried to convert my data types, and also match my column names Age and Sex, as well as the coding for sex, to the reference:

colData(RGset)$Age = colData(RGset)$age # estimateCellCounts2 checks for Age column
colData(RGset)$Sex = colData(RGset)$sex # estimateCellCounts2 checks for Sex column
colData(RGset)$Sex = as.character(colData(RGset)$Sex)
colData(RGset)$Sex[which(colData(RGset)$Sex=="1")] = "M"
colData(RGset)$Sex[which(colData(RGset)$Sex=="2")] = "F"
colData(RGset)$Sample_Name = as.character(colData(RGset)$Sample_Name)
colData(RGset)$Sample_Plate = as.character(colData(RGset)$Sample_Plate)
colData(RGset)$Sample_Group = as.character(colData(RGset)$Sample_Group)
colData(RGset)$Array = as.character(colData(RGset)$Array)
colData(RGset)$Slide = as.numeric(colData(RGset)$Slide)

counts <- estimateCellCounts2m(RGset, compositeCellType = "Blood", processMethod = "auto", probeSelect = "auto", cellTypes = c("CD8T", "CD4T", "NK", "Bcell", "Mono", "Neu"), referencePlatform = "IlluminaHumanMethylationEPIC")

But that still failed! I found out that after passing RGset to the function, when checking the function's internal variable rgSet, my samples names were back to "integer", the sample_Group to "factor" etc. So I modified the function by adding a few steps to transform rgSet types:

original:

    if(is.null(rgSet$CellType))
        rgSet$CellType<-rep("NA", dim(rgSet)[2])
    if(is.null(rgSet$Age))
        rgSet$Age<-rep("NA", dim(rgSet)[2])
    if(is.null(rgSet$Sex))
        rgSet$Sex<-rep("NA", dim(rgSet)[2])

modified:

    if(is.null(rgSet$CellType))
        rgSet$CellType<-rep("NA", dim(rgSet)[2])
    if(is.null(rgSet$Age))
        rgSet$Age<-rep("NA", dim(rgSet)[2])
    if(is.null(rgSet$Sex))
        rgSet$Sex<-rep("NA", dim(rgSet)[2])
    if(!is.character(rgSet$Sample_Name))
        rgSet$Sample_Name = as.character(rgSet$Sample_Name)
    if(!is.character(rgSet$Sample_Plate))
        rgSet$Sample_Plate = as.character(rgSet$Sample_Plate)
    if(!is.character(rgSet$Sample_Group))
        rgSet$Sample_Group = as.character(rgSet$Sample_Group)
    if(!is.character(rgSet$Array))
        rgSet$Array = as.character(rgSet$Array)
    if(!is.numeric(rgSet$Slide))
        rgSet$Slide = as.numeric(rgSet$Slide)
    if(!is.character(rgSet$Sex))
        rgSet$Sex = as.character(rgSet$Sex)

that solved the issue and the function completed successfully! I'm attaching the modified function here, just source it and run it using estimateCellCounts2m (instead of estimateCellCounts2).

estimateCellCounts2_modified.zip

Hope this helps.

lferreiraMD commented 4 months ago

Basically, there are not enough probes significantly associated with a cell type, and therefore NA is being inserted in place of proper probe names. The data is then being subset by the probe names and if a probe name is NA then that causes an error. This error rarely occurs when using estimateCellTypes with default parameters. NOTE: in the above example, neutrophils were added to the list of default cell types, however it ended up that granulocytes did not have enough significant probes, even though granulocytes do have enough significant probes if you exclude neutrophils

Run into the same error message with 450K arrays. Excluding "Neu" from the cell type options "fixed" it. Code needs to be amended to test whether there are at least 50 (or N) hyper- and hypomethylated proobes for each cell type. Easier said than done; we're all in a hurry...