Loading count matrix from 10x dataset in DIscBIO

abhisheksinghnl commented 2 years ago

Dear Author,

Thank you for the excellent tool. I have tested the tool on the test dataset of CTC and it works flawlessly. I then tried to use it on a 10x dataset from GSE136103. It has 10 healthy and 10 diseased samples.

I have processed these using Seurat as follows

data.10x = list()
data.10x[[1]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041161/cellranger/")
data.10x[[2]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041162/cellranger/")
data.10x[[3]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041163/cellranger/")
data.10x[[4]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041164/cellranger/")
data.10x[[5]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041165/cellranger/")
data.10x[[6]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041166/cellranger/")
data.10x[[7]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041167/cellranger/")
data.10x[[8]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041168/cellranger/")
data.10x[[9]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Cirrhotic.Cellranger/Cirrhotic/GSM4041169/cellranger/")
data.10x[[10]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041150/cellranger/")
data.10x[[11]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041151/cellranger/")
data.10x[[12]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041152/cellranger/")
data.10x[[13]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041153/cellranger/")
data.10x[[14]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041154/cellranger/")
data.10x[[15]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041155/cellranger/")
data.10x[[16]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041156/cellranger/")
data.10x[[17]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041157/cellranger/")
data.10x[[18]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041158/cellranger/")
data.10x[[19]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041159/cellranger/")
data.10x[[20]] <- Read10X(data.dir = "/home/abhishek/UseCase/Liver/UseCaseLiver/Healthy.Cellranger/Healthy/GSM4041160/cellranger/")

### Create vector of sample names
samples = c("GSM4041150","GSM4041151","GSM4041152","GSM4041153","GSM4041154","GSM4041155","GSM4041156","GSM4041157","GSM4041158", "GSM4041159","GSM4041160",         "GSM4041161","GSM4041162","GSM4041163","GSM4041164","GSM4041165","GSM4041166","GSM4041167","GSM4041168","GSM4041169")

#Create Seurat Objects
scrna.list = list()
for (i in 1:length(data.10x)) {
  scrna.list[[i]] = CreateSeuratObject(counts = data.10x[[i]], min.cells=5, min.features=50, project=samples[i]);
  scrna.list[[i]][["DataSet"]] = samples[i];
}

rm(data.10x)

### Merge Seurat object into a single object
scrna <- merge(x=scrna.list[[1]], y=c(scrna.list[[2]],scrna.list[[3]],scrna.list[[4]],scrna.list[[5]],scrna.list[[6]],scrna.list[[7]],
                                      scrna.list[[8]],scrna.list[[9]],scrna.list[[10]],scrna.list[[11]],scrna.list[[12]],scrna.list[[13]],
                                      scrna.list[[14]],scrna.list[[15]],scrna.list[[16]],scrna.list[[17]],scrna.list[[18]],scrna.list[[19]],
                                      scrna.list[[20]]), add.cell.ids =  c("GSM4041161","GSM4041162",
                                      "GSM4041163","GSM4041164","GSM4041165","GSM4041166","GSM4041167","GSM4041168","GSM4041169",
                                      "GSM4041150","GSM4041151","GSM4041152","GSM4041153","GSM4041154","GSM4041155","GSM4041156",
                                      "GSM4041157","GSM4041158", "GSM4041159","GSM4041160"));

Then computed the count matrix per sample by taking average as below

avg.counts <- AverageExpression(object = scrna)
write.csv(avg.counts, file="Liver.csv")

Then I use DIscBIO pipeline as outlined for further analyses

library(DIscBIO)

Loading Dataset

FileName<-"liver" DataSet <- read.csv(file = paste0(FileName,".csv"), sep = ",",header=T) rownames(DataSet)<-DataSet[,1] DataSet<-(DataSet[,-1]) cat(paste0("The ", FileName," contains:","\n","Genes: ",length(DataSet[,1]),"\n","cells: ",length(DataSet[1,]),"\n")) sc<- DISCBIO(DataSet)

In the next step everything becomes infinity

S1<-summary(colSums(DataSet,na.rm=TRUE))            # It gives an idea about the number of reads across cells
print(S1)

I am not sure where I am going wrong and how can I fix this and in addition is DIscBIO capable of handling 10x data or it is only for FACS sorted and SMART-seq data?

Could you please guide me and give suggestion that could fix this problem.

Thank you

wleoncio commented 2 years ago

Hi there,

Thank you for your report. We'll try to reproduce the issue reported and get back to you ASAP. What's your output of packageVersion("DIscBIO")?

abhisheksinghnl commented 2 years ago

packageVersion("DIscBIO")

Hi there,

Thank you for your report. We'll try to reproduce the issue reported and get back to you ASAP. What's your output of packageVersion("DIscBIO")?

Hi There,

Thank you for your reply.

The package version is 1.2.0

Looking forward to using 10x data.

Thank you

wleoncio commented 2 years ago

Hi there,

The problem seems to be directly caused by a presence of Inf values on DataSet. Most values in there look fine to me (see summary(DataSet)), but the presence of a few infinite values (check table(DataSet == Inf)) makes colSums() return Inf.

Since na.rm only removes NAs, but not Infs, the following code works around the issue:

DataSet[DataSet == Inf] <- NA # maybe instead of NA, use a meaninful but finite value
S1 <- summary(colSums(DataSet, na.rm=TRUE)) # gives an idea about the number of reads across cells

My output after that was:

> print(S1)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
2.886e+236 2.685e+284 3.313e+293 9.480e+302 5.593e+302 5.903e+303

I don't know if this makes sense methodologically, though.

Looking forward to your feedback!

wleoncio commented 2 years ago

Reproducible example available here: https://gist.github.com/wleoncio/a3f04d3bbe4ec89a1966c1b3c5be527e

wleoncio commented 2 years ago

Hi @abhisheksinghnl,

Could you please confirm if the solution presented above (https://github.com/ocbe-uio/DIscBIO/issues/41#issuecomment-1185254486) solves your issue?

Thanks!

wleoncio commented 12 months ago

Given the lack of feedback, I'll just assume the proposed solution fixes the issue. Please reopen if problem persists.

ocbe-uio / DIscBIO

Loading count matrix from 10x dataset in DIscBIO #41

Loading Dataset