navinlabcode / copykat


Use of matrix object instead of sparse matrix may lead to memory problems in very large datasets. #7

Closed enblacar closed 2 years ago

enblacar commented 3 years ago

Dear copyKAT developers,

I am trying to use your package in my analysis. However, I noticed that the very first command extracts the raw count matrix from the Seurat object and converts it into a matrix object. In my experience, with datasets larger than 50,000 cells, converting the sparse matrix into a dense matrix exceeds the maximum number of elements that can be stored (this limit can be checked with .Machine$integer.max, which equals 2147483647).

Is there a possibility to feed the package with the sparse raw count matrix instead? Or do we need to downsample our samples to get a suitable number of cells?

Many thanks for the feedback!

Best, Enblacar
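
The limit enblacar cites can be checked before attempting any conversion. The following is a minimal sketch, assuming the counts are held as a sparse matrix from the Matrix package; `fits_dense` is a hypothetical helper and not part of copykat. The 49056 x 62123 dimensions are the ones reported in a log later in this thread.

```r
library(Matrix)

# Hypothetical helper: TRUE if a dense copy of `m` would stay within
# .Machine$integer.max elements (2147483647).
fits_dense <- function(m) {
  # Multiply as doubles; integer multiplication would overflow to NA.
  n_elements <- as.numeric(nrow(m)) * as.numeric(ncol(m))
  n_elements <= .Machine$integer.max
}

# Small toy matrix standing in for a real count matrix (12 elements).
toy <- sparseMatrix(i = c(1, 3), j = c(2, 4), x = c(5, 1), dims = c(3, 4))
fits_dense(toy)  # TRUE

# A nearly empty sparse matrix with the dimensions reported in this thread.
# It is cheap to build while sparse, but has 3047505888 dense elements.
big <- sparseMatrix(i = 1, j = 1, x = 1, dims = c(49056, 62123))
fits_dense(big)  # FALSE
```

A check like this only indicates whether the conversion is likely to fail; it does not reduce the memory a successful dense conversion would need.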

gaobio commented 3 years ago

Hi enblacar, yes, exactly. We have noticed this. With current throughput, it is just OK. You should only run one sample at a time, because copykat searches for an internal control. But the sparse matrix is a good point. We will switch to a sparse matrix object...

guanxn90 commented 3 years ago

@gaobio @enblacar I recently used a sparse matrix as input and got results as well. Is the sparse matrix already supported? Using the same data, I ran copykat with a sparse matrix and with a regular matrix, and I got different results: left (run with sparse matrix) and right (run with regular matrix).

[image]

pbmc <- readRDS('pbmc_diet.rds') # pbmc is a Seurat object
exp.rawdata <- pbmc@assays$RNA@counts
copykat.test <- copykat(rawmat=exp.rawdata, id.type="S", ngene.chr=5, win.size=25, KS.cut=0.15, sam.name="SX00600DBE_allcells", distance="euclidean", norm.cell.names="", n.cores=4)

Thanks a lot, Guan

hartlama commented 3 years ago

Hi @guanxn90

How did you run the data as a sparse matrix? I am inputting a sparse matrix for the rawmat argument (similar to your code above), but I am still getting the memory error. I think the copykat script tries to convert it to a dense matrix.

Thanks, Molly

copykat.test <- copykat(rawmat=exp.rawdata, id.type="S", ngene.chr=5, win.size=25, KS.cut=0.1, sam.name="mIDH", distance="euclidean", norm.cell.names="", n.cores=8)
[1] "running copykat v1.0.4"
[1] "step1: read and filter data ..."
[1] "49056 genes, 62123 cells in raw data"
Error in asMethod(object) :
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

traceback()
6: asMethod(object)
5: as(x, "matrix")
4: as.matrix.Matrix(X)
3: as.matrix(X)
2: apply(rawmat, 2, function(x) (sum(x > 0)))
1: copykat(rawmat = exp.rawdata, id.type = "S", ngene.chr = 5, win.size = 25, KS.cut = 0.1, sam.name = "mIDH", distance = "euclidean", norm.cell.names = "", n.cores = 8)
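
The traceback shows the dense conversion being triggered by `apply(rawmat, 2, ...)`, which calls `as.matrix()` on the sparse input. As a hedged illustration only (this is not copykat's code, and other steps in the package may still require a dense matrix), the same per-cell count of detected genes can be computed without densifying by using `colSums()` from the Matrix package. The toy values below are made up.

```r
library(Matrix)

# Toy sparse count matrix (genes x cells); values are made up for illustration.
counts <- sparseMatrix(
  i = c(1, 2, 3, 1, 3),
  j = c(1, 1, 2, 3, 3),
  x = c(4, 1, 2, 7, 3),
  dims = c(3, 4),
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4))
)

# Dense route, as in the traceback: the matrix is converted to a base matrix
# before apply() counts the non-zero genes in each column.
genes_per_cell_dense <- apply(as.matrix(counts), 2, function(x) sum(x > 0))

# Sparse route: `counts > 0` stays sparse and colSums() works on it directly.
genes_per_cell_sparse <- Matrix::colSums(counts > 0)

# Both routes should agree on every cell.
stopifnot(all(genes_per_cell_dense == genes_per_cell_sparse))
```

The sparse route avoids allocating a genes x cells dense array, which is the step that failed with 62123 cells.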

gaobio commented 3 years ago

62123 cells in raw data

To echo this once more: run one sample at a time. Combining samples would generate wrong results because copykat uses relative gene expression to calculate CNAs. It does not contain a module to normalize batch effects, which may then be miscalled as CNAs.
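
Running one sample at a time could be arranged roughly as follows. This is a minimal sketch, assuming a combined genes x cells sparse matrix and one sample label per cell; `combined`, `sample_id`, and `split_by_sample` are hypothetical names, and the `copykat()` call is left commented out with arguments copied from the calls earlier in this thread.

```r
library(Matrix)

# Hypothetical combined matrix: 3 genes x 6 cells from two samples.
combined <- sparseMatrix(
  i = c(1, 2, 3, 1, 2, 3),
  j = c(1, 2, 3, 4, 5, 6),
  x = c(1, 2, 3, 4, 5, 6),
  dims = c(3, 6),
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:6))
)
sample_id <- c("A", "A", "A", "B", "B", "B")  # one label per column

# Hypothetical helper: returns a named list of per-sample sub-matrices,
# subsetting columns while the data are still sparse.
split_by_sample <- function(m, labels) {
  stopifnot(length(labels) == ncol(m))
  idx <- split(seq_len(ncol(m)), labels)
  lapply(idx, function(cols) m[, cols, drop = FALSE])
}

per_sample <- split_by_sample(combined, sample_id)

for (s in names(per_sample)) {
  mat <- per_sample[[s]]
  message(s, ": ", nrow(mat), " genes, ", ncol(mat), " cells")
  # Assumed usage, one run per sample (requires the copykat package):
  # res <- copykat(rawmat = as.matrix(mat), id.type = "S", ngene.chr = 5,
  #                win.size = 25, KS.cut = 0.1, sam.name = s,
  #                distance = "euclidean", norm.cell.names = "", n.cores = 4)
}
```

Splitting before any dense conversion also keeps each per-sample matrix smaller than the combined one.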

hartlama commented 3 years ago

Thanks @gaobio !

jpeng2021 commented 3 years ago

@gaobio I am running copyKAT on the raw_feature_bc_matrix generated by Cell Ranger and loaded into a Seurat project for an individual sample. I got the same error message discussed above, "Cholmod error 'problem too large'", when I ran exp.rawdata <- as.matrix(raw@assays$RNA@counts). I then switched to the filtered barcodes from Cell Ranger's filtered_feature_bc_matrix, i.e. the barcodes that Cell Ranger called as cells, and the program now runs. There are 6000 barcodes in filtered_feature_bc_matrix. Does copyKAT expect filtered barcodes (the barcodes called as cells) as input, or does it only work with raw barcodes (all barcodes with UMI>1, including background) from Cell Ranger? The copyKAT documentation only says "raw" without specifying this.

gaobio commented 2 years ago

Sorry for the confusion. The 'raw' in copykat refers to the original data matrix that is provided to the copykat function; it is not related to the 10x Genomics output.