theislab / destiny

R package for single cell and other data analysis using diffusion maps
https://theislab.github.io/destiny/
GNU General Public License v3.0

Segmentation fault with `DiffusionMap` #63

Open lucygarner opened 1 year ago

lucygarner commented 1 year ago

Hi,

I am getting a segmentation fault with DiffusionMap. I am not changing any of the defaults.

1) When inputting a data matrix as data, I get the following error:

caught segfault address 0x7f7675a30cb0, cause 'memory not mapped'
Error: Could not call find_knn. Consider specifying knn_params = list(M = <larger number>).
Original error: long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537

2) When inputting a SingleCellExperiment object as data, I get the following error:

caught segfault address 0x7f9dc6fb2cb0, cause 'memory not mapped'

Traceback:
1: knn_asym(data, k, distance)
2: knn.covertree::find_knn(data, k, query = query, distance = distance, sym = sym)
3: (function (data, k, ..., query = NULL, distance = c("euclidean", "cosine", "rankcor", "l2"), method = c("covertree", "hnsw"), sym = TRUE, verbose = FALSE) {
    p <- utils::modifyList(formals(RcppHNSW::hnsw_knn), list(...))
    method <- match.arg(method)
    distance <- match.arg(distance)
    if (!is.double(data)) {
        warning("find_knn does not yet support sparse matrices, converting data to a dense matrix.")
        data <- as.matrix(data)
    }
    if (method == "covertree") {
        return(knn.covertree::find_knn(data, k, query = query, distance = distance, sym = sym))
    }
    if (distance == "rankcor") {
        distance <- "cosine"
        data <- rank_mat(data)
        if (!is.null(query)) query <- rank_mat(query)
    }
    if (is.null(query)) {
        knn <- hnsw_knn(data, k + 1L, distance, M = p$M, ef_construction = p$ef_construction, ef = p$ef, verbose = verbose)
        knn$idx <- knn$idx[, -1, drop = FALSE]
        knn$dist <- knn$dist[, -1, drop = FALSE]
    } else {
        index <- hnsw_build(data, distance, M = p$M, ef = p$ef_construction, verbose = verbose)
        knn <- hnsw_search(query, index, k, ef = p$ef, verbose = verbose)
    }
    names(knn)[[1L]] <- "index"
    knn$dist_mat <- sparseMatrix(rep(seq_len(nrow(knn$index)), k), as.vector(knn$index), x = as.vector(knn$dist), dims = c(nrow(if (is.null(query)) data else query), nrow(data)))
    if (is.null(query)) {
        if (sym) knn$dist_mat <- symmetricise(knn$dist_mat)
        nms <- rownames(data)
    } else {
        nms <- rownames(query)
    }
    rownames(knn$dist_mat) <- rownames(knn$index) <- rownames(knn$dist) <- nms
    colnames(knn$dist_mat) <- rownames(data)
    knn
})(new("dgCMatrix", i = c(11854L, 32418L, 46422L, 42L, 100L, 173L, 285L, 293L, 419L, 504L, 629L, 694L, 743L, 777L, 835L, 1122L, 1183L, 1214L, 1259L, 1318L, 1382L, 1389L, 1402L, 1407L, 1655L, 1738L, 1779L, 1997L, 2008L, 2018L, 2023L, 2060L, 2204L, 2241L, 2416L, 2500L, 2558L, 2635L, 2690L, 2701L, 2715L, 2738L, 2742L, 2908L, 2982L, 3118L, 3119L, 3153L, 3311L, 3420L, 3566L,
3605L, 3691L, 3695L, 3715L, 3759L, 4015L, 4108L, 4164L, 4209L, 4260L, 4307L, 4319L, 4373L, 4649L, 4672L, 4702L, 4860L, 5361L, 5426L, 5593L, 5595L, 5638L, 5643L, 5675L, 5791L, 5934L, 5937L, 5942L, 6441L, 6442L, 6604L, 6714L, 6731L, 6740L, 6800L, 6844L, 6881L, 6906L, 6954L, 6984L, 7027L, 7033L, 7099L, 7177L, 7196L, 7260L, 7343L, 7356L, 7376L, 7569L, 7688L, 7831L, 7952L, 8024L, 8071L, 8097L, 8128L, 8131L, 8179L, 8207L, 8216L, 8444L, 8503L, 8527L, 8698L, 8718L, 8776L, 8820L, 8856L, 8987L, 8994L, 9116L, 9362L, 9363L, 9383L, 9449L, 9631L, 9686L, 9714L, 9750L, 9826L, 9873L, 10063L, 10079L, 10392L, 10400L, 10469L, 10504L, 10579L, 10600L, 10646L, 10866L, 10961L, 11055L, 11501L, 11511L, 11671L, 11780L, 11823L, 12115L, 12134L, 12242L, 12290L, 12353L, 12411L, 12544L, 12571L, 12890L, 12982L, 13013L, 13019L, 13029L, 13193L, 13259L, 13497L, 13548L, 13646L, 13704L, 13820L, 13896L, 13922L, 14016L, 14026L, 14045L, 14135L, 14158L, 14213L, 14221L, 14280L, 14368L, 14376L, 14390L, 14527L, 14598L, 14776L, 14850L, 14910L, 14942L, 15176L, 15356L, 15496L, 15505L, 15507L, 15566L, 15792L, 15824L, 15842L, 15951L, 16007L, 16331L, 16340L, 16345L, 16352L, 16406L, 16416L, 16471L, 16595L, 16656L, 16785L, 16869L, 16880L, 17217L, 17392L, 17461L, 17579L, 17582L, 17897L, 17948L, 18031L, 18195L, 18331L, 18378L, 18456L, 18459L, 18560L, 18590L, 18657L, 18820L, 18851L, 19034L, 19073L, 19181L, 19403L, 19689L, 19800L, 19851L, 19866L, 19918L, 19967L, 20026L, 20101L, 20104L, 20180L, 20225L, 20262L, 20549L, 20666L, 20737L, 20900L, 21116L, 21412L, 21725L, 21749L

I assume these errors are both down to the large size of my data (~100,000 cells x ~20000 genes) and the best approach would be to input PCA scores rather than the normalised expression values? Or is there another way around this?

Best wishes, Lucy

flying-sheep commented 1 year ago

Hi! I think you might be right:

long vectors not supported yet

might mean that your data is stored as a long vector, and something can’t deal with this.

Which R version are you using? If the Rinlinedfuns.h from your version is identical to the current trunk version, the error happens in this line: https://github.com/wch/r-source/blob/dac7eca95d50285a12addcf74ca42d82fc2bfe9b/src/include/Rinlinedfuns.h#L537 which looks weird: We can’t get the length of it?

What structure does your data matrix have? I assume it’s a sparse matrix, but it still has that many entries?
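
For context on the "long vectors" message: R calls a vector "long" when it has more than 2^31 - 1 elements, and a dense matrix is stored as a single vector of nrow * ncol elements. The sketch below checks how close the reported size is to that limit. The dimensions are the approximate figures from the report (~100,000 cells x ~20,000 genes), not the real ones, so the result is only indicative.

```r
# Approximate dimensions taken from the report; the exact values are unknown.
n_cells <- 1e5
n_genes <- 2e4

# A dense matrix is one vector of nrow * ncol elements.
n_elements <- n_cells * n_genes

# Largest length of an ordinary (non-long) vector: 2^31 - 1.
limit <- .Machine$integer.max

exceeds_limit <- n_elements > limit

# Memory needed for a dense double matrix (8 bytes per element), in GiB.
size_gib <- n_elements * 8 / 2^30

cat("elements:", format(n_elements, big.mark = ","), "\n")
cat("limit:   ", format(limit, big.mark = ","), "\n")
cat("exceeds limit:", exceeds_limit, "\n")
cat("dense size (GiB):", round(size_gib, 1), "\n")

# Hypothetical slightly larger data set: 110,000 cells would cross the limit.
exceeds_with_more_cells <- (1.1e5 * n_genes) > limit
```

With exactly 100,000 x 20,000 the element count stays just under the limit, but since both figures are approximate, a slightly larger matrix would become a long vector once converted to dense form.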

lucygarner commented 1 year ago

Hi,

No, it's a dense matrix. Are sparse matrices accepted? I was looking at dataset_extract_doublematrix (https://github.com/theislab/destiny/blob/master/R/dataset-helpers.r) and it appears to require either a matrix, data.frame, ExpressionSet, or SingleCellExperiment object.

Unless I use as.matrix() to convert my dgCMatrix into a dense matrix, is.matrix() gives FALSE.
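
The behaviour described here can be reproduced on a toy example. The sketch below assumes only the Matrix package (which provides the dgCMatrix class); the tiny matrix is made up for illustration. It also shows why the `!is.double(data)` check visible in the traceback would trigger the conversion to a dense matrix.

```r
library(Matrix)

# A tiny made-up sparse matrix: 3 rows, 2 columns, two non-zero entries.
sparse <- sparseMatrix(i = c(1, 3), j = c(1, 2), x = c(2.5, 1), dims = c(3, 2))

sparse_class <- class(sparse)[1]      # S4 class of the sparse object
sparse_is_matrix <- is.matrix(sparse) # base R does not see it as a matrix
sparse_is_double <- is.double(sparse) # nor as a double vector

# as.matrix() materialises every entry, including the zeros.
dense <- as.matrix(sparse)
dense_is_matrix <- is.matrix(dense)
dense_is_double <- is.double(dense)

print(sparse_class)
print(c(sparse_is_matrix = sparse_is_matrix, sparse_is_double = sparse_is_double))
print(c(dense_is_matrix = dense_is_matrix, dense_is_double = dense_is_double))
```

The dense copy holds nrow * ncol values regardless of how many are zero, which is where the memory cost of the conversion comes from.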

I am using R 4.2.0 and destiny 3.12.0.

Best wishes, Lucy

flying-sheep commented 1 year ago

Oh! Focusing on scanpy must have led to me neglecting to finish convenient sparse matrix support here. I’m sorry!

What you can do is use the distance matrix support. If you specify a “sparse distance matrix”* as the distance parameter and NULL or a covariate data frame as data, destiny will skip the KNN search itself.

covariates <- data.frame(...)  # cell metadata
dists <- N2R::Knn(data)  # I think N2R supports sparse data, but I don’t know
dm <- DiffusionMap(covariates, distance = dists)

*It’s a bit of an awkward format, as the non-specified entries in such a sparse matrix don’t stand for 0, but for “unknown large distance”.
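
To make the format concrete, here is a minimal sketch that builds such a sparse k-nearest-neighbour distance matrix by brute force on toy data, using base R and the Matrix package instead of N2R. The data, `k`, and all variable names are made up for illustration. The `sparseMatrix()` call mirrors the one visible in the traceback above, and whether the result also needs symmetrising before use is not covered here.

```r
library(Matrix)

set.seed(1)
n <- 20
k <- 3
x <- matrix(rnorm(n * 3), nrow = n)  # toy data: 20 observations, 3 features

# Brute-force pairwise distances; exclude each point from its own neighbours.
d <- as.matrix(dist(x))
diag(d) <- Inf

# n x k matrices of neighbour indices and the matching distances.
idx <- t(apply(d, 1, function(r) order(r)[seq_len(k)]))
dst <- t(sapply(seq_len(n), function(i) d[i, idx[i, ]]))

# Only the k nearest neighbours of each row are stored. Entries that are
# absent mean "unknown large distance", not a distance of zero.
knn_dists <- sparseMatrix(
  i = rep(seq_len(n), k),
  j = as.vector(idx),
  x = as.vector(dst),
  dims = c(n, n)
)

print(dim(knn_dists))
print(nnzero(knn_dists))  # n * k stored distances

# Following the suggestion above, this would then be passed on as:
# dm <- DiffusionMap(covariates, distance = knn_dists)
```

Brute force only works for small inputs; for data of the size discussed in this thread an approximate KNN method would be needed to produce the same structure.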

lucygarner commented 1 year ago

Thank you. I tried to run N2R::Knn on my "dgCMatrix" (normalised expression), but got an error.

Error in n2Knn(m = m, k = k, nThreads = nThreads, verbose = verbose, indexType = indexType, : Not compatible with requested type: [type=S4; target=double].

I have got DiffusionMap working with PCA embeddings as data, so I will try this for now.
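
A minimal sketch of that workaround, assuming a cells x genes matrix of normalised expression values. The toy data, the number of components, and the variable names are made up for illustration, and `prcomp()` stands in for whichever PCA routine was actually used. The `DiffusionMap` call only runs if destiny is installed.

```r
set.seed(42)

# Toy stand-in for normalised expression: 200 cells x 50 genes.
expr <- log1p(matrix(rpois(200 * 50, lambda = 2), nrow = 200))

# Reduce to a handful of principal components (10 is an arbitrary choice).
pca <- prcomp(expr, center = TRUE, scale. = FALSE, rank. = 10)
pcs <- pca$x  # cells x components, a small dense double matrix

print(dim(pcs))

# The PCA scores are passed as data in place of the full expression matrix.
if (requireNamespace("destiny", quietly = TRUE)) {
  dm <- destiny::DiffusionMap(pcs)
}
```

The KNN search then runs on a matrix with a few columns per cell rather than one per gene, which keeps the dense input far below the long-vector limit.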