jlmelville / uwot

An R package implementing the UMAP dimensionality reduction method.
https://jlmelville.github.io/uwot/
GNU General Public License v3.0
321 stars 31 forks source link

Seurat UMAP crashing with high numbers of cells #126

Open ollieeknight opened 4 months ago

ollieeknight commented 4 months ago

I'm running into an issue where Seurat keeps crashing (segmentation fault) when trying to run UMAP on a relatively large seurat object:

> alldata
An object of class Seurat
36443 features across 342124 samples within 2 assays
Active assay: RNA (36385 features, 2000 variable features)
 25 layers present: counts.layer1, counts.layer2, scale.data, data.layer1, data.layer2
 1 other assay present: ADT
 2 dimensional reductions calculated: pca, harmony_rna
> alldata <- RunUMAP(alldata, reduction = 'harmony_rna', dims = 1:30, n.components = 2,
                   umap.method = 'uwot', verbose = T, reduction.name = 'umap_rna',
          return.model = TRUE)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
UMAP will return its model
09:56:47 UMAP embedding parameters a = 0.9922 b = 1.112

09:56:47 Read 342124 rows and found 30 numeric columns
09:56:47 Using Annoy for neighbor search, n_neighbors = 30
09:56:47 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
09:57:45 Writing NN index file to temp file /fast/users/knighto_c/scratch/tmp/RtmpSeGRfX/file119f64ddb5f11
09:57:45 Searching Annoy index using 64 threads, search_k = 3000
09:58:01 Annoy recall = 100%
09:58:07 Commencing smooth kNN distance calibration using 64 threads with target n_neighbors = 30
09:58:23 Initializing from normalized Laplacian + noise (using RSpectra)

Could you help me troubleshoot? Any advice would be really great. To make sure it's not a memory issue, the last environment I tested this in had 64cores and 600gb RAM.

For the sake of privacy for my work, I hid the names of the layers, but there is indeed 25 of them and the harmony integration (saved as 'harmony_rna') worked perfectly

the error message I get is (cut off 1., it's a bunch of numbers, I assume some kind of output matrix)

``` 0990436955, -0.00446925974529167, -0.00604131468398329, -0.0705538584465636, -0.0131259413785549, -0.00383610019588497, -0.000335999028038046, -0.00142667850009277, -0.00356449826816127, -0.0163156904478564, -0.00801140127040875, -0.00114604327326075, -0.0098531432215265, -0.0416964502518092, -0.00474997663372142, -0.000467554702717314, -0.00723439078328324, -0.0622144353404406, -0.00780773572334401, -0.0501536885948031, -0.0217347848781538, -0.0128093393437481, -0.00452871789141949, -0.00412176456564197, -0.0148101904326151, -0.00533779483503008, -0.0592126354918537, -0.0869179734529376, 1, -0.021372509824796, -0.0565559101604794, -0.0146930310342102, -0.0113959932734392, -0.0222464682847417, -0.0287239332616513, -0.0115110269560471, -0.0299895502329343, -0.00896123533752184, -0.0218367796829283, -0.0128746805731773, -0.0102791252479695, -0.0102054265969284, -0.00846198781123482, -0.0102546824573139, -0.00833878640564498, -0.0101889722910684, -0.0163619634557634, -0.0117671939315179, -0.00942037413351387, -0.0073451098376736, -0.00736121091787225, -0.0405225508244596, -0.0648896791347744, -0.0159972801985926, -0.0369320826355839, -0.0216466852356735, -0.126986757224209, -0.00811740652556548, -0.105180659868323, -0.0114647835422552, -0.0108825551784113, -0.0151054547045992, 1), factors = list()), 342124L, 3L, list(which = c(LM = 0L), ncv = 20, tol = 0.0001, maxitr = 1000, retvec = TRUE, user_initvec = FALSE, sigma = 0, use_lower = TRUE), 6L, PACKAGE = "RSpectra") 2: do.call(.Call, args = dot_call_args) 3: eigs_real_sym(A, nrow(A), k, which, sigma, opts, mattype = "sym_dgCMatrix", extra_args = list(use_lower = as.logical(lower))) 4: eigs_sym.dgCMatrix(L, k = k, which = "LM", sigma = 0, opt = opt) 5: RSpectra::eigs_sym(L, k = k, which = "LM", sigma = 0, opt = opt) 6: doTryCatch(return(expr), name, parentenv, handler) 7: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 8: tryCatchList(expr, classes, parentenv, handlers) 9: tryCatch(RSpectra::eigs_sym(L, k = k, which = "LM", sigma = 0, opt = opt), error = function(c) { NULL}) 10: withCallingHandlers(expr, warning = function(w) if (inherits(w, classes)) tryInvokeRestart("muffleWarning")) 11: suppressWarnings(res <- tryCatch(RSpectra::eigs_sym(L, k = k, which = "LM", sigma = 0, opt = opt), error = function(c) { NULL})) 12: rspectra_eigs_sym(L, ndim, verbose = verbose) 13: rspectra_normalized_laplacian_init(A, ndim, verbose = FALSE) 14: spectral_init(V, ndim = n_components, verbose = verbose) 15: uwot(X = X, n_neighbors = n_neighbors, n_components = n_components, metric = metric, n_epochs = n_epochs, alpha = learning_rate, scale = scale, init = init, init_sdev = init_sdev, spread = spread, min_dist = min_dist, set_op_mix_ratio = set_op_mix_ratio, local_connectivity = local_connectivity, bandwidth = bandwidth, gamma = repulsion_strength, negative_sample_rate = negative_sample_rate, a = a, b = b, nn_method = nn_method, n_trees = n_trees, search_k = search_k, method = "umap", approx_pow = approx_pow, n_threads = n_threads, n_sgd_threads = n_sgd_threads, grain_size = grain_size, y = y, target_n_neighbors = target_n_neighbors, target_weight = target_weight, target_metric = target_metric, pca = pca, pca_center = pca_center, pca_method = pca_method, pcg_rand = pcg_rand, fast_sgd = fast_sgd, ret_model = ret_model || "model" %in% ret_extra, ret_nn = ret_nn || "nn" %in% ret_extra, ret_fgraph = "fgraph" %in% ret_extra, ret_sigma = "sigma" %in% ret_extra, ret_localr = "localr" %in% ret_extra, batch = batch, opt_args = opt_args, epoch_callback = epoch_callback, binary_edge_weights = binary_edge_weights, tmpdir = tempdir(), verbose = verbose, dens_scale = dens_scale, seed = seed, nn_args = nn_args) 16: umap(X = object, n_threads = nbrOfWorkers(), n_neighbors = as.integer(x = n.neighbors), n_components = as.integer(x = n.components), metric = metric, n_epochs = n.epochs, learning_rate = learning.rate, min_dist = min.dist, spread = spread, set_op_mix_ratio = set.op.mix.ratio, local_connectivity = local.connectivity, repulsion_strength = repulsion.strength, negative_sample_rate = negative.sample.rate, a = a, b = b, fast_sgd = uwot.sgd, verbose = verbose, ret_model = return.model) 17: RunUMAP.default(object = data.use, reduction.model = reduction.model, return.model = return.model, assay = assay, umap.method = umap.method, n.neighbors = n.neighbors, n.components = n.components, metric = metric, n.epochs = n.epochs, learning.rate = learning.rate, min.dist = min.dist, spread = spread, set.op.mix.ratio = set.op.mix.ratio, local.connectivity = local.connectivity, repulsion.strength = repulsion.strength, negative.sample.rate = negative.sample.rate, a = a, b = b, uwot.sgd = uwot.sgd, seed.use = seed.use, metric.kwds = metric.kwds, angular.rp.forest = angular.rp.forest, densmap = densmap, dens.lambda = dens.lambda, dens.frac = dens.frac, dens.var.shift = dens.var.shift, reduction.key = reduction.key %||% Key(object = reduction.name, quiet = TRUE), verbose = verbose) 18: RunUMAP(object = data.use, reduction.model = reduction.model, return.model = return.model, assay = assay, umap.method = umap.method, n.neighbors = n.neighbors, n.components = n.components, metric = metric, n.epochs = n.epochs, learning.rate = learning.rate, min.dist = min.dist, spread = spread, set.op.mix.ratio = set.op.mix.ratio, local.connectivity = local.connectivity, repulsion.strength = repulsion.strength, negative.sample.rate = negative.sample.rate, a = a, b = b, uwot.sgd = uwot.sgd, seed.use = seed.use, metric.kwds = metric.kwds, angular.rp.forest = angular.rp.forest, densmap = densmap, dens.lambda = dens.lambda, dens.frac = dens.frac, dens.var.shift = dens.var.shift, reduction.key = reduction.key %||% Key(object = reduction.name, quiet = TRUE), verbose = verbose) 19: RunUMAP.Seurat(alldata, reduction = "harmony_rna", dims = 1:30, n.components = 2, umap.method = "uwot", verbose = T, reduction.name = "umap_rna", return.model = TRUE) 20: RunUMAP(alldata, reduction = "harmony_rna", dims = 1:30, n.components = 2, umap.method = "uwot", verbose = T, reduction.name = "umap_rna", return.model = TRUE) Possible actions: 1: abort (with core dump, if enabled) 2: normal R exit 3: exit R without saving workspace 4: exit R saving workspace Selection: ```
SamGG commented 4 months ago

Hi, The problem arises from the initialisation phase of umap, which is "spectral" by default. You don't have option to change this as far I understand Seurat code. This is not a problem of memory size as the data passed to uwot::umap originate from the reduction slot, here "harmony_rna" (the data matrix is from "Read 342124 rows and found 30 numeric columns"). I don't know this transform. If you have chance to perform a PCA or sparse PCA instead, this is what I would try. My two cents, Samuel

ollieeknight commented 4 months ago

thanks for your response, I appreciate it. I'm also trying to figure out which package update has caused this issue, as I previously had it working fine for even larger datasets.

running this, it now seems to have worked well:

alldata_umap <- uwot::umap2(alldata[['harmony_rna']]@cell.embeddings,
                            n_neighbors = 30, n_components = 2, 
                            metric = 'cosine', min_dist = 0.3, ret_model = T, 
                            n_threads = 32, verbose = T)
alldata[['umap_rna']] <- CreateDimReducObject(embeddings = alldata_umap$embedding,
                                                        key = 'harmonyrna_', 
                                                        assay = 'RNA')
alldata[['umap_rna']]@misc$model <- alldata_umap

although I can't seem to figure out why it isn't working in Seurat. I'll open an bug issue there, unless you have any suggestions first!

SamGG commented 4 months ago

Great to know that you are able to dive into the code.

jlmelville commented 4 months ago

@SamGG you can read more about umap2 at https://jlmelville.github.io/uwot/articles/umap2.html.

@ollieeknight can you try repeating the call to umap2 that completed but also add verbose = TRUE? Then repeat this but use the umap function and see if it segfaults? At any rate the seeing the output for both cases might be helpful.

ollieeknight commented 4 months ago

@jlmelville thanks picking this up, your input here is really appreciated, as well as your constant development on this package. thanks @SamGG also for your advice.

Here is the output of umap2:

11:22:24 Using HNSW for nearest neighbor search
11:22:24 UMAP embedding parameters a = 0.9922 b = 1.112
11:22:24 Read 342124 rows and found 50 numeric columns
11:22:24 Building HNSW index with metric 'cosine' ef = 200 M = 16 using 32 threads
11:22:30 Finished building index
11:22:30 Searching HNSW index with ef = 30 and 32 threads
11:22:31 Finished searching
11:22:33 Commencing smooth kNN distance calibration using 32 threads with target n_neighbors = 30
11:22:44 Initializing from normalized Laplacian + noise (using RSpectra)
11:27:00 Range-scaling initial input columns to 0-10
11:27:04 Commencing optimization for 500 epochs, with 15578248 positive edges
Using method 'umap'
Optimizing with Adam alpha = 1 beta1 = 0.5 beta2 = 0.9 eps = 1e-07
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|

umap still aborts, but umap2 runs to completion and is successful.

jlmelville commented 4 months ago

The first output says:

Read 342124 rows and found 30 numeric columns

Your second output with umap2 says:

Read 342124 rows and found 50 numeric columns

so there's something different there. @ollieeknight can you confirm that when you run with umap it also reads 50 numeric columns?

Based on the current output, umap2 is using HNSW for nearest neighbor search, rather than annoy so it could be that using HNSW is giving a slightly different input graph to RSpectra. Unfortunately, without looking at the data directly it will be very hard for me to diagnose what's happening. Either the neighbor graph is causing a problem for RSpectra or due to a failure of the nearest neighbor search, the input data is somehow completely wrong and uwot needs to detect this and stop.

You could try running umap with nn_method = "hnsw" or in the case of using Annoy, increase one or both of search_k and n_trees to see if a better neighbor graph helps, but I don't know if those parameters are exposed in Seurat.