segfault when converting h5ad to SCE

joseph-siefert commented 1 year ago

Thanks for the great tool. Unfortunately, I am getting a segmentation fault when converting a large dataset. A smaller subset works without issue. Here is the error from the large dataset:



```
 *** caught segfault ***
address 0x2ae45ba3f000, cause 'memory not mapped'

Traceback:
 1: py_ref_to_r(x)
 2: py_to_r.default(x)
 3: NextMethod()
 4: py_to_r.numpy.ndarray(x)
 5: py_to_r(x)
 6: as_r_value(x$indices)
 7: .nextMethod(.Object = .Object, ... = ...)
 8: callNextMethod()
 9: initialize(value, ...)
10: initialize(value, ...)
11: new("dgRMatrix", j = as.integer(as_r_value(x$indices)), p = as.integer(as_r_value(x$indptr)),     x = as.vector(as_r_value(x$data)), Dim = as.integer(dim(x)))
12: py_to_r.scipy.sparse.csr.csr_matrix(mat)
13: py_to_r(mat)
14: t(py_to_r(mat))
15: doTryCatch(return(expr), name, parentenv, handler)
16: tryCatchOne(expr, names, parentenv, handlers[[1L]])
17: tryCatchList(expr, classes, parentenv, handlers)
18: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call, nlines = 1L)        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
19: try(t(py_to_r(mat)), silent = TRUE)
20: .extract_or_skip_assay(skip_assays = skip_assays, hdf5_backed = hdf5_backed,     dims = dims, mat = adata$X, name = "'X' matrix")
21: AnnData2SCE(adata, X_name = X_name, hdf5_backed = backed, verbose = verbose,     ...)
22: fun(...)
23: basiliskRun(env = env, fun = .H5ADreader, file = file, X_name = X_name,     backed = use_hdf5, verbose = verbose, ...)
24: readH5AD("full_dataset.h5ad",     verbose = TRUE)
An irrecoverable exception occurred. R is aborting now ...
/cm/local/apps/uge/var/spool/chbscl-50-10/job_scripts/10091596: line 23: 79746 Segmentation fault      (core dumped)```

lazappi commented 1 year ago

Hi @joseph-siefert

Thanks for giving {zellkonverter} a go! Are you able to share the dataset at all? It's a bit hard to say if this is a dataset issue or something to do with your setup. I assume you don't have any issues reading the file in Python?

joseph-siefert commented 1 year ago

Unfortunately I can't share the dataset. I removed the layers and was able to get past the above error; however, a new error arose: `'X' matrix does not support transposition and has been skipped`. I think both errors are due to memory limitations, as I was able to circumvent them by subsampling the matrix. The machine has plenty of memory available, so the problem seems to be related to the memory available to R during the matrix conversion. Is there another, more memory-efficient way to make the conversion and avoid R memory limits?

lazappi commented 1 year ago

Sorry for the slow response; I thought I had replied to this, but obviously not. The way the conversion works, there are two copies of the data in memory, one in Python and one in R, so the memory requirement can be large for big datasets. One approach to try is the HDF5-backed mode, which should help with this. I also made some fixes for this specific message recently, so it might be worth trying the most recent version (see #96).
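
As a rough sketch of the HDF5-backed suggestion: the traceback above shows `readH5AD()` forwarding a `use_hdf5` value to the reader, so setting it to `TRUE` should request HDF5-backed assays instead of a full in-memory conversion. The wrapper name `read_h5ad_backed` is hypothetical and used only for illustration, and the example assumes {zellkonverter} and its HDF5 dependencies are installed. Whether this avoids the segfault for a given dataset is not guaranteed.

```r
# Sketch only: assumes the zellkonverter package is installed.
# read_h5ad_backed() is a hypothetical helper, not part of zellkonverter.
read_h5ad_backed <- function(path) {
  # use_hdf5 = TRUE asks readH5AD() for HDF5-backed assays rather than
  # converting the whole matrix to an in-memory R object.
  zellkonverter::readH5AD(path, use_hdf5 = TRUE, verbose = TRUE)
}

# Example call (not run here), using the file name from the traceback:
# sce <- read_h5ad_backed("full_dataset.h5ad")
```

If the backed read succeeds, the resulting object can be inspected as usual, and only the parts of the matrix that are actually needed have to be realized in memory.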

theislab / zellkonverter

segfault when converting h5ad to SCE #95