mhahsler / dbscan

Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package
GNU General Public License v3.0
311 stars 65 forks source link

frNN object created from scratch couldn't be used in dbscan #49

Closed yuntianf closed 9 months ago

yuntianf commented 2 years ago

I have a large distance matrix and want to first build frNN object from scratch to reduce memory burden. I first initialize a frNN object with one node and then add my distance and node id to this object.

frNN_dis <- function(dis,eps = 0,if_direct = T){
    if(if_direct){
        dis <- rbind(dis,dis[,c(2,1,3)])
    }
    dis <- as.data.frame(dis)
    colnames(dis) <- c("umi1","umi2","dis")
    dis <- dis[order(dis$umi1),]
    out_frNN <- frNN(as.dist(1), eps = 5)
    out_frNN$dist <- split(dis$dis,dis$umi1)
    out_frNN$id <- split(dis$umi2,dis$umi1)
    out_frNN$sort <- F
    out_frNN <- frNN(out_frNN,eps = eps)
    return(out_frNN)
}

But when I used frNN object built in this way in dbscan, it caused a segfault error

*** caught segfault ***
address 0x51, cause 'memory not mapped'

Traceback:
 1: dbscan_int(x, as.double(eps), as.integer(minPts), as.double(weights),     as.integer(borderPoints), as.integer(search), as.integer(bucketSize),     as.integer(splitRule), as.double(approx), frNN)
 2: dbscan(cpp_test_nn, eps = 0, minPts = 1)
 3: eval(expr, envir, enclos)
 4: eval(expr, envir, enclos)
 5: withVisible(eval(expr, envir, enclos))
 6: withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler,     error = eHandler, message = mHandler)
 7: doTryCatch(return(expr), name, parentenv, handler)
 8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9: tryCatchList(expr, classes, parentenv, handlers)
10: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
11: try(f, silent = TRUE)
12: handle(ev <- withCallingHandlers(withVisible(eval(expr, envir,     enclos)), warning = wHandler, error = eHandler, message = mHandler))
13: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval(expr,     envir, enclos)), warning = wHandler, error = eHandler, message = mHandler)))
14: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos,     debug = debug, last = i == length(out), use_try = stop_on_error !=         2L, keep_warning = keep_warning, keep_message = keep_message,     output_handler = output_handler, include_timing = include_timing)
15: evaluate(request$content$code, envir = .GlobalEnv, output_handler = oh,     stop_on_error = 1L)
16: doTryCatch(return(expr), name, parentenv, handler)
17: tryCatchOne(expr, names, parentenv, handlers[[1L]])
18: tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
19: doTryCatch(return(expr), name, parentenv, handler)
20: tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]),     names[nh], parentenv, handlers[[nh]])
21: tryCatchList(expr, classes, parentenv, handlers)
22: tryCatch(evaluate(request$content$code, envir = .GlobalEnv, output_handler = oh,     stop_on_error = 1L), interrupt = function(cond) {    log_debug("Interrupt during execution")    interrupted <<- TRUE}, error = .self$handle_error)
23: executor$execute(msg)
24: handle_shell()
25: kernel$run()
26: IRkernel::main()
An irrecoverable exception occurred. R is aborting now ...
mhahsler commented 2 years ago

Hi, it looks like your frNN object is corrupt. I have now implemented a less memory-hungry implementation for the conversion from dist to frNN in the function frNN(). It is now also used directly by dbscan().

Please install the latest development version from GitHub and check if frNN(dis, eps = xxx) works for you.

yuntianf commented 2 years ago

Thanks! But I do have a very large sparse distance matrix, and I don't want to store it as a dist object, but build a frNN object from scratch instead. I'm wondering if it is feasible.

mhahsler commented 2 years ago

Yes, it is possible to directly convert a sparse representation of a symmetric distance matrix into a frNN object.
How is your sparse distance matrix stored? Using the Matrix package (more specifically the dsCMatrix class)? Do you use a package to create the sparse distance matrix?

yuntianf commented 2 years ago

No I didn't use any package, I just store the sparse distance as a long table, each row representing the distance between 2 nodes, if dsCMatrix could work for frNN, I will have a try, thanks!

yuntianf commented 2 years ago

Hi, I found that even I feed frNN() with a dsCMatrix, that function will transform it to normal matrix

    if (!.matrixlike(x)) 
        stop("x needs to be a matrix to calculate distances")
    x <- as.matrix(x)

For some really large matrix this step will still occupy too much memory, and this step will also fill sparse matrix with 0, which is usually inf in a distance matrix, and this may cause fome problems.

mhahsler commented 2 years ago

That is correct. So you use a triplet format for all non-infinite entries. This definitely can be converted into an frNN object without losing the sparseness. All that needs to be done is remove all the rows that are > eps and then collect the indices and distances into a frNN object. I guess that is what your code tries to do, but it seems not to work. If there was a standard sparse distance representation, then I would incorporate it directly into the package, but it seems to be not the case.

For similarities, a sparse representation where 0s are dropped would probably be more natural.