Closed igordot closed 5 years ago
Could you give me a traceback here? I have no idea which part of the code throws that.
Here is the more complete output.
> dm = DiffusionMap(t(exp_mat), n_eigs = 2, rotate = TRUE, verbose = TRUE, suppress_dpt = TRUE)
finding knns......done. Time: 22466.43s
Calculating transition probabilities...Error in .local(x, y, ...) : negative length vectors are not allowed
Timing stopped at: 456.3 136.2 592.5
> traceback()
15: .Call(dgeMatrix_crossprod, x, TRUE)
14: .local(x, y, ...)
13: tcrossprod(Matrix(sigma))
12: tcrossprod(Matrix(sigma))
11: .class1(object)
10: as(mat, "dsCMatrix")
9: withCallingHandlers(expr, message = function(c) invokeRestart("muffleMessage"))
8: suppressMessages(as(mat, "dsCMatrix")[mask])
7: m(tcrossprod(Matrix(sigma)))
6: no_censoring(dists, sigma, cb)
5: force(expr)
4: system.time({
r <- force(expr)
})
3: verbose_timing(verbose, "Calculating transition probabilities",
{
if (censor)
censoring(imputed_data, sigma, dists, censor_val,
censor_range, missing_range, cb)
else no_censoring(dists, sigma, cb)
})
2: transition_probabilities(imputed_data, sigma, knn$dist_mat, censor,
censor_val, censor_range, missing_range, verbose)
1: DiffusionMap(t(exp_mat), n_eigs = 2, rotate = TRUE, verbose = TRUE,
suppress_dpt = TRUE)
The error comes from R not being able to allocate a giant vector. It's happening here:
And 2 lines above that I wrote "TODO: optimize local sigma no-censoring case". I think the reason is simply that this is the only line where a dense matrix is still being created, which blows up. It should be possible to write this more efficiently. The code is basically:
huge_mat <- sigma %*% t(sigma)
S1 <- huge_mat[d2 != 0 & upper.tri.sparse(d2)]
huge_mat2 <- outer(sigma^2, sigma^2, `+`)
S2 <- huge_mat2[d2 != 0 & upper.tri.sparse(d2)]
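For context, a back-of-the-envelope calculation (using the 50,000-cell case reported in this issue) shows why those dense intermediates cannot work at that scale:

```r
# Each dense n x n double matrix costs 8 * n^2 bytes.
n <- 50000
gib <- 8 * n^2 / 1024^3
gib                         # ~18.6 GiB per dense intermediate
# n^2 also exceeds .Machine$integer.max (2^31 - 1), which is likely
# what surfaces as "negative length vectors are not allowed"
n^2 > .Machine$integer.max  # TRUE
```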
A more efficient way to do it would probably be simply
# convert to a symmetric triplet matrix to have
# all non-zero upper-triangle coordinates
# TODO: check if this leaves the coordinates in the same order!
coords <- as(d2, 'dsTMatrix')
S1 <- sigma[coords@i+1] * sigma[coords@j+1]
S2 <- (sigma^2)[coords@i+1] + (sigma^2)[coords@j+1]
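To sanity-check the triplet formulation, here is a small self-contained sketch (assuming the Matrix package; the tiny `d2` below is a stand-in for the real sparse distance matrix) comparing it against the dense version:

```r
library(Matrix)

n <- 5
sigma <- c(0.5, 1.0, 1.5, 2.0, 2.5)

# small symmetric sparse matrix standing in for the distance matrix d2
d2 <- sparseMatrix(i = c(1, 1, 2, 3), j = c(2, 4, 3, 5),
                   x = c(1.2, 0.8, 2.1, 0.5),
                   dims = c(n, n), symmetric = TRUE)

# dense reference -- the version that blows up at ~50,000 cells
dense <- as.matrix(d2)
mask <- dense != 0 & upper.tri(dense)
S1_dense <- tcrossprod(sigma)[mask]
S2_dense <- outer(sigma^2, sigma^2, `+`)[mask]

# sparse triplet version: for a symmetric matrix this yields a dsTMatrix
# whose 0-based (i, j) slots are the non-zero upper-triangle coordinates
coords <- as(d2, 'TsparseMatrix')
S1 <- sigma[coords@i + 1] * sigma[coords@j + 1]
S2 <- (sigma^2)[coords@i + 1] + (sigma^2)[coords@j + 1]

# the values agree; whether the *order* also matches (the TODO above)
# can be checked the same way with `==` on the unsorted vectors
all(sort(S1) == sort(S1_dense)) && all(sort(S2) == sort(S2_dense))
```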
Should be fixed in 1008274908db42e08e608236bab1df382795bf5e, can you check?
Looks like the update solved the problem. The process completed and the results seem reasonable.
Thank you for the quick fix!
Great!
Actually, I have an update on this issue. I tested the new version on a few different data sets. Most of them come out essentially the same, but there is one data set that produced very different results.
This is the result with 2.12:
And this is with the GitHub version:
The first one is roughly what one would expect and the second one is not. I tried subsampling the number of cells. I am showing 5000 cells above. Reducing it to 4500 yielded a plot that looks normal with both versions, so it seems like the input data is mostly okay.
Do you know what may be causing this?
I am running DiffusionMap() on an expression matrix of the 2,000 most variable genes. It works great for a few thousand cells. However, at around 50,000 cells (so a 50,000 x 2,000 input matrix), I get this error:
I tried adding suppress_dpt = TRUE, which seemed like the only option that might help, but it did not. Is there any way to resolve this?