`scSHC` error: `Error in fitdistr(Qclusts2, "normal") : 'x' must be a non-empty numeric vector`

mschilli87 commented 2 months ago

I have a Seurat object that I extract the (sparse) count matrix from as follows:

count_mat <- LayerData(seurat_object, assay = "RNA", layer = "counts")

Next, I remove zero-sum rows (actually, I am even slightly more stringent than that):

count_mat <- count_mat[rowSums(count_mat) >= 3, ]

Finally, I also drop genes with low variance to prevent potential issues when factorizing the matrix:

# Warning: Depending in the dataset this can use *a lot* of RAM:
count_mat <- count_mat[resample::colVars(t(as.matrix(count_mat))) > .01, ]

However, even with all those safeguards in place, I cannot cluster the cells using scSHC:

# Warning: Adjust the number of cores, if necessary, before running this on your machine:
scSHC_results <- scSHC::scSHC(count_mat, cores = 16L)

This runs for a long time but eventually erorrs out with the following (uninformative) error message:

Error in fitdistr(Qclusts2, "normal") :
  'x' must be a non-empty numeric vector

Any advice on how to troubleshoot this further would be greatly appreciated.

igrabski commented 2 months ago

Hi Marcel, thanks for providing all this information. I have a few thoughts: first, what are the dimensions of your data count_mat? I noticed you are using a large number of cores, and I wonder if this is resulting in memory issues -- since mclapply can produce NULL values when memory issues are encountered, this may explain the error message you received. Do you encounter the same error if you turn off parallelization (parallel = F) (maybe subsampling the cells in your data if needed to control the runtime)? Second, although I see you are extracting the counts layer from your object, it may be worth verifying that these are indeed all integer counts -- I am not sure if the object was generated by yourself or someone else, but I have sometimes seen counts layers in Seurat objects that actually contain log-normalized values, which would result in an error.

mschilli87 commented 2 months ago

Hi Isabella. Thank you for getting back at me. My count_mat is quite large indeed (~ 20 K features x ~ 65 K cells; ~ 650 Mb as (dgCMatrix) sparse, and ~ 10.3 Gb as dense (matrix) matrix), but I did monitor the process during execution and I didn't see any excessive memory usage (I have almost 1/2 TB of RAM available on the noce I am running this). I certainly can try reducing the number of cores (or even switch to sequential processing) but since the runtime until hitting that error is already rather long using 16 parallel threads, I didn't feel like trying that earlier. 😅 I will do so though and report back to you. Regarding the counts, they are indeed integers (as in raw counts, I created the object myself), however, simply calling as.matrix on the dgCMatrix that I pass as count_mat (which I guess scSHC does internally) results in a (general) numeric matrix. I will explicitely pass it through as.integer for my next test, just to be sure (which, btw, also reduces the memory footprint of the dense matrix to just about half). Also if that is a problem, maybe a simple if(!all(is.integer(count_mat))) stop("count_mat must contain integers") in the beginning of the scSHC function could capture this problem and fail gracefully (as in fast and informative) rather than with a cryptic error after waiting for a long time. 😉

igrabski commented 2 months ago

Thanks so much! If possible (maybe with subsampling the cells to control computational runtime), it would be great if you could try with 1 core as this may result in a more direct error message (due to the behavior of mclapply when some cores encounter errors) whether or not there are memory issues. It sounds like the input should be fine based on your description, but I agree that maybe adding some simple check as you suggested could be helpful more broadly!

mschilli87 commented 2 months ago

I passed my dgCMatrix through as.matrix and set the resulting matrix'es mode to "integer" before passing it to scSHC with parallel=FALSE. After a long time, I got the following error:

Error in eigs_real_sym(A, nrow(A), k, which, sigma, opts, mattype = "sym_matrix",  : 
  dimension of 'A' must be at least 3

igrabski commented 2 months ago

Thanks! Would you be able to share your data with me so I can investigate further? My email address is igrabski[at]nygenome[dot]org.

On Tue, Sep 3, 2024, 5:37 PM Marcel Schilling @.***> wrote:

I passed my dgCMatrix through as.matrix and set the resulting matrix'es mode to "integer" before passing it to scSHC with parallel=FALSE. After a long time, I got the following error:

Error in eigs_real_sym(A, nrow(A), k, which, sigma, opts, mattype = "sym_matrix", : dimension of 'A' must be at least 3

— Reply to this email directly, view it on GitHub https://github.com/igrabski/sc-SHC/issues/25#issuecomment-2327485214, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANOBY2V4T2U3Q3ZC37QXGSDZUYT27AVCNFSM6AAAAABNKFIA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRXGQ4DKMRRGQ . You are receiving this because you commented.Message ID: @.***>

--

This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.

mschilli87 commented 2 months ago

Unfortunately I am not allowed to share these data.

igrabski commented 2 months ago

Hmm, is there any possibility of sharing the data if you scramble the cell barcodes and the gene names (importantly, the matrix is the same / untouched, but the cell and gene labels are randomized)? I have a couple ideas of what might be causing the error, but it is difficult without having a test case.

On Tue, Sep 3, 2024, 6:21 PM Marcel Schilling @.***> wrote:

Unfortunately I am not allowed to share these data.

— Reply to this email directly, view it on GitHub https://github.com/igrabski/sc-SHC/issues/25#issuecomment-2327543075, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANOBY2TSMWVN72KVQNSUOYTZUYY5BAVCNFSM6AAAAABNKFIA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRXGU2DGMBXGU . You are receiving this because you commented.Message ID: @.***>

--

This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.

mschilli87 commented 2 months ago

No, I generally have no permission to share these data. This is not up to me but I don't think there is a way around it. I can run more tests for you though.

igrabski commented 2 months ago

Gotcha, understood. If I send you a debugging script to save some internal quantities / outputs from the code as an RDA file, would you be comfortable sending me that output? It would just be some quantities produced by the code at the specific step where the error occurs, and would not include the data itself and the data would not be trace-able / recoverable from those quantities.

On Tue, Sep 3, 2024, 6:30 PM Marcel Schilling @.***> wrote:

No, I generally have no permission to share these data. This is not up to me but I don't think there is a way around it. I can run more tests for you though.

— Reply to this email directly, view it on GitHub https://github.com/igrabski/sc-SHC/issues/25#issuecomment-2327553126, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANOBY2SS5S7NZFHTQHH5HYTZUY2BZAVCNFSM6AAAAABNKFIA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRXGU2TGMJSGY . You are receiving this because you commented.Message ID: @.***>

--

This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.

igrabski commented 2 months ago

If that's possible, I've attached a script here. To run it, you should download this script, source it (do not call library(scSHC)!), and then run your scSHC command as you have been (feel free to use any number of cores, since that does not seem to relate to the problem).

This script does two things: first, it will print out some summary statistics (dimensions, rowMeans, etc.) of the subset of the data that is causing the error, and second, it will save an RDA file called output.rda containing some various internally computed quantities. If you are able, it would be great if you could share both these summary statistics and the RDA file. If you are not able to share, though, please run it anyway and I will at least ask you questions about the output (but of course it is much easier for me if I am able to look at these quantities myself). Thanks!

mschilli87 commented 2 months ago

Dear Isabella,

Thank you for your ongoing suppert.

This is what I get from your debugging script:

[1] 12924 64549
[1] 37
[1] 45
[1] 21
[1] 0
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.03658  0.14634  0.47336  0.41463 13.93902
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1022    1164    1189    1183    1215    1268                                              
Error in if (test < alpha_level) { :                                                                                                  
 missing value where TRUE/FALSE needed

I'll send the output.rda file by mail.

igrabski commented 2 months ago

Thank you! I think I have identified the bug (a corner case) -- I made a new branch and pushed a fix. You can install this updated version using devtools::install_github("igrabski/sc-SHC",ref="on_genes_corner_case"). Please give this version a try (and for now keep the number of cores set to 1, to ensure a more explicit error message in case there is still a problem), and let me know if you still run into an issue!

igrabski commented 2 months ago

Apologies, that install command should be devtools::install_github("igrabski/sc-SHC",ref="on_genes_corner_case",force=TRUE).

mschilli87 commented 2 months ago

Dear Isabella,

With the version from that branch and parallel=FALSE I get

Error in eigs_real_sym(A, nrow(A), k, which, sigma, opts, mattype = "sym_matrix",  :
  dimension of 'A' must be at least 3

with my count_mat.

igrabski commented 2 months ago

Hmm, that seems odd. Just to double-check, are you running these with a seed set? And did the code run for the same amount of time before erroring out, or longer? I am pretty confident the fix should have addressed the specific case that came up with the debugging script, though maybe this error arose again at a later iteration with a different subset of the data.

On Fri, Sep 6, 2024, 5:04 AM Marcel Schilling @.***> wrote:

Dear Isabella,

With the version from that branch and parallel=FALSE I get

Error in eigs_real_sym(A, nrow(A), k, which, sigma, opts, mattype = "sym_matrix", : dimension of 'A' must be at least 3

with my count_mat.

— Reply to this email directly, view it on GitHub https://github.com/igrabski/sc-SHC/issues/25#issuecomment-2333595806, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANOBY2RTWQBTRXGRBU2BPPLZVFVZJAVCNFSM6AAAAABNKFIA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMZTGU4TKOBQGY . You are receiving this because you commented.Message ID: @.***>

--

This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.

igrabski commented 2 months ago

Assuming that it did fail at a later point, I do actually have one more thought -- I just updated that same branch, which you can again reinstall with devtools::install_github("igrabski/sc-SHC",ref="on_genes_corner_case",force=TRUE). If that doesn't work, I can send a new version of a debugging script. (And thanks again for your patience with this back-and-forth! It's difficult without being able to test on data myself, though I appreciate your help with running all these tests!)

igrabski commented 2 months ago

And actually, just to save time, here is a new version of the debugging script -- maybe it makes sense to try this first (which you can do with parallelization).

mschilli87 commented 2 months ago

I use set.seed and didn't measure runtime as I let it run overnight. Thank you for you help. I'll test the new debugging script next.

mschilli87 commented 2 months ago

The updated debugging script finished without error (using 16 cores). 🎉 I'll try updating from the branch and running the regular function to double-check. 🤞 Thank you for you help. ❤️ Is there any chance to get those changes in a new version release? That would make referencing it for publication and distributing it to other users on our server much easier.

igrabski commented 2 months ago

That is great to hear!!

Re: the new version release, would it work for you if I set these changes to a pre-release branch and give it a pre-release version 0.1.0.9001?

igrabski commented 1 month ago

All set with the prerelease branch!

mschilli87 commented 1 month ago

I don't see that prerelease anywhere. Also, I get more > 400 clusters, most of them with < 100 cells. Is that normal or could I still hit some corner case with my dataset?

igrabski commented 1 month ago

The prerelease branch (with version bump) is located here: https://github.com/igrabski/sc-SHC/tree/prerelease and can be installed by specifying ref="prerelease" in the install_github command.

Regarding the large number of clusters, that is not typical but I have seen it once before. You can check if there is anything unusual about those cells in the tiny clusters, but otherwise unfortunately it is possible this model/algorithm is not a good fit for this data.

On Sat, Sep 7, 2024, 9:53 AM Marcel Schilling @.***> wrote:

I don't see that prerelease anywhere. Also, I get more > 400 clusters, most of them with < 100 cells. Is that normal or could I still hit some corner case with my dataset?

— Reply to this email directly, view it on GitHub https://github.com/igrabski/sc-SHC/issues/25#issuecomment-2335194306, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANOBY2QV3NQ2Z34MSYF7FRTZVMANNAVCNFSM6AAAAABNKFIA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMZVGE4TIMZQGY . You are receiving this because you commented.Message ID: @.***>

--

This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.

mschilli87 commented 1 month ago

OK. Thank you for all your support. I'll leave this open as the bug is not fixed in any release but feel free to close the issue if you consider it done with the prerelease.

igrabski / sc-SHC

`scSHC` error: `Error in fitdistr(Qclusts2, "normal") : 'x' must be a non-empty numeric vector` #25