Open mschilli87 opened 2 months ago
Hi Marcel, thanks for providing all this information. I have a few thoughts: first, what are the dimensions of your data count_mat
? I noticed you are using a large number of cores, and I wonder if this is resulting in memory issues -- since mclapply
can produce NULL
values when memory issues are encountered, this may explain the error message you received. Do you encounter the same error if you turn off parallelization (parallel = F
) (maybe subsampling the cells in your data if needed to control the runtime)? Second, although I see you are extracting the counts
layer from your object, it may be worth verifying that these are indeed all integer counts -- I am not sure if the object was generated by yourself or someone else, but I have sometimes seen counts
layers in Seurat objects that actually contain log-normalized values, which would result in an error.
Hi Isabella. Thank you for getting back at me. My count_mat
is quite large indeed (~ 20 K features x ~ 65 K cells; ~ 650 Mb as (dgCMatrix
) sparse, and ~ 10.3 Gb as dense (matrix
) matrix), but I did monitor the process during execution and I didn't see any excessive memory usage (I have almost 1/2 TB of RAM available on the noce I am running this). I certainly can try reducing the number of cores (or even switch to sequential processing) but since the runtime until hitting that error is already rather long using 16 parallel threads, I didn't feel like trying that earlier. 😅 I will do so though and report back to you.
Regarding the counts, they are indeed integers (as in raw counts, I created the object myself), however, simply calling as.matrix
on the dgCMatrix
that I pass as count_mat
(which I guess scSHC
does internally) results in a (general) numeric
matrix. I will explicitely pass it through as.integer
for my next test, just to be sure (which, btw, also reduces the memory footprint of the dense matrix to just about half). Also if that is a problem, maybe a simple if(!all(is.integer(count_mat))) stop("count_mat must contain integers")
in the beginning of the scSHC function could capture this problem and fail gracefully (as in fast and informative) rather than with a cryptic error after waiting for a long time. 😉
Thanks so much! If possible (maybe with subsampling the cells to control computational runtime), it would be great if you could try with 1 core as this may result in a more direct error message (due to the behavior of mclapply
when some cores encounter errors) whether or not there are memory issues. It sounds like the input should be fine based on your description, but I agree that maybe adding some simple check as you suggested could be helpful more broadly!
I passed my dgCMatrix
through as.matrix
and set the resulting matrix
'es mode
to "integer"
before passing it to scSHC
with parallel=FALSE
.
After a long time, I got the following error:
Error in eigs_real_sym(A, nrow(A), k, which, sigma, opts, mattype = "sym_matrix", :
dimension of 'A' must be at least 3
Thanks! Would you be able to share your data with me so I can investigate further? My email address is igrabski[at]nygenome[dot]org.
On Tue, Sep 3, 2024, 5:37 PM Marcel Schilling @.***> wrote:
I passed my dgCMatrix through as.matrix and set the resulting matrix'es mode to "integer" before passing it to scSHC with parallel=FALSE. After a long time, I got the following error:
Error in eigs_real_sym(A, nrow(A), k, which, sigma, opts, mattype = "sym_matrix", : dimension of 'A' must be at least 3
— Reply to this email directly, view it on GitHub https://github.com/igrabski/sc-SHC/issues/25#issuecomment-2327485214, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANOBY2V4T2U3Q3ZC37QXGSDZUYT27AVCNFSM6AAAAABNKFIA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRXGQ4DKMRRGQ . You are receiving this because you commented.Message ID: @.***>
--
This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.
Unfortunately I am not allowed to share these data.
Hmm, is there any possibility of sharing the data if you scramble the cell barcodes and the gene names (importantly, the matrix is the same / untouched, but the cell and gene labels are randomized)? I have a couple ideas of what might be causing the error, but it is difficult without having a test case.
On Tue, Sep 3, 2024, 6:21 PM Marcel Schilling @.***> wrote:
Unfortunately I am not allowed to share these data.
— Reply to this email directly, view it on GitHub https://github.com/igrabski/sc-SHC/issues/25#issuecomment-2327543075, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANOBY2TSMWVN72KVQNSUOYTZUYY5BAVCNFSM6AAAAABNKFIA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRXGU2DGMBXGU . You are receiving this because you commented.Message ID: @.***>
--
This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.
No, I generally have no permission to share these data. This is not up to me but I don't think there is a way around it. I can run more tests for you though.
Gotcha, understood. If I send you a debugging script to save some internal quantities / outputs from the code as an RDA file, would you be comfortable sending me that output? It would just be some quantities produced by the code at the specific step where the error occurs, and would not include the data itself and the data would not be trace-able / recoverable from those quantities.
On Tue, Sep 3, 2024, 6:30 PM Marcel Schilling @.***> wrote:
No, I generally have no permission to share these data. This is not up to me but I don't think there is a way around it. I can run more tests for you though.
— Reply to this email directly, view it on GitHub https://github.com/igrabski/sc-SHC/issues/25#issuecomment-2327553126, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANOBY2SS5S7NZFHTQHH5HYTZUY2BZAVCNFSM6AAAAABNKFIA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRXGU2TGMJSGY . You are receiving this because you commented.Message ID: @.***>
--
This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.
If that's possible, I've attached a script here. To run it, you should download this script, source it (do not call library(scSHC)
!), and then run your scSHC
command as you have been (feel free to use any number of cores, since that does not seem to relate to the problem).
This script does two things: first, it will print out some summary statistics (dimensions, rowMeans, etc.) of the subset of the data that is causing the error, and second, it will save an RDA file called output.rda
containing some various internally computed quantities. If you are able, it would be great if you could share both these summary statistics and the RDA file. If you are not able to share, though, please run it anyway and I will at least ask you questions about the output (but of course it is much easier for me if I am able to look at these quantities myself). Thanks!
Dear Isabella,
Thank you for your ongoing suppert.
This is what I get from your debugging script:
[1] 12924 64549
[1] 37
[1] 45
[1] 21
[1] 0
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.03658 0.14634 0.47336 0.41463 13.93902
Min. 1st Qu. Median Mean 3rd Qu. Max.
1022 1164 1189 1183 1215 1268
Error in if (test < alpha_level) { :
missing value where TRUE/FALSE needed
I'll send the output.rda
file by mail.
Thank you! I think I have identified the bug (a corner case) -- I made a new branch and pushed a fix. You can install this updated version using devtools::install_github("igrabski/sc-SHC",ref="on_genes_corner_case")
. Please give this version a try (and for now keep the number of cores set to 1, to ensure a more explicit error message in case there is still a problem), and let me know if you still run into an issue!
Apologies, that install command should be devtools::install_github("igrabski/sc-SHC",ref="on_genes_corner_case",force=TRUE)
.
Dear Isabella,
With the version from that branch and parallel=FALSE
I get
Error in eigs_real_sym(A, nrow(A), k, which, sigma, opts, mattype = "sym_matrix", :
dimension of 'A' must be at least 3
with my count_mat
.
Hmm, that seems odd. Just to double-check, are you running these with a seed set? And did the code run for the same amount of time before erroring out, or longer? I am pretty confident the fix should have addressed the specific case that came up with the debugging script, though maybe this error arose again at a later iteration with a different subset of the data.
On Fri, Sep 6, 2024, 5:04 AM Marcel Schilling @.***> wrote:
Dear Isabella,
With the version from that branch and parallel=FALSE I get
Error in eigs_real_sym(A, nrow(A), k, which, sigma, opts, mattype = "sym_matrix", : dimension of 'A' must be at least 3
with my count_mat.
— Reply to this email directly, view it on GitHub https://github.com/igrabski/sc-SHC/issues/25#issuecomment-2333595806, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANOBY2RTWQBTRXGRBU2BPPLZVFVZJAVCNFSM6AAAAABNKFIA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMZTGU4TKOBQGY . You are receiving this because you commented.Message ID: @.***>
--
This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.
Assuming that it did fail at a later point, I do actually have one more thought -- I just updated that same branch, which you can again reinstall with devtools::install_github("igrabski/sc-SHC",ref="on_genes_corner_case",force=TRUE)
. If that doesn't work, I can send a new version of a debugging script. (And thanks again for your patience with this back-and-forth! It's difficult without being able to test on data myself, though I appreciate your help with running all these tests!)
And actually, just to save time, here is a new version of the debugging script -- maybe it makes sense to try this first (which you can do with parallelization).
I use set.seed
and didn't measure runtime as I let it run overnight. Thank you for you help. I'll test the new debugging script next.
The updated debugging script finished without error (using 16 cores). 🎉 I'll try updating from the branch and running the regular function to double-check. 🤞 Thank you for you help. ❤️ Is there any chance to get those changes in a new version release? That would make referencing it for publication and distributing it to other users on our server much easier.
That is great to hear!!
Re: the new version release, would it work for you if I set these changes to a pre-release branch and give it a pre-release version 0.1.0.9001?
All set with the prerelease branch!
I don't see that prerelease anywhere. Also, I get more > 400 clusters, most of them with < 100 cells. Is that normal or could I still hit some corner case with my dataset?
The prerelease branch (with version bump) is located here: https://github.com/igrabski/sc-SHC/tree/prerelease and can be installed by specifying ref="prerelease" in the install_github command.
Regarding the large number of clusters, that is not typical but I have seen it once before. You can check if there is anything unusual about those cells in the tiny clusters, but otherwise unfortunately it is possible this model/algorithm is not a good fit for this data.
On Sat, Sep 7, 2024, 9:53 AM Marcel Schilling @.***> wrote:
I don't see that prerelease anywhere. Also, I get more > 400 clusters, most of them with < 100 cells. Is that normal or could I still hit some corner case with my dataset?
— Reply to this email directly, view it on GitHub https://github.com/igrabski/sc-SHC/issues/25#issuecomment-2335194306, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANOBY2QV3NQ2Z34MSYF7FRTZVMANNAVCNFSM6AAAAABNKFIA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMZVGE4TIMZQGY . You are receiving this because you commented.Message ID: @.***>
--
This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.
OK. Thank you for all your support. I'll leave this open as the bug is not fixed in any release but feel free to close the issue if you consider it done with the prerelease.
I have a Seurat object that I extract the (sparse) count matrix from as follows:
Next, I remove zero-sum rows (actually, I am even slightly more stringent than that):
Finally, I also drop genes with low variance to prevent potential issues when factorizing the matrix:
However, even with all those safeguards in place, I cannot cluster the cells using
scSHC
:This runs for a long time but eventually erorrs out with the following (uninformative) error message:
Any advice on how to troubleshoot this further would be greatly appreciated.