sararselitsky / FastPG

Fast phenograph, CyTOF

Error in FastPG::rcpp_parallel_jce(ind) : negative length vectors are not allowed #19

Open singlecellfan opened 2 years ago

singlecellfan commented 2 years ago

Hi,

thank you for this package, it works quite nicely and is much faster than Phenograph!

I am working with 20 million cells from a flow cytometry analysis. When I used a comparatively low "k" (e.g. k=20 or k=30) I got far too many clusters, so I decided to increase k. With k=100 or k=200 the number of clusters was reduced, but it was still too many. When I increased k further (k=400), I got this error message:

Error in FastPG::rcpp_parallel_jce(ind) : negative length vectors are not allowed

Based on the documentation you have provided, it has something to do with the Jaccard metric step:

links <- FastPG::rcpp_parallel_jce(ind)
links <- FastPG::dedup_links(links)
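For context, the step that produces `ind` is the k-nearest-neighbor search described in the README; my call looks roughly like this (the exact RcppHNSW arguments are reproduced from memory, so treat them as approximate):

```r
library(FastPG)

# "data" is the 20M cells x features matrix
# k-nearest-neighbor indices via RcppHNSW
knn <- RcppHNSW::hnsw_knn(data, k = 400, distance = "l2", n_threads = 16)
ind <- knn$idx   # n_cells x k matrix of neighbor indices

# Jaccard edge construction -- this is the call that throws the error
links <- FastPG::rcpp_parallel_jce(ind)
links <- FastPG::dedup_links(links)
```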

Do you have any idea what the actual problem might be or how to fix it? I am working on a cluster so it should be fine in terms of memory and computing power.

Thanks much in advance!

sararselitsky commented 2 years ago

I haven't seen this error before, but I included Tom in this thread. Tom, have you?

Can you send me the data? I've never used a k that high, but I'd be interested to check it out.


singlecellfan commented 2 years ago

Hi,

since this is an ongoing collaboration with different partners, I cannot share the data. But it is a simple data matrix generated from a flow cytometry experiment, with rows = cells and columns = features.

Alternatively, do you know of any other way to lower the number of clusters?

tom-b commented 2 years ago

Hi,

I'm not precisely sure what is causing this error and I haven't seen it before. But we also didn't comprehensively test all the different parameters (like k) over large ranges of values.

My intuition is that increasing the k parameter won't necessarily reduce the number of final clusters anyway. I have always thought of k as just a starting point for getting the graph roughly initialized; it's really just a parameter passed through to the HNSW code, so increasing it isn't logically bound to produce fewer clusters in the end.

I am really guessing here, but I think what we are seeing is actually a memory error. We are starting with 20 million rows and trying to find 400 nearest neighbors for each of them, which is 8 billion rows of vertex1-to-vertex2 data. And running on a cluster isn't going to help with that: FastPG is not a distributed algorithm that runs across multiple machines. It runs on a single machine, using all the physical cores available, and is limited to the memory on that machine. I strongly suspect this error wouldn't occur on, for example, a GCP or AWS instance with a lot of memory . . .
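Just to put rough numbers on that guess (the three-column links layout here is my assumption about the internals, not something I've verified):

```r
n_cells <- 20e6
k       <- 400

n_links <- n_cells * k        # 8e9 candidate edges before dedup_links()
n_links                       # 8,000,000,000

# That count alone is past the 32-bit signed integer limit, which is one
# way R code can end up complaining about a "negative length" vector:
.Machine$integer.max          # 2,147,483,647

# And simply holding (from, to, weight) as doubles would already need
# roughly 8e9 * 3 * 8 bytes, i.e. on the order of 180 GB, before any
# other overhead from the knn index or the clustering itself.
n_links * 3 * 8 / 1024^3      # ~179 GiB
```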

The more interesting question to me is whether there is any way of enforcing a smaller number of final clusters. How many clusters are you ending up with? What makes you think that number is too large?

It might be that by manipulating some of the threshold parameters we could change the "stopping" modularity-change value, so that instead of stopping there the Louvain clustering continues. But I am generally hesitant to depend on ever-smaller floating point values to actually "mean" what we would like them to. At some point the algorithm may fall apart trying to deal with extremely small values of modularity change . . .

singlecellfan commented 2 years ago

Hi,

thank you for the detailed explanation. I had read about the relationship between k and the number of clusters here and thought I'd give it a try: http://cytoforum.stanford.edu/viewtopic.php?f=1&t=1844

With k=20 I obtained more than 70 clusters and most of them didn't have clear borders. And from a biological point of view we would not expect to find so many subpopulations.

Regarding the memory issue: although I was hesitant at first (because with k=300 FastPhenograph worked perfectly), you could be right. It might be that with k=400 the memory limit is exceeded. I will need to check that again.

SamGG commented 2 years ago

Hi, I agree with @tom-b. I would not increase k beyond 100 or 150. 70 clusters is not that many. You could try to reduce this number by merging clusters (look at meta-clustering, as in FlowSOM for example); a rough sketch is below. As the borders are not clear, you should also check that there is no batch effect in the dataset and that the markers (dimensions) are not too numerous: start with a small set of clear markers to really conclude. Best.
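For example, a very plain way to merge clusters by similarity (just an illustration in base R, not FlowSOM's actual meta-clustering; data and clusters stand for your expression matrix and the FastPG labels):

```r
# Median marker profile of each cluster (rows = clusters, cols = markers)
medians <- apply(data, 2, function(m) tapply(m, clusters, median))

# Hierarchically cluster the profiles and cut at the number of
# meta-clusters you find biologically reasonable, e.g. 25
hc   <- hclust(dist(medians), method = "average")
meta <- cutree(hc, k = 25)

# Map each cell from its original cluster to its meta-cluster
meta_clusters <- meta[as.character(clusters)]
```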

sararselitsky commented 2 years ago

Thanks, Tom and Samuel!

70 clusters would be difficult to work with, but I haven't gotten a number that high before. Just curious, how many markers are you clustering with?

I want to add on to what Samuel said. Is this data noisy? How many cells are in the smallest cluster? Have you run PCA on a subset of these cells to get a sense of the underlying structure or plotted out the distribution of the markers? This may be a QC issue.
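Something quick like this on a subsample is usually enough to see whether the structure supports that many clusters (the variable names are placeholders for your matrix):

```r
# Subsample 100k cells so the plots stay manageable
set.seed(1)
sub <- data[sample(nrow(data), 1e5), ]

# PCA to get a sense of the underlying structure
pca <- prcomp(sub, scale. = TRUE)
plot(pca$x[, 1:2], pch = ".", main = "PCA of 100k subsampled cells")

# Distribution of each marker, to spot noisy or uninformative channels
boxplot(sub, las = 2, main = "Per-marker distributions")
```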
