Closed: tomthomas3000 closed this issue 3 years ago.
@tomthomas3000 fastTopics should be able to handle 1.2 million cells provided that you have sufficient memory. Based on my experience, I would recommend trying something like this to get reasonably good results relatively quickly:
fit <- fit_poisson_nmf(X,k,numiter = 200,method = "scd",control = list(numiter = 4,extrapolate = TRUE,nc = nc))
where "nc" here is the number of cores or CPUs available on your machine.
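As an aside, one simple way to choose a value for nc is to query the machine (a sketch; detectCores is from the parallel package included with R, and reserving one core is just a convention, not a fastTopics recommendation):

```r
library(parallel)
# Use one fewer than the number of detected cores; note that detectCores
# may count hyper-threads rather than physical cores on some machines.
nc <- max(1, detectCores() - 1)
```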
For the choice of k, there may not be a single "best" k; larger values of k can provide a higher level of detail, whereas smaller values of k can be helpful for learning about higher-level structure in the data. In any case, you may find the functions plot_loglik_vs_rank and compare_fits useful for guiding the choice of k.
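One way to use those two functions is to fit the model over a small grid of candidate values of k and then compare the fits (a hedged sketch; the grid of k values, and the variables X and nc, are assumptions carried over from the command above):

```r
library(fastTopics)
ks <- c(3, 6, 9, 12)   # candidate ranks (an assumption; adjust for your data)
fits <- lapply(ks, function (k)
  fit_poisson_nmf(X, k = k, numiter = 200, method = "scd",
                  control = list(numiter = 4, extrapolate = TRUE, nc = nc)))
names(fits) <- paste0("k", ks)
plot_loglik_vs_rank(fits)  # log-likelihood attained at each rank
compare_fits(fits)         # table summarizing the quality of each fit
```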
Thank you for the suggestion! I was trialling this on a portion of the overall dataset (13.5k cells) with 24 cores (16 GB each) and k = 6. The initialization method changed from Topic SCORE to random initialization. Could you suggest why this might be the case? Also, this is taking ~24 hours in total to run for 13.5k cells; could this be because of the switch to random initialization?
Full output below:
Thank you!
@tomthomas3000 The main issue I see is that X is not encoded as a sparse matrix, even though the data are indeed sparse. So I would try something like this first:
library(Matrix)
X <- as(X,"dgCMatrix")
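To illustrate the benefit on a small toy matrix (a self-contained sketch; the Matrix package ships with R, and the simulated data are stand-ins for a sparse counts matrix):

```r
library(Matrix)
set.seed(1)
# Simulate a mostly-zero counts matrix, like sparse scRNA-seq data.
X <- matrix(rpois(1e4, 0.05), 100, 100)
mean(X > 0)                # fraction of nonzero entries (about 5% here)
Y <- as(X, "dgCMatrix")    # compressed, sparse, column-oriented storage
print(object.size(X))
print(object.size(Y))      # much smaller than the dense version
```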
Thank you @pcarbo. I have added this step to the workflow, and it no longer raises the warning 'Input matrix "X" has less than 10% nonzero entries; consider converting "X" to a sparse matrix to reduce computational effort.' This is already faster than before, which will help even more when I scale up to buckets of 300k cells.
However, I still get a Topic SCORE failure, after which random initialization is used instead. What could the potential causes be, and what would the impact be on run times? Furthermore, should I expect the run time to increase linearly with increasing values of k?
Also, do you have any thoughts on excluding common genes (i.e., genes occurring in more than, say, 97% of cells), or selecting only highly variable genes (e.g., n = 5,000)?
Thank you again for being so quick with your replies!
@tomthomas3000 This is not a critical issue; a random initialization should be okay. Yes, runtime should increase linearly with k. My default would be to use all genes, but a case could be made for filtering out genes if, say, it is known in advance that some genes will be unhelpful for distinguishing cell types.
@pcarbo fantastic - another (last) question, does the number of updates (numiter) required to improve the NMF model vary with specified K values i.e. for different K values, different numbers of updates would be required to estimate parameters to get close to the maximum-likelihood solution? Thanks again and apologies for the mini q&a session.
@tomthomas3000 Broadly speaking, no, or at least not directly; it depends less on the size of K, and more on the extent to which the topics (or basis vectors) are interdependent, which I suppose could become more likely as K increases. In short, I cannot provide general guidance for that question.
No worries, and understandable. Another query, to prevent misunderstanding on my part: for the purpose of quick results, you recommended running:
fit <- fit_poisson_nmf(X,k,numiter = 200,method = "scd",control = list(numiter = 4,extrapolate = TRUE,nc = nc))
Should I be running init_poisson_nmf manually/explicitly before running fit_poisson_nmf, as per your suggestion? Do I need to refine the fit by running fit_poisson_nmf a second time? And following on from this, having run fit_poisson_nmf and deemed a particular value of K suitable using plot_loglik_vs_rank and compare_fits, do I need to run poisson2multinom to recover the topic model before continuing to DE analysis, etc.?
I suppose the principle behind this question is: with fit_topic_model, we (1) initialize the Poisson NMF model fit (init_poisson_nmf); (2) perform the main model-fitting step using fit_poisson_nmf; (3) refine the fit by running extrapolated updates, again using fit_poisson_nmf; and (4) recover the multinomial topic model by calling poisson2multinom. I just wanted to make sure that running fit_poisson_nmf on its own is sufficient to replace the four steps performed for us by fit_topic_model. Thank you!
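For reference, the four steps above might be sketched as follows (a hypothetical sketch, assuming X is a sparse counts matrix and a value of k has already been chosen; the numiter values and the em-then-scd split are placeholders, not necessarily the exact defaults of fit_topic_model):

```r
library(fastTopics)
fit0 <- init_poisson_nmf(X, k = k, init.method = "random")   # (1) initialize
fit1 <- fit_poisson_nmf(X, fit0 = fit0, numiter = 100,
                        method = "em")                       # (2) main fitting
fit2 <- fit_poisson_nmf(X, fit0 = fit1, numiter = 100,
                        method = "scd",
                        control = list(extrapolate = TRUE))  # (3) refine with
                                                             #     extrapolated updates
fit  <- poisson2multinom(fit2)                               # (4) recover the
                                                             #     multinomial topic model
```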
@tomthomas3000 fit_topic_model is effectively just a wrapper for fit_poisson_nmf + poisson2multinom; you can use either one, it is just a matter of preference. Generally, I would recommend using fit_poisson_nmf if you want finer control over the optimization.
You could also run init_poisson_nmf, but it isn't necessary. I would only recommend using init_poisson_nmf if, say, there was some additional information that could be used to obtain good initial estimates of the model parameters.
Since you are interested in interpreting the topics, then yes, you should use poisson2multinom before proceeding with the other steps of the analysis.
Hope this helps.
fantastic - thank you for your helpful suggestions and advice!
Thank you for this update to the fastTopics package. I am working with a relatively large dataset (~1.2 million cells), which I have subsetted into relevant 'buckets' to find relevant topics beyond traditional 'clustering' using topic modelling. This still yields datasets ranging in size from 30k to 300k cells. As such, I had a few questions re: working with large datasets, as below:
1) Do you have any suggestions on how to handle this computationally, particularly in terms of parallelisation when running fit_topic_model, to improve the time taken to run?
2) Furthermore, any suggestions on finding an optimum value for k besides trial and error?
Thank you again.
Kind Regards, Tom