Computing topics over normalized data

stephenslab / fastTopics

Fast algorithms for fitting topic models and non-negative matrix factorizations to count data.

https://stephenslab.github.io/fastTopics

Computing topics over normalized data #8

Closed GreenGilad closed 3 years ago

GreenGilad commented 3 years ago

In the "Analysis of single-cell RNA-seq data, Part 1" vignette you explain that the topic models should be executed over the counts data.

However, can it also run over non-discrete count data? For example, over the integrated data of two datasets produced by the Seurat integration procedure?

Thanks, Gilad Green

pcarbo commented 3 years ago

@GreenGilad There is only one strict requirement: the count data should be non-negative numbers. Normally I would use GetAssayData(object,"counts") from Seurat to get the X input to fit_poisson_nmf or fit_topic_model. So hopefully you plan to do something similar? Also please know that we have a Seurat wrapper in development here.

GreenGilad commented 3 years ago

@pcarbo Thanks for the quick reply! Exactly, over a single Seurat object I do plan to do something that looks like this. The question is, what would be a good approach over an integrated dataset? In that case we do not have the counts data, only the normalized data. By shifting the values in the matrix so that there are no negative values, I would be able to run the topic model over the normalized data, but the question is:

Does it make sense to do so? The raw counts are natural numbers, whereas the shifted normalized data can take any non-negative, non-integer value. The EM algorithm maximizes a Poisson likelihood with respect to the rates (lambda), and the Poisson distribution is defined over natural numbers.
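
To make the concern concrete, here is a minimal sketch in plain Python. It is not fastTopics code, and the function names and toy values are made up for illustration. It evaluates the Poisson log-likelihood expression at non-integer values, using the log-gamma function in place of log(x!), and finds the best rate on a grid. The point is only that the expression can still be evaluated and maximized for non-integer input; whether the result keeps a meaningful Poisson interpretation is a separate question that this sketch does not answer.

```python
import math

def poisson_loglik(x, lam):
    """Poisson log-likelihood expression for one value x >= 0 and rate lam > 0.

    lgamma(x + 1) equals log(x!) for integer x and extends it smoothly to
    non-integer x, so the expression can be evaluated for any x >= 0.
    """
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def total_loglik(values, lam):
    """Sum of the per-value log-likelihood terms for a shared rate."""
    return sum(poisson_loglik(x, lam) for x in values)

def best_rate_on_grid(values, grid):
    """Return the candidate rate with the highest total log-likelihood."""
    return max(grid, key=lambda lam: total_loglik(values, lam))

# Hypothetical toy data: raw counts and a non-integer vector with the same mean.
counts = [0, 1, 3, 4]
shifted = [0.3, 1.7, 2.2, 3.8]

grid = [i / 100 for i in range(1, 1001)]  # candidate rates 0.01, 0.02, ..., 10.00
rate_counts = best_rate_on_grid(counts, grid)
rate_shifted = best_rate_on_grid(shifted, grid)
```

In both cases the best rate on the grid is the sample mean, because the lgamma term does not depend on the rate. So the optimization itself is well defined for non-integer input, even though the data can no longer be Poisson draws.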

pcarbo commented 3 years ago

@GreenGilad I suggest following up by email. fastTopics may or may not be appropriate for your setting; we have not yet tested fastTopics for joint analysis of multiple data sets (this is something we are actively exploring). If the differences between the data sets are "small enough", then I think it would be reasonable to run fastTopics directly on the raw counts. A simple thing to do would be to run fastTopics separately on the individual data sets and on the combined data set and compare the results (there are however some subtleties in comparing the results effectively).
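
One subtlety when comparing separate fits is that topic labels are arbitrary: topic 1 in one fit need not correspond to topic 1 in another. A rough way to line topics up is to compare their gene weights over the genes both fits share. The sketch below does this in plain Python with cosine similarity. It is an illustration only, not a fastTopics function; the data layout (a dict of gene weights per topic), the topic labels, and the toy numbers are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_topics(topics_a, topics_b):
    """For each topic in fit A, find the most similar topic in fit B.

    topics_a, topics_b: dicts mapping a topic label to a dict of
    gene -> weight. Only genes present in both topics are compared.
    """
    matches = {}
    for name_a, genes_a in topics_a.items():
        best_name, best_sim = None, -1.0
        for name_b, genes_b in topics_b.items():
            shared = sorted(set(genes_a) & set(genes_b))
            sim = cosine([genes_a[g] for g in shared],
                         [genes_b[g] for g in shared])
            if sim > best_sim:
                best_name, best_sim = name_b, sim
        matches[name_a] = (best_name, best_sim)
    return matches

# Hypothetical gene weights from two separate fits (toy numbers).
fit_a = {"k1": {"CD3E": 0.6, "MS4A1": 0.1, "LYZ": 0.3},
         "k2": {"CD3E": 0.1, "MS4A1": 0.7, "LYZ": 0.2}}
fit_b = {"k1": {"CD3E": 0.05, "MS4A1": 0.75, "LYZ": 0.2},
         "k2": {"CD3E": 0.65, "MS4A1": 0.05, "LYZ": 0.3}}

matches = match_topics(fit_a, fit_b)
```

With these toy numbers, topic k1 of the first fit pairs with k2 of the second and vice versa, which is the label switching described above. This greedy matching can assign the same topic twice, so treat it as a first look rather than a proper alignment.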

pcarbo commented 3 years ago

Look in the DESCRIPTION file.

inbarsh2 commented 2 years ago

Hi, I encountered a similar problem when trying to run fastTopics on integrated data. I would like to run fastTopics on each dataset separately, as you suggested, but I am not sure how to compare the results effectively. Thank you!

pcarbo commented 2 years ago

@inbarsh2 Could you explain in more detail what you mean by "compare the results"?

inbarsh2 commented 2 years ago

Sure. I have patient data with high variability between samples, which I need to overcome. I would like to run fastTopics on each sample separately and then find common expression programs or genes. My question is: what is the best way to do so? I also tried integrating the data and then running fastTopics, but the integrated matrix isn't the correct input for the algorithm since it is scaled (as described above). Thank you.

pcarbo commented 2 years ago

@inbarsh2 I would start by running fastTopics on the raw count data for all the samples and see what the results look like; are some topics capturing sample-specific effects? So you have access to the raw count data?
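
One simple way to look for sample-specific topics after a joint fit is to average each topic's per-cell proportion within each sample and see whether any topic is concentrated in a single sample. The plain-Python sketch below is an illustration only. It assumes the per-cell topic proportions have already been extracted as rows summing to 1, and the function name, sample labels, and numbers are hypothetical.

```python
from collections import defaultdict

def mean_proportions_by_sample(proportions, sample_of_cell):
    """Average each topic's proportion over the cells of each sample.

    proportions: list of per-cell lists of topic proportions.
    sample_of_cell: list of sample labels, one per cell, in the same order.
    Returns a dict mapping sample label -> list of mean proportions.
    """
    sums = {}
    n_cells = defaultdict(int)
    for row, sample in zip(proportions, sample_of_cell):
        if sample not in sums:
            sums[sample] = [0.0] * len(row)
        for k, p in enumerate(row):
            sums[sample][k] += p
        n_cells[sample] += 1
    return {s: [v / n_cells[s] for v in totals] for s, totals in sums.items()}

# Hypothetical toy input: 4 cells, 3 topics, 2 samples.
proportions = [
    [0.5, 0.5, 0.0],
    [0.7, 0.3, 0.0],
    [0.2, 0.2, 0.6],
    [0.1, 0.1, 0.8],
]
sample_of_cell = ["A", "A", "B", "B"]

means = mean_proportions_by_sample(proportions, sample_of_cell)
```

Here the third topic is absent from sample A and dominant in sample B, which is the kind of pattern that would suggest a topic is capturing a sample-specific effect rather than a shared expression program.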