Closed dagarfield closed 2 years ago
@dagarfield Thanks for the positive feedback on fastTopics.
Re: batch effects (or other fixed effects). This is something that has come up before, but isn't currently implemented. glmpca is an R package that might better accommodate your setting; it allows for defining batch effects or other fixed covariates. An ad hoc alternative to consider would be to try running fastTopics on all the data, then independently on each batch (or some variation on that approach).
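A minimal sketch of that ad hoc alternative, assuming a hypothetical cells-by-genes count matrix `counts` and a per-cell `batch` factor (both placeholders), using the `fit_topic_model` interface from fastTopics:

```r
library(fastTopics)

# Hypothetical inputs: a cells-by-genes count matrix and a batch label per cell.
# counts <- ...; batch <- factor(...)

# Fit one topic model on all of the data pooled together.
fit_all <- fit_topic_model(counts, k = 6)

# Then fit a separate topic model within each batch, so the per-batch
# topics can be compared against the pooled ones.
fits_by_batch <- lapply(levels(batch), function (b)
  fit_topic_model(counts[batch == b, ], k = 6))
names(fits_by_batch) <- levels(batch)
```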
So far it hasn't really been an issue for this dataset, but it will limit our ability to use this approach as the project starts to include data from more and more runs (and possibly labs). I might thus gently encourage such an approach (we like the interface of your code, too). But I'll check out GLMPCA, thanks for that. Had some luck with batch effects and the iNMF approach in Liger, but nothing beats Structure plots for making the experimentalists happy...
@dagarfield Yes, I agree with you on the need to extend the topic model to allow for multiple batches or multiple data sets, similar to what has been done in Liger and other methods, or a hierarchical approach similar to Levitin et al. [There is also the structural topic model, or STM, of Roberts et al ("A model of text for experimentation in the social sciences"), and I think there is an R package for STM.] For the immediate term I can also suggest trying a couple of things: (1) if batch effects are small, you may be able to get by without accounting for them; (2) it is possible the topic model will "learn" the batch effects as separate topics, and potentially these could be dealt with post hoc by removing them from the topic model (for the purposes of, say, visualization using a Structure plot). If you come up with any clever ways to use fastTopics to deal with this issue, perhaps in combination with other methods, please let us know.
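One way suggestion (2) might be implemented post hoc — a sketch, assuming a topic (here topic 4, a made-up index) has already been identified by eye as a batch topic, and operating directly on the loadings matrix `L` of a fastTopics fit:

```r
# Suppose fit is a fastTopics fit and topic 4 (hypothetical) looks like a
# batch effect. Drop that column from the topic proportions, then
# renormalize so each cell's remaining proportions again sum to 1.
k_batch <- 4
L <- fit$L[, -k_batch, drop = FALSE]
L <- L / rowSums(L)

# L can now be used for, say, a Structure-plot-style visualization
# of the remaining (non-batch) topics.
```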
I've been assuming that (2) will rear its head if batch effects are actually a significant issue, so I'm not too worried in this particular case, but I'm thinking more about future runs with more batches, where it would be best to model that explicitly. I toyed briefly with the idea of extracting fastMNN-corrected values and then rounding them to integers for fastTopics, but that felt... weird. So I'm just hoping for (1) or (2).
For each topic, I'm also fitting a beta GLM with a random-effects term for Sample (the main batch effect here) so that, on a topic-specific level, I can (a) get some measure of how much Sample matters for that topic, and (b) get a stronger measure of the extent to which other variables (e.g., Treatment) affect a given topic in a way that is robust to Sample/batch effects. This seems to work pretty well, at least insofar as it points (in a batch-independent way) toward topics worth spending more careful time interpreting.
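The poster doesn't say which package they used; one way to fit a beta regression with a random intercept for Sample is glmmTMB (my assumption, not theirs), sketched here for a hypothetical data frame `dat` with one row per cell:

```r
library(glmmTMB)

# Hypothetical data frame: one row per cell, with the proportion of a
# single topic in dat$prop, plus Sample and Treatment labels.
# The beta family requires proportions strictly inside (0, 1), so
# squeeze away any exact zeros or ones first.
eps <- 1e-6
dat$prop <- pmin(pmax(dat$prop, eps), 1 - eps)

# Beta regression with a fixed effect for Treatment and a random
# intercept for Sample (the batch variable).
mod <- glmmTMB(prop ~ Treatment + (1 | Sample),
               family = beta_family(), data = dat)
summary(mod)
```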
Yes, you could use a beta glm, although for the purposes of exploratory analysis, boxplots and scatterplots comparing topic proportions vs. covariates of interest should work just fine.
Also note that the X in fastTopics need not be integer-valued (it only needs to contain non-negative real numbers).
It was more that I wanted to "remove" the batch effect than anything else.
Huh... I had assumed that, with a Poisson-based model, non-integer values would take us to a weird place. This is very good to know, as it opens up a range of possibilities, though presumably we'd want to return the values as close as possible to their original meaning for interpretation (e.g., removing any log transforms or library-size normalization)?
The topic model is most appropriate for multinomial or Poisson-distributed data. But we can also view the parameter estimation as maximizing a likelihood, and we get a valid likelihood whether or not the data are integer-valued. Log transforms and library-size normalization are certainly not needed because the topic model already accounts for these.
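To make the "valid likelihood" point concrete: in the Poisson NMF formulation, the log-likelihood (dropping the $\log x_{ij}!$ term, which does not depend on the parameters) is

$$\ell(L, F) = \sum_{i,j} \left\{ x_{ij} \log (LF^T)_{ij} - (LF^T)_{ij} \right\},$$

and every term here is well defined for any non-negative real $x_{ij}$, not just integers.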
Great, good to know. Then it sounds like one potential solution in the batch case is, indeed, to extract the corrected values (then un-log and un-normalize them) from, e.g., fastMNN, and use these new, non-integer (but still non-negative) values as the input X.
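A rough sketch of that idea, under two assumptions worth checking against your pipeline: that `corrected` holds batch-corrected log-expression on the usual log1p, library-size-normalized scale, and that `size_factors` holds the per-cell scaling used in the original normalization (both names are placeholders). Corrected values can dip slightly below zero, so they are clamped:

```r
# Hypothetical inputs: corrected log-expression (cells x genes) and
# per-cell size factors from the original normalization.
Y <- expm1(corrected)                # undo the log1p transform
Y <- sweep(Y, 1, size_factors, "*")  # undo library-size normalization
Y[Y < 0] <- 0                        # corrected values can go slightly negative

# Y is now non-negative and on a count-like scale, so it can be
# passed to fastTopics as the input X.
```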
Hello! Very happy with this software tool. I was wondering if it is possible to include a batch correction variable in the topic calculation in cases where a dataset consists of multiple batches and simple batch effects (of the sort that might be handled with an extra term in a linear model).
Cheers,
DG