tanaylab / metacells

Metacells - Single-cell RNA Sequencing Analysis
MIT License

Preserving a reasonable number of rare cells #62

Closed by annaKett 9 months ago

annaKett commented 9 months ago

Hello, first of all thank you for this well-documented and easy-to-use package! :) I am working with a large scRNA-seq dataset annotated with cell types that I want to group into metacells and then feed into another algorithm for pseudotime inference. After grouping, I infer a cell type label for each metacell by majority vote over its member cells (a metacell is only kept if the entropy of its inferred label is lower than the entropy of the original labels across all cells).

My problem is that the dataset contains a very small number of stem cells that all 'vanish' during metacell construction: after determining the cell types of the metacells, there are no metacells with a stem cell label. Decreasing the target size of the metacells did not help (in fact, the algorithm throws an error if the target size is smaller than 25 in my case). Since I need the stem cells as root cells for pseudotime trajectory inference in the next step, I have to find a way to preserve them through metacell construction. Clustering the stem cells separately does not work, since in some cases I only have 5 of them.

I was wondering whether it would be a valid approach to remove the stem cells before metacell construction and add them back afterwards as 1-cell metacells. This would mean I have a metacell dataset containing metacells grouping ~25 cells each, plus a few metacells containing exactly one stem cell. I would be really glad to hear from a developer whether, from an algorithmic point of view, this is a valid approach if I stick to a metacell target size of about 25-50.
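(For concreteness, the majority-vote-plus-entropy labeling rule described above could be sketched as follows; the helper names are illustrative and not part of the metacells package.)

```python
import math
from collections import Counter

def majority_label(labels):
    """Most common cell type label among a metacell's member cells."""
    return Counter(labels).most_common(1)[0][0]

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of cell type labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy example: one metacell containing mostly T cells.
member_labels = ["T", "T", "T", "B", "T"]
print(majority_label(member_labels))            # "T"
print(round(label_entropy(member_labels), 3))   # entropy of a 4:1 split
```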

Thank you lots! -Anna

orenbenkiki commented 9 months ago

It isn't good that the algorithm is missing the stem cells. We tried hard to maximize the sensitivity of the algorithm, but it doesn't always work :-(

That said, in general we don't "believe" in a cell state unless we see a minimal number of cells in that state - the default is 12 (this is why you get an error if the target size is less than twice this number). So if you have fewer such cells, we don't "trust" the result enough to accept it as a metacell. Remember that our goal for a robust metacell is ~100 cells - this is so we can have a robust estimator of the gene expression level in the cell state. These are all defaults; in theory you can reduce the minimal number of cells in a metacell, but this will result in small MCs which could easily be just noise.
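As a back-of-the-envelope illustration of why ~100 cells gives a more robust estimator than 12 (the UMI depth and gene fraction below are assumed numbers for illustration, not package defaults):

```python
import math

def relative_error(n_cells, umis_per_cell, gene_fraction):
    """Poisson relative error (1/sqrt of the expected UMI count) of a
    gene's estimated expression fraction in a pooled metacell."""
    expected_count = n_cells * umis_per_cell * gene_fraction
    return 1.0 / math.sqrt(expected_count)

# Assumed numbers: ~2000 UMIs per cell, a gene at 0.1% of the transcriptome.
for n in (12, 100):
    print(n, round(relative_error(n, 2000, 0.001), 2))
# With 12 cells the relative error is roughly 3x larger than with 100.
```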

If these are very few cells, which you can reliably detect, just run the algorithm normally, then manually set the metacell annotation of these few cells to the next-highest metacell index, then run collect metacells. This will "steal" the cells from any metacell they were assigned to, which isn't that bad if there are very few of them.
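A minimal sketch of this manual override, operating on a plain array of per-cell assignments (to the best of my understanding, the metacells package stores this as a `metacell` annotation in the cells' `obs`, with negative values marking outlier cells):

```python
import numpy as np

# Hypothetical per-cell metacell assignments after a normal run:
# -1 marks an outlier cell; 0..2 are computed metacell indices.
metacell_of_cell = np.array([0, 0, 1, 1, 2, 2, 2, -1, 0, 1])

# Suppose cells 7 and 9 were reliably identified as stem cells.
stem_cells = np.array([7, 9])

# Assign them to a fresh metacell index one past the current maximum,
# "stealing" them from whatever metacell (or outlier group) they were in.
new_index = metacell_of_cell.max() + 1
metacell_of_cell[stem_cells] = new_index

print(metacell_of_cell)  # cells 7 and 9 now form metacell 3
# Afterwards, re-run the "collect metacells" step so the per-metacell
# profiles are rebuilt from the edited assignments.
```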

annaKett commented 9 months ago

Thank you lots for your quick reply! Your suggested approach (manually putting the stem cells in a separate metacell) works well so far.

I'd have another question regarding this step and hope that you can help me again: I have multiple datasets. Since I can't integrate and harmonize them properly before applying metacells (my batch effect correction method gives me a corrected embedding, but not corrected counts), I want to keep the datasets separate as long as possible. The number of stem cells differs a lot between the datasets, so the manual approach of building a separate stem metacell per dataset will result in stem metacells of very different sizes. E.g., dataset 1 has only one stem cell, so its stem metacell will contain a single cell, while dataset 2 has 30 stem cells, so its stem metacell will contain 30 cells.

With your remarks about robust estimators for the metacells in mind, I am almost sure that (a) small stem metacells are not reliable and (b) after integration there will be weird effects in the very small stem metacells compared to the larger (stem and normal) metacells. Do you think the differences in size are problematic, or do you have recommendations on how to handle the stem metacells further (like, maybe, splitting large stem metacells)? Thank you a lot!

orenbenkiki commented 9 months ago

Harmonizing datasets is notoriously tricky... I'm not sure what method you use for it after computing the MCs, but if it results in something you can apply to the cells of each dataset (regardless of their type), then a good solution would be to compute the per-dataset MCs, compute the harmonization, apply it to the cells of each dataset, and then compute MCs on the single unified dataset. This way the algorithm "should" have enough data to create stem cell MCs even if the per-dataset runs missed them.
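A toy illustration of why pooling helps (the counts and threshold here are made up for the example; 12 is the default minimum mentioned earlier in the thread):

```python
# Each dataset alone has fewer stem cells than the minimal-cells-per-MC
# threshold, but the unified dataset crosses it, so a stem cell metacell
# becomes possible after harmonization and pooling.
MIN_CELLS_PER_METACELL = 12  # default minimum discussed above

stem_cells_per_dataset = {"dataset_1": 5, "dataset_2": 8}

detectable_alone = {name: n >= MIN_CELLS_PER_METACELL
                    for name, n in stem_cells_per_dataset.items()}
detectable_unified = sum(stem_cells_per_dataset.values()) >= MIN_CELLS_PER_METACELL

print(detectable_alone)    # neither dataset alone reaches the threshold
print(detectable_unified)  # True: 5 + 8 = 13 >= 12 in the unified dataset
```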

If this doesn't work and you are forced to use the per-dataset MCs... As I said, in general we don't trust MCs of fewer than 12 cells - and even then, with so few samples of the cell state, the MC state is highly suspect. You can decide on your own threshold, I guess, but at some point one has to admit that there's just not enough signal in the data to measure some cell states.

annaKett commented 9 months ago

Thank you for your reply! I use Harmony and unfortunately, it only computes a corrected PCA embedding. The publishers of the data I'm using strongly recommend avoiding working with batch-corrected counts directly and recommend using the corrected embedding only. So I would have to apply metacells to an embedding, which is not possible, is it? Also, I don't understand why I should compute the MCs, then harmonize, then compute MCs again - could you elaborate on that, please? Thank you!

orenbenkiki commented 9 months ago

The point of running MCs on the unified dataset is that cell states that are too rare to be detected in each single dataset will have sufficient numbers to be detectable in the unified dataset - for example, the stem cells. Also, by looking at the composition of each MC, one can get a good notion of whether it exists in multiple datasets, or is unique to one or a few of them.
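The composition check mentioned above amounts to a cross-tabulation of per-cell metacell index against dataset of origin. A sketch with made-up data, using only the standard library:

```python
from collections import Counter

# Made-up per-cell annotations: (metacell index, dataset of origin).
cells = [(0, "d1"), (0, "d1"), (0, "d2"),
         (1, "d1"), (1, "d1"), (1, "d1"),
         (2, "d2"), (2, "d2")]

# For each metacell, count the cells contributed by each dataset.
composition = {}
for metacell, dataset in cells:
    composition.setdefault(metacell, Counter())[dataset] += 1

for metacell, counts in sorted(composition.items()):
    print(metacell, dict(counts))
# Metacell 0 is shared by both datasets, while metacell 1 is
# unique to d1 and metacell 2 is unique to d2.
```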

When you say "avoiding using batch corrected counts" and "using the corrected embedding only" - does that mean they recommend using only the principal component values (the "embeddings") instead of corrected per-gene UMIs? I know this is a common practice.

No, I wouldn't recommend running MCs on these - we try hard to compute MCs using "real" (possibly batch-adjusted) per-gene UMIs, as these allow us to do the follow-up analysis (such as finding marker genes and doing gene-gene plots) which gives interpretable results.

You could just use the dataset-specific MCs, with some threshold on the size of the stem cell MCs, I guess. That wouldn't be as sensitive as magically harmonizing the per-cell per-gene UMIs and running a unified MC model. However, I don't have a textbook method to offer for doing this harmonization :-(

annaKett commented 9 months ago

Ah ok I get it now, thanks!

about: "does that mean they recommend using only the principal component values (the "embeddings") instead of corrected per-gene UMIs?" - yes, that's what they mean.

Thank you so much, this has been really helpful! I might get back to you with some follow up questions on this but for now I have a good overview.