tanaylab / metacells

Metacells - Single-cell RNA Sequencing Analysis
MIT License

Rare genes, rare cells, and outliers in divide and conquer #74

Closed LoaiNaom closed 1 month ago

LoaiNaom commented 1 month ago

When running divide_and_conquer_pipeline with the default parameter values, I get the following output, and I would very much appreciate some guidance on it:

set adata.var[rare_gene]: 180 true (0.8069%) out of 22308 bools
set adata.var[rare_gene_module]: 22128 outliers (99.19%) and 180 grouped (0.8069%) out of 22308 int32 elements with 14 groups with mean size 12.86
set adata.obs[cells_rare_gene_module]: 2692436 outliers (99.63%) and 9996 grouped (0.3699%) out of 2702432 int32 elements with 14 groups with mean size 714
set adata.obs[rare_cell]: 9996 true (0.3699%) out of 2702432 bools

Could you please clarify what the rare genes and rare cells are? As you can see, I have very few of them, and most of the data is being marked as outliers for some reason. What do these outliers mean, and how does all this affect the metacells computation?

orenbenkiki commented 1 month ago

Rare gene modules capture rare and weak gene programs that would otherwise be missed by the algorithm. What you see is that our rare-behavior detection pre-processing phase detected 14 such programs and created a metacell for each one. Naturally, being rare, these programs capture very little of the data, so from that point of view, most of the data is an outlier relative to them. In other words, in these log lines "outliers" counts the genes and cells that do not belong to any rare gene module.
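As a sanity check on how to read those log lines, the percentages and mean group sizes can be reproduced from the raw counts alone. This is a minimal sketch, not the package's own logging code: `summarize_groups` is a hypothetical helper, and it assumes that "outliers" is simply the total minus the grouped elements.

```python
def summarize_groups(n_total, n_grouped, n_groups):
    """Recompute the figures shown in a 'outliers ... grouped ...' log line.

    Hypothetical helper: assumes every element is either grouped (belongs
    to a rare gene module) or counted as an outlier.
    """
    n_outliers = n_total - n_grouped
    return {
        "outliers": n_outliers,
        "outliers_pct": round(100.0 * n_outliers / n_total, 2),
        "grouped_pct": round(100.0 * n_grouped / n_total, 4),
        "mean_group_size": round(n_grouped / n_groups, 2),
    }


# Counts copied from the log output quoted in the question.
genes = summarize_groups(n_total=22308, n_grouped=180, n_groups=14)
cells = summarize_groups(n_total=2702432, n_grouped=9996, n_groups=14)

print(genes)
print(cells)
```

The recomputed values match the log: about 0.81% of the genes and 0.37% of the cells belong to the 14 rare modules, and everything else is reported as an "outlier" with respect to those modules only.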

LoaiNaom commented 1 month ago

@orenbenkiki Thank you, that was really helpful! One last question: do you have any tips for better cell separation in a very large dataset? The dataset includes multiple cell types, immune and non-immune, and I am trying to separate them, so far unsuccessfully. Perhaps you could recommend some parameters to change for a better separation.

orenbenkiki commented 1 month ago

Metacells lives and dies by the lateral genes list. The main tool we use for this is looking at the markers heatmap in MCView and searching for genes that "shouldn't be there" - that is, strong marker genes (which were therefore used by the algorithm to collect the metacells) that reflect biology irrelevant to the question at hand (e.g., cell cycle, stress, hypoxia, etc.). Adding these to the lateral genes list and recomputing the metacells should help. You may have to repeat this process a few times.
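For readers who keep their lateral genes list as a mix of exact names and name patterns, the bookkeeping step of that loop can be sketched with plain pandas. This is only an illustration under assumptions: `lateral_gene_mask` is a hypothetical helper, the gene names and patterns are examples rather than a recommended list, and the `lateral_gene` column name is assumed here, so check what your metacells version expects before storing the mask.

```python
import re

import numpy as np
import pandas as pd


def lateral_gene_mask(gene_names, names=(), patterns=()):
    """Return a boolean Series marking genes to treat as lateral.

    Hypothetical helper: a gene is marked if it appears in `names` or if
    its full name matches any regular expression in `patterns`.
    """
    index = pd.Index(gene_names)
    mask = np.asarray(index.isin(list(names)))
    for pattern in patterns:
        regex = re.compile(pattern)
        mask = mask | np.array([regex.fullmatch(g) is not None for g in index])
    return pd.Series(mask, index=index, name="lateral_gene")


# Toy gene list; in practice this would be the gene names of your data.
gene_names = ["MKI67", "TOP2A", "HSPA1A", "CD3E", "RPL3", "RPS6"]

lateral = lateral_gene_mask(
    gene_names,
    names={"MKI67", "TOP2A"},          # example cell-cycle genes
    patterns=["RP[LS].*", "HSP.*"],    # example ribosomal / stress patterns
)
print(lateral)
```

Each round of the iteration then amounts to extending `names` or `patterns` with the genes that "shouldn't be there" in the markers heatmap, rebuilding the mask, and recomputing the metacells.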

Another, separate issue is strong batch effects - due to different technologies, protocols, or similar reasons. You can detect this by viewing the % of the cells from each batch in the metacells - ideally it should be pretty uniform within each "cell type". If you see metacells that come from only one batch, even though they "should" be the same as others, then use differential expression to try to figure out which genes are to blame, and either mark them as lateral if possible or use some pre-processing of the batches to fix the issue. These issues sometimes get messy, since you have to decide whether the batch differences are real or technical...
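The per-batch percentages can also be tabulated directly from the per-cell annotations. The following is a minimal sketch assuming a per-cell table with a `metacell` column and a `batch` column (both column names are assumptions, as are the hypothetical helpers and the 0.9 threshold); it flags metacells whose cells come almost entirely from one batch.

```python
import pandas as pd


def batch_fractions(obs, metacell_col="metacell", batch_col="batch"):
    """Fraction of each metacell's cells that come from each batch."""
    return pd.crosstab(obs[metacell_col], obs[batch_col], normalize="index")


def single_batch_metacells(fractions, threshold=0.9):
    """Metacells in which one batch contributes at least `threshold` of the cells."""
    top = fractions.max(axis=1)
    return top[top >= threshold].index.tolist()


# Toy per-cell table: metacell 0 is mixed, metacell 1 comes from one batch.
obs = pd.DataFrame(
    {
        "metacell": [0, 0, 0, 0, 1, 1, 1, 1],
        "batch": ["a", "b", "a", "b", "a", "a", "a", "a"],
    }
)

fractions = batch_fractions(obs)
suspects = single_batch_metacells(fractions)
print(fractions)
print(suspects)
```

A flagged metacell is only a candidate for follow-up. If the batches have very different sizes or genuinely contain different cell types, a single-batch metacell may be real biology rather than a technical artifact.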

Either way, getting good metacells is an iterative process - we always have to go through this cycle a few times (and also remove doublet cells and/or other "junk" cells in the process) to get a high-quality result.

LoaiNaom commented 1 month ago

@orenbenkiki Yes, I tried using MCView, but I ran into some issues with it in Python, and its usage is not very clear to me. I'll use the heatmaps, as you did in the tutorial. Thanks a lot for your help!