tanaylab / metacells

Metacells - Single-cell RNA Sequencing Analysis
MIT License

How to pick the lateral genes? #73

Closed LoaiNaom closed 1 month ago

LoaiNaom commented 1 month ago

Thanks for this awesome tool! I have several questions and would very much appreciate your answers.

1) This question is about lateral genes. If I understood correctly, these are genes that we don't want to take part in the metacells calculations but still want to keep in the adata object for other uses later on. Is there a way to decide which genes should be marked as lateral, or do you have examples of genes that are commonly marked as lateral?

2) I didn't understand the purpose of metacells.pipeline.select.extract_selected_data. The data has already been filtered according to multiple thresholds, so why would I need this?

3) Regarding find_metacells_marker_genes: why would I need this other than for plotting a KNN graph or UMAP with compute_knn_by_markers and compute_umap_by_markers? Is it strictly for visualization purposes?

4) Which genes would you recommend removing with metacells.pipeline.exclude.exclude_genes? For example: gender genes, housekeeping genes, mitochondrial genes, or others?

orenbenkiki commented 1 month ago

1 - Eventually we'll publish lists in https://github.com/tanaylab/Gmara - at least, that's the plan. Right now, I'm afraid we don't have pre-packaged lateral gene lists.

2 - This selects the subset of genes which actually contain a meaningful signal to use for computing the metacells. For one thing, it never selects lateral genes. It also ensures the selected genes aren't "too uniform". Note that this is done separately on each pile (when doing divide-and-conquer), so the list of selected genes differs between piles (e.g., genes that distinguish T cells from all other cell types might not be selected in a pile consisting only of T cells; instead, the selected genes would be those that distinguish between the T cells within that pile).
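To illustrate the per-pile behaviour described above, here is a toy sketch in plain Python. It is not the metacells implementation: the `select_genes` helper, the variance-to-mean criterion, the `min_dispersion` threshold and the UMI counts are all assumptions made up for this example, standing in for whatever "not too uniform" test the library actually applies.

```python
from statistics import mean, pvariance


def select_genes(pile, lateral, min_dispersion=1.5):
    """Toy selection: skip lateral genes, keep genes that vary within the pile.

    `pile` maps a gene name to its UMI counts across the cells of one pile.
    The variance/mean cutoff is a hypothetical stand-in for the real criterion.
    """
    selected = []
    for gene, counts in pile.items():
        if gene in lateral:
            continue  # lateral genes are never selected
        avg = mean(counts)
        if avg > 0 and pvariance(counts) / avg >= min_dispersion:
            selected.append(gene)  # not "too uniform" in this pile
    return selected


lateral = {"MKI67"}  # hypothetical lateral gene (cell cycle)

# Pile with two T cells and two B cells (made-up counts).
mixed_pile = {
    "CD3E": [10, 10, 0, 0],
    "CD8A": [8, 0, 0, 0],
    "ACTB": [20, 21, 20, 19],
    "MKI67": [0, 12, 0, 12],
}

# Pile with four T cells only (made-up counts).
t_only_pile = {
    "CD3E": [10, 11, 10, 9],
    "CD8A": [8, 0, 9, 0],
    "ACTB": [20, 19, 21, 20],
    "MKI67": [0, 12, 0, 12],
}

mixed_selected = select_genes(mixed_pile, lateral)
t_only_selected = select_genes(t_only_pile, lateral)
print(mixed_selected)
print(t_only_selected)
```

In this toy data CD3E separates T cells from B cells, so it is selected in the mixed pile but not in the T-only pile, where it is nearly constant. MKI67 varies in both piles but is skipped because it is marked lateral.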

3 - Marker genes are those that are "significantly different" between the metacells. These are the genes we focus on when analyzing the biological behavior of the different cell types, so of course we have many visualization tools for showing them.

4 - The answer is, alas, similar to 1. That said, most people would want to exclude mitochondrial genes (as suggested in the vignette). Housekeeping genes are trickier - typically we just mark them as lateral, as they are sometimes correlated with other gene programs.

LoaiNaom commented 1 month ago

@orenbenkiki Thank you!

How come, when I explicitly pass excluded_gene_patterns = ['^HSP', '^MT'] to the exclude_genes function, it ignores the patterns completely and does not filter out those genes, yet filters out other genes? Perhaps excluded_gene_patterns and excluded_gene_names are just suggestions to the function and do not force it to remove those genes? It did filter out other genes that I didn't tell it to remove.

orenbenkiki commented 1 month ago

Try ^HSP.* and ^MT.* to match the whole gene name? Also, I think that for mitochondrial genes the proper pattern would be ^MT-.* because there are some MTsomething genes (no -) which aren't mitochondrial.
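The difference between these patterns can be checked with Python's standard `re` module. This is a minimal sketch that assumes, as the reply suggests, that a pattern must match the whole gene name; the `full_matches` helper and the gene list are made up for illustration and are not part of metacells.

```python
import re

# A few real human gene symbols, chosen for illustration.
gene_names = ["MT-CO1", "MT-ND1", "MTOR", "MTHFR", "HSPA1A", "ACTB"]


def full_matches(pattern, names):
    """Return the names that the pattern matches in full (hypothetical helper)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


# '^MT' only describes the two-letter prefix, so no whole name matches it.
prefix_only = full_matches("^MT", gene_names)

# '^MT.*' matches every name starting with MT, including non-mitochondrial genes.
loose = full_matches("^MT.*", gene_names)

# '^MT-.*' requires the dash, keeping only the mitochondrial-style names.
strict = full_matches("^MT-.*", gene_names)

print(prefix_only)
print(loose)
print(strict)
```

Under whole-name matching, `^MT` selects nothing, `^MT.*` also catches MTOR and MTHFR, and `^MT-.*` keeps only MT-CO1 and MT-ND1.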

LoaiNaom commented 1 month ago

Very helpful! So would a metacells pipeline look something like this? I'm asking because not all functions are 100% clear to me, so I'm just making sure. Some of these were taken from this vignette

exclude_genes --> exclude_cells --> extract_clean_data --> mark_lateral_genes --> extract_selected_data --> divide_and_conquer_pipeline --> collect_metacells

Does this seem right?

orenbenkiki commented 1 month ago

extract_selected_data is done for you internally by the divide_and_conquer_pipeline. Otherwise, yes.

LoaiNaom commented 1 month ago

@orenbenkiki When running divide_and_conquer_pipeline with the default parameter values, I get the following output, and I would very much appreciate some guidance on it:

set adata.var[rare_gene]: 180 true (0.8069%) out of 22308 bools
set adata.var[rare_gene_module]: 22128 outliers (99.19%) and 180 grouped (0.8069%) out of 22308 int32 elements with 14 groups with mean size 12.86
set adata.obs[cells_rare_gene_module]: 2692436 outliers (99.63%) and 9996 grouped (0.3699%) out of 2702432 int32 elements with 14 groups with mean size 714
set adata.obs[rare_cell]: 9996 true (0.3699%) out of 2702432 bools

Could you please clarify what the rare genes and rare cells are? As you can see, I have very few of them, and most of the data is marked as outliers for some reason. What do these outliers mean, and how does all this affect the metacells calculations?
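As a sanity check on how to read these lines, the reported percentages and mean sizes can be reproduced from the raw counts with plain arithmetic. The reading in the comments, that "outliers" counts the entries not assigned to any rare gene module, is an assumption based on the numbers adding up and is not confirmed in this thread. The `pct` helper and the four-significant-digit formatting are likewise choices made for this sketch.

```python
def pct(part, total):
    """Percentage formatted to four significant digits (a choice for this sketch)."""
    return format(100 * part / total, ".4g")


n_genes, n_cells, n_modules = 22308, 2702432, 14
rare_genes, rare_cells = 180, 9996

# Assumed reading: everything that is not in a rare gene module is an "outlier".
gene_outliers = n_genes - rare_genes
cell_outliers = n_cells - rare_cells

report = {
    "rare genes %": pct(rare_genes, n_genes),
    "gene outliers %": pct(gene_outliers, n_genes),
    "rare cells %": pct(rare_cells, n_cells),
    "cell outliers %": pct(cell_outliers, n_cells),
    "mean genes per module": format(rare_genes / n_modules, ".4g"),
    "mean cells per module": format(rare_cells / n_modules, ".4g"),
}

for label, value in report.items():
    print(f"{label}: {value}")
```

The outlier counts computed this way (22128 genes and 2692436 cells) equal the ones in the log, which is consistent with "outliers" here simply meaning "not part of any of the 14 rare gene modules".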