tanaylab / metacells

Metacells - Single-cell RNA Sequencing Analysis
MIT License

Exclude genes #6

Closed hurleyLi closed 2 years ago

hurleyLi commented 3 years ago

Hi, thanks for the tutorial; I was able to run through it. I just have a question about the excluded genes. Is there any reference for removing the genes listed in your tutorial: 'IGHMBP2', 'IGLL1', 'IGLL5', 'IGLON5', 'NEAT1', 'TMSB10', 'TMSB4X'? Also, what is the rationale for removing genes from the start (e.g., outliers, extremely high expression, genes known to be biased by the single-cell experiment)? Thanks! Hurley

orenbenkiki commented 3 years ago

The full answer to this is that we are working on a vignette that demonstrates the iterative process - taking raw data, generating metacells (MCs), figuring out which genes need to be excluded/forbidden, and repeating this until we reach "high quality" annotations. However, due to vacations etc., this won't be available for a few months.

The short answer is that the "forbidden" (can't be "feature") genes are genes that describe biological processes which are not relevant to the question being investigated - e.g. cell cycle genes are irrelevant for finding "cell types". These genes are never used to create metacells but we still try to ensure that each metacell is reasonably uniform in their expression level (that is, we will try hard to separate, say, T-cells from B-cells, but having done that, we'd still try to ensure each T-metacell or B-metacell has a single cell cycle state).

In contrast, fully "excluding" genes (taking them out completely) means we don't care at all whether the cells in each metacell have the same or varying expression level of these genes. Also, excluding genes impacts the "total UMIs" we use for viewing gene expression as a fraction. We typically exclude all mitochondrial genes and genes which are highly associated with them so that the "total UMIs" we look at are actually "total except for mitochondria".
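To make the distinction above concrete, here is a minimal NumPy sketch. It does not use the metacells API; the helper names (`gene_masks`, `fractions_after_exclusion`), the toy UMI matrix, and the choice of which genes to exclude or forbid are all hypothetical and only for illustration. It shows that excluded genes are dropped before "total UMIs" are computed, while forbidden genes stay in the data and are merely ineligible as feature genes.

```python
import numpy as np

# Toy data (hypothetical): 2 cells x 5 genes of UMI counts.
gene_names = np.array(["MT-CO1", "CD3E", "CD79A", "MKI67", "NEAT1"])
umis = np.array([
    [50, 10, 0, 5, 35],
    [20, 0, 30, 0, 50],
])

# Hypothetical choices; in practice these are the biologist's call.
excluded = {"MT-CO1", "NEAT1"}  # removed from the data entirely
forbidden = {"MKI67"}           # kept in the data, but never a feature gene


def gene_masks(gene_names, excluded, forbidden):
    """Return boolean masks (excluded, forbidden) over the gene axis."""
    excluded_mask = np.isin(gene_names, list(excluded))
    # A gene that is excluded is gone, so it cannot also be "forbidden".
    forbidden_mask = np.isin(gene_names, list(forbidden)) & ~excluded_mask
    return excluded_mask, forbidden_mask


def fractions_after_exclusion(umis, excluded_mask):
    """Drop excluded genes, then express each gene as a fraction of the
    remaining per-cell total (assumes every cell keeps at least one UMI)."""
    kept = umis[:, ~excluded_mask]
    totals = kept.sum(axis=1, keepdims=True)
    return kept / totals


excluded_mask, forbidden_mask = gene_masks(gene_names, excluded, forbidden)
kept_names = gene_names[~excluded_mask]
fractions = fractions_after_exclusion(umis, excluded_mask)

# Feature candidates are the kept genes that are not forbidden.
feature_candidates = gene_names[~excluded_mask & ~forbidden_mask]

print("kept genes:", kept_names.tolist())
print("feature candidates:", feature_candidates.tolist())
print("fractions:\n", fractions)
```

Note that the forbidden gene still contributes to each cell's total and still has a fraction, so its within-metacell uniformity can be checked later; the excluded genes no longer appear anywhere.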

hurleyLi commented 3 years ago

Thanks for the detailed explanation! It's very clear! I guess I'm still confused about why you specifically excluded 'IGHMBP2', 'IGLL1', 'IGLL5', 'IGLON5', 'NEAT1', 'TMSB10', 'TMSB4X'. Is there any reference or guidance for picking those genes, e.g. extremely high abundance, or anything related to the 10X protocol? I ask because one of those genes happens to be our gene of interest...

orenbenkiki commented 3 years ago

I wasn't the one who did the analysis so I'll have to check. I'm more of a computational guy and I let the biologists make these calls. Which of the genes are you interested in?

cartographerJ commented 3 years ago

I have the same question for the IFI* genes in general

orenbenkiki commented 3 years ago

There's always https://www.genecards.org/ - at the end of the day, what to exclude and what to merely forbid remains the biologist's call, and it depends on both the data set and the biological question being investigated.