How to determine the value of the "gate" to filter cells

xnaer commented 6 years ago

Hi,

When I was about to filter out cells, I found that one of the options in "CreateSeuratObject" is "min.genes", but I don't know how to determine this value. Can anyone help me? Thanks.

Best,

Na

mschilli87 commented 6 years ago

@xnaer: Why did you close the issue? Did you find a solution? Would you mind sharing it with others? :wink:

xnaer commented 6 years ago

@mschilli87 Hi, I think it depends on the user's dataset. I set this value to "0" and checked the gene plot, then filtered cells according to the plot. This is the gene plot of my dataset. Three cells are separated from the others, so I set the low.thresholds of "nGene" to 500 to filter out these 3 cells. I think it works well. Is this right?
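
As a rough illustration of the thresholding described above, here is a minimal sketch in plain Python (it is not Seurat code). It counts detected genes per cell in a toy count matrix and keeps the cells at or above a lower cutoff. The matrix, the cell names, and the helper names `genes_per_cell` and `filter_cells` are made up for illustration, and the cutoff is arbitrary: the value of 500 mentioned above was read off a plot of one dataset and is not a general recommendation.

```python
# Toy sketch, not Seurat: count detected genes per cell and apply a lower cutoff.
# All names and numbers here are illustrative assumptions.

def genes_per_cell(counts):
    """counts maps cell name -> list of per-gene counts.
    A gene counts as 'detected' in a cell if its count is > 0."""
    return {cell: sum(1 for c in row if c > 0) for cell, row in counts.items()}

def filter_cells(counts, min_genes):
    """Keep cells whose number of detected genes is at least min_genes."""
    n_gene = genes_per_cell(counts)
    return {cell: row for cell, row in counts.items() if n_gene[cell] >= min_genes}

counts = {
    "cell_a": [5, 0, 2, 1, 7],
    "cell_b": [0, 0, 1, 0, 0],  # very few detected genes
    "cell_c": [3, 4, 0, 2, 1],
}
n_gene = genes_per_cell(counts)
kept = filter_cells(counts, min_genes=3)
print(n_gene)
print(sorted(kept))
```

Plotting the per-cell values first and only then picking the cutoff, as done above, keeps the threshold tied to the dataset rather than to a fixed number.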

leonfodoulian commented 6 years ago

Hi @xnaer,

The question you raised here is, in my opinion, a very delicate one. Filtering cells includes removing low-quality cells (i.e. cells that yield low mRNA levels, apoptotic cells, etc.) as well as multiplets (i.e. doublets, triplets, etc.). A common way to detect such cells is to use the number of expressed (detected) genes and the number of detected UMIs per cell. However, these measures vary drastically with cell type, which makes the task even more difficult for very large datasets containing many diverse cell types. In my experience, I saw at least a 4-fold difference in the number of detected genes between supporting cells and neurons! Don't forget about rare cell types too... If you are using Fluidigm's C1 IFC (HT or not) to capture your cells, you have the advantage of being able to inspect your cells visually. I am happy to share one of the figures I generated several months ago (which will probably never make it into a publication) using cells that were captured and visually inspected with Fluidigm's C1 HT IFC system.

scRNAseq-e20170403_LFJTannotations20170620_qc_plots.pdf

You can see from the plot that many visually empty capture chambers can be mistaken for cells when judged only by the measures listed above (that is, without visual inspection), while doublets can barely be distinguished from actual cells. Also, in many cases, based on additional metrics that are not included in the figure, some capture chambers that look empty exhibit cell-like expression patterns of landmark genes (those are the ones I called Celluloids). More careful visual inspection of these celluloids hints at the presence of putative cells (highlighted in red in the plots). All of this makes it very difficult to reach a consensus on how to filter cells accurately without losing rare cell types.

Another practice that is less common, but that is becoming more prevalent in single-cell RNA-seq publications (at least that is my impression from reading recent papers), is filtering for low-quality cells or multiplets post-clustering. This approach is interesting in itself, because the cells that ought to be filtered out exhibit gene expression patterns that are very different from those of good-quality cells, and therefore cluster together. By plotting gene expression levels onto tSNE plots, one can distinguish these cells fairly well and filter them out. So you might want to apply this second approach too, and filter your cells both pre- and post-clustering.
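
To make the post-clustering idea a bit more concrete, here is a small plain-Python sketch (not Seurat code) that summarizes the number of detected genes per cluster and flags clusters whose median is far below the overall median. The field names, the helper names, and the `fraction=0.5` cutoff are assumptions chosen for illustration. Since the number of detected genes varies strongly between cell types, a flagged cluster is only a candidate for closer inspection, not a verdict.

```python
from collections import defaultdict
from statistics import median

def cluster_medians(cells):
    """Median number of detected genes per cluster.
    cells is a list of dicts with hypothetical keys 'cluster' and 'n_gene'."""
    by_cluster = defaultdict(list)
    for cell in cells:
        by_cluster[cell["cluster"]].append(cell["n_gene"])
    return {cluster: median(values) for cluster, values in by_cluster.items()}

def flag_low_clusters(cells, fraction=0.5):
    """Return clusters whose median n_gene is below fraction * overall median.
    The fraction is an arbitrary illustrative choice."""
    overall = median(cell["n_gene"] for cell in cells)
    medians = cluster_medians(cells)
    return sorted(c for c, m in medians.items() if m < fraction * overall)

cells = [
    {"cluster": 0, "n_gene": 2000}, {"cluster": 0, "n_gene": 2200}, {"cluster": 0, "n_gene": 1800},
    {"cluster": 1, "n_gene": 2500}, {"cluster": 1, "n_gene": 2600}, {"cluster": 1, "n_gene": 2400},
    {"cluster": 2, "n_gene": 300},  {"cluster": 2, "n_gene": 400},  {"cluster": 2, "n_gene": 350},
]
medians = cluster_medians(cells)
low = flag_low_clusters(cells)
print(medians)
print(low)
```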

Best, Leon

xnaer commented 6 years ago

@leonfodoulian Hi Leon,

Thanks. I have benefited a lot from your reply. As for the second approach, how do I identify the low-quality cluster? Should I check the expression of housekeeping genes?

Best,

Na

leonfodoulian commented 6 years ago

Hi,

A colleague of mine once performed qPCR on housekeeping genes across various tissues, and you can immediately see that the expression of many housekeeping genes is tissue dependent (and most probably cell type dependent too), replicating previous observations. For example, Actb is absent in skeletal muscle cells. So it is risky to use these genes as your sole criterion.

One additional thing you can do is to set a criterion on the number of differentially expressed genes (or marker genes) per cluster. You can then exclude cells belonging to clusters that have a low number of marker genes. Also, inspect those genes carefully (e.g. compare the number of protein-coding genes to that of pseudogenes). In one of my experiments, I had a cluster that had only gene models as markers. When it comes to multiplets, try to use prior knowledge from other types of experiments to identify clusters that express two (or more) markers known to be cell type exclusive. You can also perform FISH experiments after clustering and differential expression analysis to confirm the validity of some of the clusters (which in my opinion should become common practice in single-cell RNA-seq experiments). However, keep in mind that scRNA-seq experiments are noisy, so you may encounter some cells of one type expressing exclusive markers of other cell types. Your inspection should therefore be done at the cluster level rather than at the individual cell level.
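
The two cluster-level checks above could be sketched roughly as follows in plain Python (not Seurat code). The first helper counts markers per cluster and how many of them are protein coding; the second computes the fraction of cells in a cluster that co-express two supposedly exclusive markers. The gene names, biotype labels, and function names are hypothetical, and what counts as "too few markers" or "too much co-expression" is left to the analyst.

```python
def marker_summary(markers):
    """markers maps cluster -> list of (gene, biotype) tuples (hypothetical input format).
    Returns the marker count and protein-coding marker count per cluster."""
    summary = {}
    for cluster, genes in markers.items():
        n_coding = sum(1 for _, biotype in genes if biotype == "protein_coding")
        summary[cluster] = {"n_markers": len(genes), "n_protein_coding": n_coding}
    return summary

def coexpression_fraction(cluster_cells, gene_a, gene_b):
    """Fraction of cells in one cluster with non-zero counts for both genes.
    cluster_cells is a list of dicts mapping gene -> count."""
    if not cluster_cells:
        return 0.0
    both = sum(
        1 for expr in cluster_cells
        if expr.get(gene_a, 0) > 0 and expr.get(gene_b, 0) > 0
    )
    return both / len(cluster_cells)

markers = {
    "c0": [("GeneA", "protein_coding"), ("GeneB", "protein_coding"), ("GeneC", "pseudogene")],
    "c1": [("Model1", "predicted_gene")],  # a cluster with only a gene model as marker
}
summary = marker_summary(markers)

cluster_cells = [
    {"MarkerX": 3, "MarkerY": 2},
    {"MarkerX": 1, "MarkerY": 0},
    {"MarkerX": 4, "MarkerY": 1},
    {"MarkerY": 5},
]
frac = coexpression_fraction(cluster_cells, "MarkerX", "MarkerY")
print(summary)
print(frac)
```

Working with a per-cluster fraction rather than a per-cell call follows the point made above: a few noisy cells should not drive the decision.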

There are surely other ways to approach this too. Maybe others who are interested in this issue and have something to say about it can share their experience.

Best, Leon

yueqiw commented 6 years ago

@leonfodoulian Thanks for the explanation.

I have two questions about the post-clustering filtering approach you mentioned. (1) I have encountered clusters that have low nGene and nUMI and show very few significant marker genes. How do we know for sure that they are low-quality cells rather than some unknown cell types/states during development?

(2) About filtering multiplets. Sometimes multiplets can still cluster with normal single cells, and because different cell types may have different numbers of nUMI and nGene, some multiplets cannot be filtered out using a global threshold prior to clustering. An alternative approach would be to look at each individual cluster separately and filter out the cells with high nGene or nUMI compared to the rest of the cells in the same cluster. Is this a valid approach to remove potential multiplets?

Thanks!

leonfodoulian commented 6 years ago

Hi,

(1) I have encountered clusters that have low nGene and nUMI and show very few significant marker genes. How do we know for sure that they are low-quality cells rather than some unknown cell types/states during development?

In my opinion, the only way to validate this cluster as a rare cell type is to investigate its list of marker genes further, e.g. by testing them with complementary approaches such as in situ hybridisation. Regarding the developmental state, it all depends on your prior knowledge of the tissue you are studying. If you expect to have cells undergoing division, you might want to perform complementary analyses that could support this hypothesis. There are a lot of tools out there that you can use to answer this specific question. The main idea is that whatever clusters you get, the only way to make sense of them is to go back to the biology of the tissue you are studying.

(2) About filtering multiplets. Sometimes multiplets can still cluster with normal single cells, and because different cell types may have different numbers of nUMI and nGene, some multiplets cannot be filtered out using a global threshold prior to clustering. An alternative approach would be to look at each individual cluster separately and filter out the cells with high nGene or nUMI compared to the rest of the cells in the same cluster. Is this a valid approach to remove potential multiplets?

In my opinion, this absolutely makes sense. However, I was never able to distinguish between singlets and multiplets based on nGene or nUMI. Also, singlets that are large cells have more expressed genes and UMIs than smaller cells. Before filtering such cells, you might want to check gene expression levels and the co-expression of genes that you would not normally expect to be expressed together. The only drawback is that if your multiplet is formed of 2 (or more) cells of the same type, this would not be an efficient way to filter them out. But on average, you would expect higher gene expression levels in multiplets than in singlets.
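
For reference, the per-cluster filtering proposed in question (2) might look roughly like the plain-Python sketch below (not Seurat code). It flags cells whose UMI count is far above the rest of their own cluster, using median plus a multiple of the median absolute deviation (MAD). The MAD rule, the `n_mads=3.0` default, and all names are assumptions made for illustration; as noted above, nGene and nUMI alone may not separate multiplets from singlets, so flagged cells are candidates to inspect rather than confirmed multiplets.

```python
from collections import defaultdict
from statistics import median

def flag_high_outliers(cells, n_mads=3.0):
    """cells is a list of (cell_name, cluster, n_umi) tuples.
    Flags cells with n_umi > cluster median + n_mads * cluster MAD.
    Note: if a cluster's MAD is 0, every cell above the median is flagged."""
    by_cluster = defaultdict(list)
    for name, cluster, n_umi in cells:
        by_cluster[cluster].append((name, n_umi))

    flagged = []
    for members in by_cluster.values():
        values = [v for _, v in members]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        cutoff = med + n_mads * mad
        flagged.extend(name for name, v in members if v > cutoff)
    return sorted(flagged)

cells = [
    # a cluster of small cells with one unusually high cell
    ("s1", "small", 1000), ("s2", "small", 1100), ("s3", "small", 900),
    ("s4", "small", 1050), ("s5", "small", 950), ("s6", "small", 2500),
    # a cluster of large cells, all with high but similar UMI counts
    ("l1", "large", 5000), ("l2", "large", 5200), ("l3", "large", 4800),
    ("l4", "large", 5100), ("l5", "large", 4900),
]
flagged = flag_high_outliers(cells)
print(flagged)
```

In this toy data, a single global upper threshold low enough to catch `s6` would also remove the whole cluster of large cells, which is the motivation for comparing each cell to its own cluster.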

I hope I was able to answer your questions!

Best, Leon

worker000000 commented 5 years ago

You really did a great job, professor. So it is hard to turn scRNA-seq analysis into an automated pipeline, given the many limitations of filtering that you mentioned. Thanks a lot.

worker000000 commented 5 years ago

Professor, I am not sure whether "multiplets" should replace "singlets" in your sentence "Also, singlets that are large cells". Thanks a lot.

satijalab / seurat

How to determine the value of the "gate" to filter cells #259