jmbreda / Sanity

Filtering of Poison noise on a single-cell RNA-seq UMI count matrix
GNU General Public License v3.0
65 stars 11 forks source link

Evaluating analytical utility of genes with Sanity metrics #19

Closed pschupp closed 1 year ago

pschupp commented 1 year ago

I appreciated your point that only ~10% of genes have sufficient UMI counts per cell to get reasonable estimates of expression and log fold change. Do you have any idea how the ability to detect differentially expressed genes changes over UMI per cell for a certain gene? This could be quantitated by statistical power for example. In "layman's" terms, do you think it is appropriate to characterize the percent genes with a UMI per cell for a gene greater than one as the effective robust "search space"? Conversely, can genes with a UMI per cell for some gene less than one be said to give unreliable analytical results?

jmbreda commented 1 year ago

We have given quite some though on how to best quantify the ‘signal to noise’ in the expression measurements of each gene and we have come up with two measures that I will explain below. Which one of these is most appropriate to use depends on what you want to do.

The first measure of ‘signal-to-noise’ is a measure of how significantly the expression of a gene is changing across conditions. For calculating this it is easiest to use the log-fold changes $\delta{gc}$ and their error bars $\epsilon{gc}$ (that are available in the extended output). For each gene $g$ you would then calculate $S_g^2 = \sumc (\delta{gc}/\epsilon_{gc})^2/C$, where $C$ is the number of cells. The quantity $S_g^2$ measures the square of the number of standard-deviations that the gene $g$ is away from not varying at all, and the overall probability that the gene is not changing expression at all is proportional to $exp(-C \cdot S_g^2)$. So sorting genes by $S_g^2$ would sort genes by how detectable their expression changes across cells are. This is probably the most appropriate to use when you are going to do an analysis that involves expression of multiple genes. For example, when we ourselves use Sanity to calculate distances between cells, we sort genes by this measure $S_g^2$ and then only use genes where $S_g^2 \cdot C$ is over a cut-off, because when $S_g^2 \cdot C$ is low, then one cannot detect meaningful changes in expression of that gene, and it will essentially only add noise to the calculation of cell-to-cell distances.

The second measure is not a measure of statistical significance, but more a measure of how accurate Sanity’s expression estimates are relative to the total amount of variation in the gene’s expression. Again we would use the log fold-changes $\delta{gc}$ and their error-bars $\epsilon{gc}$. We would then calculate the following fraction: $F_g = [\sumc \delta{gc}^2]/[\sumc \delta{gc}^2+\epsilon_{gc}^2]$. The measure $Fg$ lies between zero and one and measures what fraction of the true total variance in gene expression of gene $g$ is captured by the estimated fold-changes $\delta{gc}$. It is thus a measure of how accurately the gene’s fold-changes across cells can be estimated from the data.

Finally, we of course also report on overall estimate of how variable each gene is across cells $vg$ (in the variance.txt file of the extended output). One can of course also sort by this variance, or by the fraction of it that is in the $\delta{gc}$, i.e. $F_g \cdot v_g$.

What precisely to use really will depend on the situation and can be rather subtle. For example, say you have a gene that is everywhere very low expressed except for in a rare cell type, corresponding to only a small number of cells in your data. This gene can be quite interesting as a potential ‘marker’ for that rare cell type, and in this subset of cells its expression can be well estimated. But its expression estimate will be very noisy for the large majority of cells where it is very low expressed, and this will dominate the fraction $F_g$, and thus $F_g$ might be small. You might thus ‘lose’ such a gene when you sort genes by $F_g$ or $F_g \cdot vg$. To find such genes one might even want to look at $S{max} = maxc [(\delta{gc}/\epsilon_{gc})^2]$, which is high whenever there is at least one cell where one can accurately detect a large fold-change.

I am making these comments just to indicate that, because both estimated expression levels and error bars vary across cells, the situation can be subtle.