openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

[Task proposal] Call gene expression from scRNA-Seq #240

Closed AlexWeinreb closed 3 weeks ago

AlexWeinreb commented 3 years ago

Describe the problem concisely.

RNA-Seq provides quantitative measures of RNA abundance in the sequenced cell types. However, a major question of interest is typically qualitative: is the protein encoded by a given gene expressed in a given cell type? Or, in an alternative formulation: what is the probability that a given gene is expressed (or not) in a given cell type?

Propose datasets

CeNGEN: the worm nervous system

The CeNGEN consortium addressed this question in the nervous system of C. elegans, using 10x scRNA-Seq of every neuron type, along with a collection of previously reported expression patterns based on fluorescent reporters ("ground truth"). The annotated sequencing data can be downloaded as a Seurat (4 GB) or Monocle (300 MB) object. The ground truth is available as Table S5 of our preprint. It is compiled mainly from 4 previous publications (the non-neuronal genes come from a variety of sources).

Other datasets

It is easy enough to find other scRNA-Seq datasets; the difficulty is in gathering a high-quality ground truth (see the metrics section below). Other systems for which both scRNA-Seq and high-quality ground truth may be available include:

Input from experts is necessary to judge the availability and quality of the ground truth in these systems.

Propose methods

Simple thresholding (or logistic regression) is a valid approach. In that case, we noticed empirically that thresholding on the proportion of cells in a cluster with at least 1 UMI for a gene was a better predictor than total UMI count and its normalized variants (the estimator from Booeshaghi and Pachter (2021) is equivalent to the proportion it is computed from). We obtained better results using a percentile threshold (each gene is first normalized by its highest proportion across cell types), along with a set of static thresholds to account for genes expressed everywhere or nowhere. The corresponding code is here.
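A minimal Python sketch of the thresholding scheme described above, for illustration only. The function names and the three cutoff values (`frac`, `low`, `high`) are hypothetical placeholders, not the values or code used for CeNGEN, and the way the static thresholds are combined with the percentile threshold is an assumption.

```python
def proportion_detected(umi_counts):
    """Fraction of cells in one cluster with at least 1 UMI for one gene."""
    return sum(1 for c in umi_counts if c >= 1) / len(umi_counts)


def call_expression(props, frac=0.1, low=0.02, high=0.5):
    """Call one gene as expressed or not in each cell type.

    props: {cell_type: proportion of cells with >= 1 UMI} for a single gene.
    frac:  percentile-style threshold on the proportion normalized by the
           gene's highest proportion across cell types (hypothetical value).
    low:   static threshold; a gene whose highest proportion is below it is
           called "expressed nowhere" (hypothetical value).
    high:  static threshold; any proportion at or above it is called
           expressed regardless of normalization (hypothetical value).
    """
    top = max(props.values())
    if top == 0 or top < low:
        # Gene is barely detected anywhere: call it unexpressed everywhere.
        return {cell_type: False for cell_type in props}
    calls = {}
    for cell_type, p in props.items():
        calls[cell_type] = (p >= high) or (p / top >= frac)
    return calls


# Toy proportions for three genes across three cell types.
specific = {"AWA": 0.80, "ASE": 0.04, "AVA": 0.00}
nowhere = {"AWA": 0.01, "ASE": 0.00, "AVA": 0.01}
everywhere = {"AWA": 0.90, "ASE": 0.60, "AVA": 0.95}

for name, props in [("specific", specific), ("nowhere", nowhere),
                    ("everywhere", everywhere)]:
    print(name, call_expression(props))
```

Normalizing by the per-gene maximum makes the threshold relative to how well each gene is captured, while the static cutoffs handle the cases where the maximum itself is not informative.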

Davis, Nern et al. (2020) called gene expression in the fly optic lobe, where each cell type was sequenced by TAPIN-Seq, using a Bayesian approach to mixture modeling (fitting both a unimodal and a bimodal distribution to each gene, then using cross-validation with PSIS-LOO to select the best fit). As their data was not generated by droplet-based sequencing, it has a different statistical behavior and this method is not immediately applicable. The corresponding code is here.

Propose metrics

The metrics directly useful to the experimentalist would be the True Positive Rate, False Discovery Rate, and False Positive Rate, evaluated on an unbiased set of benchmark genes. Alternative measures (e.g. AUC) letting the end-user choose their preferred TPR/FDR are also suitable.
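As a sketch of how these three rates could be computed from binary expression calls, assuming predictions and ground truth are both stored as dictionaries keyed by (gene, cell type) pairs. The function name `rates` and the toy data are hypothetical.

```python
def rates(pred, truth):
    """Return TPR, FDR and FPR for binary expression calls.

    pred, truth: {(gene, cell_type): bool}, with the same keys in both.
    A rate whose denominator is zero is returned as None.
    """
    tp = fp = tn = fn = 0
    for key, called in pred.items():
        actual = truth[key]
        if called and actual:
            tp += 1
        elif called and not actual:
            fp += 1
        elif not called and actual:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else None

    return {
        "TPR": ratio(tp, tp + fn),  # expressed pairs that were called
        "FDR": ratio(fp, tp + fp),  # calls that are wrong
        "FPR": ratio(fp, fp + tn),  # unexpressed pairs that were called
    }


# Toy benchmark: 2 genes x 3 cell types.
pred = {("g1", "A"): True, ("g1", "B"): True, ("g1", "C"): False,
        ("g2", "A"): False, ("g2", "B"): True, ("g2", "C"): False}
truth = {("g1", "A"): True, ("g1", "B"): False, ("g1", "C"): False,
         ("g2", "A"): True, ("g2", "B"): True, ("g2", "C"): False}
print(rates(pred, truth))
```

Sweeping the calling threshold and recomputing these rates at each value gives the curves from which an AUC-style summary could be derived.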

The difficulty in these metrics is that they typically require the evaluation of gene expression in every cell type in the system with an orthogonal method. Indeed, if a portion of the cell types in a system have not been examined, the FDR cannot be evaluated reliably.
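To illustrate why unexamined cell types are a problem, here is a small sketch in which missing ground truth is encoded as `None`. The function `fdr_with_gaps` and the toy data are hypothetical; the point is only that the computed FDR depends on how the unexamined entries are handled.

```python
def fdr_with_gaps(pred, truth, unexamined_as_negative):
    """FDR over positive calls when some ground-truth entries are missing.

    truth values are True, False, or None (cell type never examined).
    If unexamined_as_negative is True, a call in an unexamined cell type
    counts as a false positive; otherwise it is left out of the estimate.
    """
    tp = fp = 0
    for key, called in pred.items():
        if not called:
            continue
        actual = truth.get(key)
        if actual is None:
            if unexamined_as_negative:
                fp += 1
            continue
        if actual:
            tp += 1
        else:
            fp += 1
    return fp / (tp + fp) if (tp + fp) else None


# One gene called in four cell types; only A and B were examined.
pred = {"A": True, "B": True, "C": True, "D": True}
truth = {"A": True, "B": False, "C": None, "D": None}

pessimistic = fdr_with_gaps(pred, truth, unexamined_as_negative=True)
restricted = fdr_with_gaps(pred, truth, unexamined_as_negative=False)
print(pessimistic, restricted)
```

Counting unexamined cell types as negatives gives 3 false calls out of 4, while restricting to examined cell types gives 1 out of 2. If the gene were truly expressed in both C and D, the real value would be 1 out of 4, so neither estimate can be trusted without complete coverage.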

A final note: as biological processes downstream of transcription might influence protein expression, RNA-Seq-based methods have an upper bound on their predictive capabilities. For this open problem to be considered fully solved, this upper bound needs to be estimated. See for example Buccitelli and Selbach (2020).

LuckyMD commented 3 years ago

Hi @AlexWeinreb,

Thanks for proposing this task. Just to clarify... you're proposing a task to assess whether a protein is expressed in a particular dataset dependent on gene expression data. Would this be similar to the "regulatory effect prediction" task but for RNA -> protein instead of ATAC -> RNA? That would be inferring protein abundance from scRNA-seq data.

Or are you viewing this as a denoising task for the same modality? That would mean asking whether a gene is actually expressed or not when it has 0 reads in a dataset.

I think it would be helpful to frame the task in one of these two ways. There are many methods available, especially for denoising.

AlexWeinreb commented 3 years ago

Hi @LuckyMD,

Yes, it's definitely a case of denoising. But I was under the impression the current denoising task focuses on the imputation aspect ("if a gene has 0 reads, is it a biological 0 or a drop-out?"), whereas this focuses on the thresholding ("if a gene has 1 read, is it biologically meaningful or background noise? What about 2 reads? ..."). More directly relevant are the papers that model noise sources, but, while there is some estimation of a number of mRNA molecules based on spike-ins, I don't think any paper attempted to actually call expression (but I may have missed it)?

It does seem similar to the "Predicting gene expression from chromatin accessibility" task. My impression is that the nature of the data and the metrics will be different enough that grouping them may not help. As in, a single solution is unlikely to work for both aspects, so the task isn't nuclear anymore. Do you think they could be grouped in a helpful way?

github-actions[bot] commented 3 weeks ago

This issue has been automatically closed because it has not had recent activity.