Hi everyone! I'm Pau, a PhD student from saezlab. We are mostly interested in extracting mechanistic insights from (single-cell) omics data, and we would like to propose an OpenProblem around the estimation of Transcription Factor Activities from scRNA-seq. Here's a written summary; feedback and contributions are appreciated, thanks!
Transcription factor activity estimation from scRNA-seq
Transcription Factors (TFs) are key regulators of cell identity and fate. Hence, estimating their activities from scRNA-seq data can provide mechanistic insights in many, if not most, scRNA-seq studies. In addition, TF-activity estimates can be used to summarize gene coordination events into a small and interpretable set of features.
The downstream transcriptional targets of a TF yield a much more robust estimate of TF activity than the expression of the TF itself [1,2,3]. A TF-activity method (TFAM) thus requires a gene regulatory network (GRN) in combination with a statistical algorithm that summarizes the expression of the target genes into a single activity score. There are multiple TFAMs for bulk RNA-seq, and some specific to scRNA-seq data, such as metaVIPER [4] or SCENIC [5], which combine diverse GRNs and statistical methods. In previous work we benchmarked both bulk- and scRNA-specific methods on scRNA-seq and found that they seem to be robust to drop-outs and other features of scRNA-seq data [2], and that there are important differences across methods on various in silico and real data benchmarks.
While these results were already informative, we believe that a more systematic and comprehensive analysis is needed to truly determine the quality of the predicted TF activities in different contexts. In particular, we want to include recently developed methods not available in our first benchmark, systematically test combinations of GRNs and statistics, and test the methods in more contexts. For this we suggest leveraging DecoupleR, a package to benchmark TFAMs originally developed for bulk RNA-seq (as explained here).
There are two key challenges: (i) determining which GRN best recapitulates TF activities and (ii) determining the best algorithm. To address these challenges, we propose the following components:
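As a rough illustration of the GRN + statistic idea above, here is a minimal sketch of one of the simplest possible statistics, a signed mean over target genes. The toy network, the gene names, the expression values and the function `weighted_mean_activity` are all made up for illustration; this is not the implementation of any of the methods cited here.

```python
from statistics import mean

# Hypothetical toy GRN: TF -> {target gene: mode of regulation}
# (+1 = activation, -1 = repression). Names are placeholders.
grn = {
    "TF_A": {"g1": 1.0, "g2": 1.0, "g3": -1.0},
    "TF_B": {"g2": -1.0, "g4": 1.0},
}

# Hypothetical scaled expression of a single cell (gene -> z-score).
cell = {"g1": 2.0, "g2": 1.0, "g3": -1.5, "g4": -0.5}


def weighted_mean_activity(expression, targets):
    """Mean target expression, each value multiplied by its regulation sign."""
    shared = [gene for gene in targets if gene in expression]
    if not shared:
        # No measured target for this TF: no activity can be estimated.
        return float("nan")
    return mean(targets[gene] * expression[gene] for gene in shared)


# One activity score per TF for this cell.
activities = {tf: weighted_mean_activity(cell, targets) for tf, targets in grn.items()}
print(activities)
```

In this toy case TF_A comes out as active because its activated targets are up and its repressed target is down, even though the expression of TF_A itself is never looked at.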
Datasets
Expression data:
DoRothEA benchmark data (DBD) [2]: DBD is a curated bulk RNA-seq data-set composed of gene expression data from 124 knockdown and overexpression experiments, covering the perturbation of 62 unique TFs.
PBMCs: public 10X data-set of 3k PBMCs
Perturb-seq [6]: single-cell RNA-seq data-set that contains 26 knock-out perturbations targeting 10 distinct TFs after 7 and 13 days of perturbation.
CRISPRi [7]: single-cell RNA-seq data-set that contains 141 perturbation experiments targeting 50 distinct TFs.
SCIRA benchmark data (SBD) [8]: SBD is a collection of independent single-cell datasets representing differentiation time courses into mature cells for different tissues (Lung, Liver, Kidney, Pancreas). For each tissue there are ~30 TFs that are expected to exhibit higher regulatory activity in the mature cells than in progenitors.
Multiome PBMC: public 10X Multiome dataset (RNA-seq + ATAC-seq) of 3k PBMCs.
If you know of more interesting data-sets, please let us know.
Gene regulatory networks:
DoRothEA [9]
ChEA3 [10]
RegNetwork [11]
More networks are welcome
Methods
For each data-set TF activities will be computed using all possible combinations of gene regulatory networks and algorithms. The vast majority of methods are already implemented in DecoupleR.
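The "all possible combinations" design can be sketched as a plain Cartesian product. The dataset and network labels come from the lists above, while the algorithm labels and the `run_tfam` function are hypothetical placeholders; this is not the DecoupleR interface.

```python
from itertools import product

datasets = ["DBD", "PBMC", "Perturb-seq", "CRISPRi", "SBD", "Multiome PBMC"]
networks = ["DoRothEA", "ChEA3", "RegNetwork"]
# Placeholder algorithm labels, only to show the shape of the grid.
algorithms = ["weighted_mean", "viper"]


def run_tfam(dataset, network, algorithm):
    """Hypothetical stub: a real run would return a cells x TFs activity matrix."""
    return {"dataset": dataset, "network": network, "algorithm": algorithm}


# One run per (dataset, network, algorithm) triple.
runs = [run_tfam(d, n, a) for d, n, a in product(datasets, networks, algorithms)]
print(len(runs), "runs, first one:", runs[0])
```

The number of runs grows multiplicatively with every added network or algorithm, which is worth keeping in mind when deciding how many of each to include.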
Metrics
Silhouette score: Since TF regulation determines cell lineages, one should be able to use TF activities to cluster cells by cell type. A good GRN will yield TF activities that divide cells into lineages, generating high cluster silhouette scores, while an uninformative GRN will cluster them randomly. Dataset to use: PBMCs
Receiver Operating Characteristic (ROC) curve and Precision-Recall curve (PRC): Perturbation data-sets that inform of the expected regulatory outcomes act as a “silver standard” that can be used to compute the areas under the ROC and PR curves. Datasets to use: DBD, Perturb-seq, CRISPRi, and SBD
Correlations: Under the assumption that TF binding potentials estimated from single-cell ATAC data can represent pseudo-true positives of TF activities in single cells, a good GRN coupled with an inference method will estimate TF activities from scRNA-seq that highly correlate with these potentials. Dataset to use: Multiome PBMC
Mean ranking score: To see which combination of GRN+TFAM performs best across all metrics, a mean ranking score can be calculated. First, for each metric the GRN+algorithm combinations are sorted by their performance and a rank is assigned to each of them. Then, the mean of each combination's ranks across metrics can be computed. The top performing network+algorithm will have the lowest mean rank.
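The mean ranking score could be computed roughly as follows. All scores and combination names are invented for illustration. The sketch also assumes that higher is better for every metric and that tied combinations share the average of their ranks; neither choice is specified in the proposal.

```python
from statistics import mean

# Hypothetical scores per metric (higher = better); names are placeholders.
scores = {
    "silhouette":  {"netA+alg1": 0.40, "netA+alg2": 0.25, "netB+alg1": 0.10},
    "auroc":       {"netA+alg1": 0.70, "netA+alg2": 0.80, "netB+alg1": 0.55},
    "correlation": {"netA+alg1": 0.30, "netA+alg2": 0.20, "netB+alg1": 0.35},
}


def rank_descending(metric_scores):
    """Rank 1 = best score; tied combinations get the average of their ranks."""
    ordered = sorted(metric_scores, key=metric_scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of combinations tied with position i.
        while j + 1 < len(ordered) and metric_scores[ordered[j + 1]] == metric_scores[ordered[i]]:
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k]] = average_rank
        i = j + 1
    return ranks


per_metric = {metric: rank_descending(values) for metric, values in scores.items()}
combinations = list(scores["silhouette"])
mean_rank = {c: mean(ranks[c] for ranks in per_metric.values()) for c in combinations}
best = min(mean_rank, key=mean_rank.get)
print(mean_rank, "best:", best)
```

In this toy example no combination wins every metric, and the mean rank still picks the one that is consistently near the top.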
Bibliography
[1] Dugourd, A. & Saez-Rodriguez, J. Footprint-based functional analysis of multiomic data. Current Opinion in Systems Biology 15, 82–90 (2019).
[2] Holland, C. H. et al. Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data. Genome Biol. 21, 36 (2020).
[3] Alvarez, M. J. et al. Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nat. Genet. 48, 838–847 (2016).
[4] Ding, H. et al. Quantitative assessment of protein activity in orphan tissues and single cells using the metaVIPER algorithm. Nat. Commun. 9, 1471 (2018).
[5] Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).
[6] Dixit, A. et al. Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens. Cell 167, 1853-1866.e17 (2016).
[7] Genga, R. M. J. et al. Single-Cell RNA-Sequencing-Based CRISPRi Screening Resolves Molecular Drivers of Early Human Endoderm Development. Cell Rep. 27, 708-718.e10 (2019).
[8] Teschendorff, A. E. & Wang, N. Improved detection of tumor suppressor events in single-cell RNA-Seq data. BioRxiv (2020) doi:10.1101/2020.07.04.187781.
[9] Garcia-Alonso, L., Holland, C. H., Ibrahim, M. M., Turei, D. & Saez-Rodriguez, J. Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Res. 29, 1363–1375 (2019).
[10] Keenan, A. B. et al. ChEA3: transcription factor enrichment analysis by orthogonal omics integration. Nucleic Acids Res. 47, W212–W224 (2019).
[11] Liu, Z.-P., Wu, C., Miao, H. & Wu, H. RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database (Oxford) 2015, (2015).