openproblems-bio / openproblems-v2

Formalizing and benchmarking open problems in single-cell genomics
MIT License

Gene regulatory network inference (with prior knowledge) #457

Open stkmrc opened 1 month ago

stkmrc commented 1 month ago

Task motivation

Gene Regulatory Network (GRN) inference is pivotal in systems biology, offering profound insights into the complex mechanisms that govern gene expression and cellular behavior. These insights are crucial for advancing our understanding of biological processes and have significant implications in medical research, particularly in developing targeted therapies and understanding disease mechanisms.

Computational Challenges

Despite its importance, GRN inference from single-cell RNA-Seq data is challenged by the high dimensionality of the data, inherent noise, sparsity of both the data and the networks to be inferred, the lack of known negative edges in the GRN (a positive-unlabeled setting), and the ambiguity of possible causal explanations for the data. Available computational approaches often struggle with these issues, leading to inaccurate or overfitted models.

Research Gap

Current methods range from statistical correlations to advanced machine learning, each with limitations in terms of accuracy, data requirements, and interpretability. Multiple benchmarking studies exist, differing in evaluation choices such as the negative-sampling strategy, the metrics used, and the choice of synthetic vs. experimental data. What is missing is a more standardized way of benchmarking using biologically meaningful metrics.

Task description

The task focuses on the inference of GRNs from scRNA-Seq data. It is divided into two subtasks based on the availability of prior knowledge:

  1. GRN Inference without prior knowledge: Inferring GRN solely from scRNA-Seq data.
  2. GRN Inference with prior knowledge: Inferring GRN from scRNA-Seq data using an additional prior knowledge graph (a subset of edges from the ground truth GRN).
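To make subtask 2 concrete, the prior-knowledge graph can be built by holding out a random fraction of the ground-truth edges. A minimal sketch (the function name and the split fraction are illustrative assumptions, not part of the proposal):

```python
import numpy as np

def split_prior_knowledge(true_adj, frac=0.2, seed=0):
    """Sample a random fraction of ground-truth edges as the prior-knowledge
    graph handed to subtask 2; the remaining edges stay hidden for evaluation.

    true_adj: binary (n_genes, n_genes) ground-truth adjacency matrix.
    Returns a binary adjacency matrix containing only the sampled edges.
    """
    rng = np.random.default_rng(seed)
    edges = np.argwhere(true_adj > 0)          # list of (source, target) pairs
    n_prior = int(frac * len(edges))
    chosen = rng.choice(len(edges), size=n_prior, replace=False)
    prior = np.zeros_like(true_adj)
    prior[tuple(edges[chosen].T)] = 1          # set the sampled edges
    return prior
```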

Input Data

Expected Output

The output for both subtasks is a predicted GRN, represented as a graph where nodes are genes and edges indicate regulatory interactions. The quality of the predicted networks can be evaluated in two main ways:

  1. Binary Classification: Each potential interaction (edge) is classified as either present or absent (like this)
  2. Topological Evaluation: The overall structure and properties of the predicted network are assessed (like this)

Proposed ground-truth in datasets

  1. Synthetic, Curated and Experimental datasets from (BEELINE)
  2. Experimental datasets from (this paper)

Initial set of methods to implement

  1. MLPs
  2. Graph Neural Network based diffusion models (GCN / GAT)

Proposed control methods

  1. Pearson / Spearman correlation
  2. Random predictor
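Both proposed controls can be sketched in a few lines on top of numpy (function names are illustrative):

```python
import numpy as np

def correlation_grn(expr):
    """Correlation-baseline control: score every gene pair by the absolute
    Pearson correlation of their expression profiles across cells.

    expr: (n_cells, n_genes) expression matrix.
    Returns an (n_genes, n_genes) edge-score matrix with a zeroed diagonal.
    """
    scores = np.abs(np.corrcoef(expr, rowvar=False))
    np.fill_diagonal(scores, 0.0)  # no self-loops
    return scores

def random_grn(n_genes, seed=0):
    """Random-predictor control: uniform random edge scores."""
    rng = np.random.default_rng(seed)
    scores = rng.random((n_genes, n_genes))
    np.fill_diagonal(scores, 0.0)
    return scores
```

A Spearman variant would rank-transform each gene's profile before calling `np.corrcoef`.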

Proposed Metrics

Binary classification:

  1. Link-quality metrics (AUROC / AUPRC)
  2. Node-quality metrics (Mean Average Precision)
  3. Precision@Top k
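As a numpy-only sketch of how these edge-level metrics could be computed from a predicted edge-score matrix and a binary ground-truth adjacency (ties in the AUROC ranking are ignored for brevity; a production implementation would likely use sklearn.metrics):

```python
import numpy as np

def flatten_edges(scores, true_adj):
    """Drop self-loops and flatten both matrices into edge vectors."""
    mask = ~np.eye(true_adj.shape[0], dtype=bool)
    return true_adj[mask].astype(int), scores[mask]

def auroc(y_true, y_score):
    """Rank-based AUROC (Mann-Whitney statistic); ties are not handled."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(y_true, y_score):
    """Step-wise average precision (area under the precision-recall curve)."""
    order = np.argsort(y_score)[::-1]
    y = y_true[order]
    precision = np.cumsum(y) / np.arange(1, len(y) + 1)
    return precision[y == 1].mean()

def precision_at_k(y_true, y_score, k):
    """Fraction of true edges among the k highest-scored edges."""
    top_k = np.argsort(y_score)[::-1][:k]
    return y_true[top_k].mean()
```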

Topological evaluation:

  1. Information Exchange (Average Shortest Path Length, Global and Local Efficiency)
  2. Hub Topology (Assortativity, Clustering Coefficient, Centralization)
  3. Hub Identification (PageRank, Betweenness, Radiality, Centrality)
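Most of the topological quantities above are available out of the box in networkx; a sketch of a summary function (the selection of statistics is illustrative, and treating the network as undirected is an assumption made for the path-based measures):

```python
import numpy as np
import networkx as nx

def topology_summary(adj):
    """Summarize network topology for comparison against the ground truth.

    adj: binary (n_genes, n_genes) adjacency matrix, treated as undirected.
    Path lengths are computed on the largest connected component so that
    the average shortest path length is well-defined.
    """
    G = nx.from_numpy_array(adj)
    cc = max(nx.connected_components(G), key=len)
    H = G.subgraph(cc)
    return {
        "avg_shortest_path": nx.average_shortest_path_length(H),
        "global_efficiency": nx.global_efficiency(G),
        "clustering": nx.average_clustering(G),
        "assortativity": nx.degree_assortativity_coefficient(G),
    }
```

Hub identification would analogously call e.g. `nx.pagerank(G)` or `nx.betweenness_centrality(G)` and compare the resulting node rankings.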
rcannood commented 1 week ago

Hi @stkmrc !

Thanks for creating this issue! I heard from @janursa that he's also involved in benchmarking gene regulatory network inference methods, but with a different angle on a few things -- mainly concerning what the ground-truth information and metrics are. However, the methods will probably be quite similar.

@janursa would you be willing to discuss what your proposal concerning how to benchmark GRN inference methods for single-cell applications is?

Would be great if we could combine our efforts to get the best of both benchmarking experimental designs.

janursa commented 1 week ago

@stkmrc thanks for creating this issue and @rcannood thanks for involving me. Yes, I would be happy to share our approach and merge our ideas and efforts. I already contacted Marco to set up a talk. I also created a Slack group and added you; we can discuss there and summarize the outcomes here.

LuckyMD commented 1 week ago

Hi @stkmrc, a few comments from my side as well:

Despite its importance, GRN inference from single-cell RNA-Seq data is challenged by the high dimensionality of the data, inherent noise, sparsity of both the data and the networks to be inferred, the lack of known negative edges in the GRN (a positive-unlabeled setting), and the ambiguity of possible causal explanations for the data. Available computational approaches often struggle with these issues, leading to inaccurate or overfitted models.

Wouldn't ground truth generally be an issue as well and not only known negative edges? Even in the examples you suggest, I imagine there are quite a few caveats on the ground truth network structure, no?

Expected Output The output for both subtasks is a predicted GRN, represented as a graph where nodes are genes and edges indicate regulatory interactions.

Do you propose to output weighted edges or just showing direction? If weights are used, what should this signify?

General comments:

LuckyMD commented 1 week ago

Also, @janursa,

What do you think about keeping the discussions on this on github, so we have documentation for future community involvement?

stkmrc commented 1 week ago

@LuckyMD thanks for your comments! Check my answers below:

Wouldn't ground truth generally be an issue as well and not only known negative edges? Even in the examples you suggest, I imagine there are quite a few caveats on the ground truth network structure, no?

Certainly, though it's more a "Computational Challenge" of the evaluation than of the GRN inference task itself.

Do you propose to output weighted edges or just showing direction? If weights are used, what should this signify?

That's a good point - for the AUC metrics we would need weighted edges in the outputs, but not in the ground truth (since it's a binary classification task). There's also the option to evaluate against a weighted ground truth, or even signed (activation/repression) edges, to add more detail - but since most available algorithms already perform poorly on the "easier" binary classification task, I wouldn't add more complexity in the first version.

control methods / shortest path

The topological metrics used in the STREAMLINE paper are computed as the difference between the predicted and ground-truth networks, so to compute them we need the ground-truth control values. These metrics are optimal (= 0) when, e.g., the average shortest path length in the predicted network matches that of the ground-truth network (rather than rewarding ever-shorter paths). Of course, we could also construct a metric from the difference that lies in the range [0, 1] if required - but I like looking at the signed difference because it not only tells you how close the topology value is to the ground truth, but also whether the prediction over- or underestimates it.
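The signed-difference scheme described here can be sketched in a few lines (networkx-based; the choice of default statistic is just an example, not the STREAMLINE implementation):

```python
import numpy as np
import networkx as nx

def signed_topology_difference(pred_adj, true_adj, stat=nx.average_clustering):
    """Signed difference of a topology statistic between the predicted and
    ground-truth networks: 0 is optimal, and the sign indicates whether the
    prediction over- (+) or under-estimates (-) the statistic.
    """
    return stat(nx.from_numpy_array(pred_adj)) - stat(nx.from_numpy_array(true_adj))
```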

Generalization of GRN structures

There exist separate GRNs (e.g. one GRN per species) that I wouldn't consider directly linked as parts of one underlying true network. For these, a "scale-free" topology is often assumed. But I agree, smaller subnetworks can have other topologies even if the underlying larger network is scale-free. That's why in our topological benchmarking paper we didn't focus only on scale-free graphs, and the metrics don't evaluate, for example, "how scale-free" a predicted graph is, but instead how close its topology is to that of the ground truth in a more unbiased way (even if the ground truth is, for example, closer to a small-world network).

Metrics that don't rely on (exact) ground truth

For the experimental datasets this is planned anyway, since no exact ground truth is available there. The McCalla datasets, for example, provide both TF-perturbation-based and TF-ChIP-Seq-based ground-truth networks we can evaluate against. If you have ideas for other resources, or for how to construct something similar for the simulated datasets, I would be happy to include those as well!