General Recommendations for preprocessing RNA-Seq data

nijibabulu commented 6 months ago

We are generally interested in detecting perturbed regulator proteins and their networks using causal networks in a de-novo fashion, which would in theory make sense to apply inverse CARNIVAL to solve. However, I am a unclear on how to preprocess the input genes and network to get a reliably solvable problem. My questions are generally:

Is there a current general guideline for how to preprocess inputs for CARNIVAL based off of single-contrast DE runs (either limma, DESeq2 or edgeR) as well as PKNs?
Is there also a way to estimate whether the input PKN+measurement combination may not be solvable?

Below I discuss what I have tried so far.

I noticed that in the available examples, the PKNs and measurement inputs are filtered by some subset of genes, usually TFs and pathway genes. (Below)

https://github.com/saezlab/transcriptutorial https://github.com/saezlab/CausalToxNet https://github.com/harrytr/p53

These often have pre-loaded networks and TFs, but I am trying to develop a pipeline. I tried two different methods based on those and got very different outcomes in terms of converging on a solution. I am starting with 17.5k gene statistics (based on running limma) as measurements. I then used the following:

ULM/collectri

# start with gene statistics in gene_mat 
net <- decoupleR::get_collectri(organism = 'human', split_complexes = FALSE)
gene_tf_ulm <- decoupleR::run_ulm(mat = gene_mat, net = net, .source = "source",
      .mor = "mor", .target = "target")
gene_tf_acts <- tibble::deframe(gene_tf_ulm[,c("source", "score")])
net_pruned <- CARNIVAL::prune_network(net, names(gene_tf_acts), names(gene_tf_acts)) |>
    dplyr::select(source, interaction=mor, target)
CARNIVAL::generateLpFileCarnival(measurements = gene_tf_acts,
                                 priorKnowledgeNetwork = net_pruned)

VIPER/Omnipath

This is adapted more or less from the p53 repository

regulon_df <- OmnipathR::import_tfregulons_interactions(organism = 9606)
network <-
    regulon_df |>
    dplyr::filter(is_directed == 1, is_inhibition + is_stimulation == 1,
                  dorothea_level %in% c("A", "B", "C")) |>
    dplyr::transmute(source = source_genesymbol,
                     interaction = -1 * is_inhibition + is_stimulation,
                     target = target_genesymbol) |>
    dplyr::distinct()

TF_activities <- decoupleR::run_viper(gene_t_mat, dorothea::dorothea_hs_pancancer,
    .source = "tf", .target = "target", .mor = "mor")
measurements <- TF_activities[,c("source", "score")]  |> tibble::deframe()
pathway_activities <-
    progeny::progeny(as.matrix(measurements), scale = TRUE, top = 100, perm = 1000, organism = "Human")

# snip: convert these activities to progeny_weights via the progenyMembers.Rdata

CARNIVAL::generateLpFileCarnival(measurements = measurements,
                                   priorKnowledgeNetwork = network,
                                   weights = progeny_weights)

The VIPER/Omnipath method is solvable quickly with CBC, but as far as I am aware from what I am reading, is somewhat out of date in comparison to the previous method.

Thank you for your help!

gabora commented 6 months ago

Hello

see the tutorial on how to produce CARNIVAL inputs, i.e. estimating TF activity: https://saezlab.github.io/decoupleR/articles/tf_bk.html

We are not using all DE genes as "measurements" in CARNIVAL. CARNIVAL models how sinaling leads to the regulators of differential gene expression, ie. to transcription factors. This is also why our prior knowledge network is a protein-protein interaction network and not a gene regulatory network.

About the solvability of problems: I cannot think of a case when the CARNIVAL optimization would lead to infeasibility, i.e. when some constraints are contradictory. I guess what you observe is the numerical issues with CBC. We also observed strong limitation with the non-commercial optimizers and thus we recommend Gurobi or cplex that are free in academia.

If you would work with this in a non-academic setting, I would suggest to try to eliminate nodes and edges from the prior knowledge network. We used a couple of strategies in cosmos (https://github.com/saezlab/cosmosR), e.g. (1) remove nodes from the network that are not expressed in any condition based on the RNAseq data (2) remove nodes that are not connected with a directed path to the measurement (they are not observable anyways) (3) remove nodes that are further than N number of steps from the inputs/measurements. E.g. N = 8. The rationale here is that the more steps we do on the network the less likely that our prediction makes any sense.

From these, mostly (3) has the strongest impact on the computational time. Maybe you need to go down to N=5 for CBC.

nijibabulu commented 6 months ago

Thank you for this very clear answer! This is all very helpful.

saezlab / CARNIVAL

General Recommendations for preprocessing RNA-Seq data #99

ULM/collectri

VIPER/Omnipath