Gene dropout from modules and errors for alternative infer_grn methods

Hi @joschif !

Thank you for this helpful package :) I have a couple of questions!

To start, I will just say - I have already tested a lot of parameters in my multiome dataset and can use the default glm method without issue. I have been constraining my peaks to PhastCons 'Vert 35 El' + ENCODE CREs + any additional peak-gene linkages in my own dataset (I have also tried running versions without constraints). I previously tried using only my own peak-gene linkages, but this returned very small networks that were not useful, so now I am using all conserved regulatory regions again. This gives me a similar ballpark to the number of regions specified in the vignette for the test dataset (~140,000). I have been constraining my gene set to include only genes that are in Variable Features I am using for my analysis (~3000 genes). So I don't think any of the issues I am having below can be overcome by constraining anything further, as discussed in a few different GitHub posts. For reference as well, I am using a mouse dataset (so I am using mouse-compatible PhastCons, PWMs, etc. as well as the Signac peak-to-gene linkage method, as I believe there is still an issue with the GREAT method here).

1) I have been using the default glm model to try to generate a network and predict modules in my dataset. I find that certain genes I know to be biologically truthfully correlated (via experimental KO validation; ie genes whose expression is lost when a TF is knocked out) come up as having relatively 'strong' correlations in my dataset (e.g. corr >0.15) with adjusted p values < 0.05 within the 'coef' table. But then when I move to the next step and find modules, these genes and correlations often do not make it into the module for that TF, even though the TF itself does appear as a module, and I am using what I would think would be low enough thresholds for those genes to be 'kept' (e.g. R sq > 0.05, p val < 0.05 cutoffs). Can you explain how modules are chosen based on the output of 'Infer GRN' and why genes that seem to have a relatively high correlation and p value below the cutoff I've specified in the function don't get included as part of a TF's module? Perhaps I am misunderstanding how those cutoffs are applied?

2) Given that I am losing some key genes that come up in the coef tables from my TF modules, I wondered if this might change if I used a different method in infer grn. I tried to run GREAT instead of Signac for peak-gene links, and it was still erroring for me, presumably because of the mouse-to-human issue described here I also tried running the Bagging Ridge method with Signac peak method, which I was interested to try as I noticed the p values are much smaller in the vignette when this method is used. However, I couldn't get this working for me - I am using reticulate and have my python environment set up such that scikit-learn does load and is called successfully, but I get an error that 'fitting model failed for x gene' for every single gene in my dataset one-by-one when running the code (it prints like 3000 lines that fitting model has failed for each gene as it runs through, not the error message 'fitting model failed for all genes'). Inspecting the traceback() showed that this might also be related to human vs. mouse incompatibility? Here is the specific error I received:

12: makeRestartList(...)
11: withRestarts({
    signalCondition(cond)
    defaultHandler(cond)
  }, muffleMessage = function() NULL)
10: message(paste0(...))
9: log_message("Warning: Fitting model failed for ", g, verbose = verbose)
8: FUN(X[[i]], ...)
7: lapply(X[Split[[i]]], FUN, ...)
6: pbapply::pblapply(X = x, FUN = fun)
5: map_par(features, function(g) {
    if (!g %in% rownames(peaks2gene)) {
      log_message("Warning: ", g, " not found in EnsDb", verbose = verbose == 
        2)
      return()
    }
    gene_peaks <- as.logical(peaks2gene[g, ])
    if (sum(gene_peaks) == 0) {
      log_message("Warning: No peaks found near ", g, verbose = verbose == 
        2)
      return()
    }
    g_x <- gene_data[gene_groups, g, drop = FALSE]
    peak_x <- peak_data[peak_groups, gene_peaks, drop = FALSE]
    peak_g_cor <- as(sparse_cor(peak_x, g_x), "generalMatrix")
    peak_g_cor[[is.na](http://is.na/)(peak_g_cor)] <- 0
    peaks_use <- rownames(peak_g_cor)[abs(peak_g_cor[, 1]) > 
      peak_cor]
    if (length(peaks_use) == 0) {
      log_message("Warning: No correlating peaks found for ", 
        g, verbose = verbose == 2)
      return()
    }
    peak_x <- peak_x[, peaks_use, drop = FALSE]
    peak_motifs <- peaks2motif[gene_peaks, , drop = FALSE][peaks_use, 
      , drop = FALSE]
    gene_peak_tfs <- map(rownames(peak_motifs), function(p) {
      x <- as.logical(peak_motifs[p, ])
      peak_tfs <- colMaxs(motif2tf[x, , drop = FALSE])
      peak_tfs <- colnames(motif2tf)[as.logical(peak_tfs)]
      peak_tfs <- setdiff(peak_tfs, g)
      return(peak_tfs)
    })
    names(gene_peak_tfs) <- rownames(peak_motifs)
    gene_tfs <- purrr::reduce(gene_peak_tfs, union)
    tf_x <- gene_data[gene_groups, gene_tfs, drop = FALSE]
    tf_g_cor <- as(sparse_cor(tf_x, g_x), "generalMatrix")
    tf_g_cor[[is.na](http://is.na/)(tf_g_cor)] <- 0
    tfs_use <- rownames(tf_g_cor)[abs(tf_g_cor[, 1]) > tf_cor]
    if (length(tfs_use) == 0) {
      log_message("Warning: No correlating TFs found for ", 
        g, verbose = verbose == 2)
      return()
    }
    tf_g_corr_df <- as_tibble(tf_g_cor[unique(tfs_use), , drop = F], 
      rownames = "tf", .name_repair = "check_unique") %>% rename(tf = 1, 
      corr = 2)
    frml_string <- map(names(gene_peak_tfs), function(p) {
      peak_tfs <- gene_peak_tfs[[p]]
      peak_tfs <- peak_tfs[peak_tfs %in% tfs_use]
      if (length(peak_tfs) == 0) {
        return()
      }
      peak_name <- str_replace_all(p, "-", "_")
      tf_name <- str_replace_all(peak_tfs, "-", "_")
      formula_str <- paste(paste(peak_name, interaction_term, 
        tf_name, sep = " "), collapse = " + ")
      return(list(tfs = peak_tfs, frml = formula_str))
    })
    frml_string <- frml_string[!map_lgl(frml_string, is.null)]
    if (length(frml_string) == 0) {
      log_message("Warning: No valid peak:TF pairs found for ", 
        g, verbose = verbose == 2)
      return()
    }
    target <- str_replace_all(g, "-", "_")
    model_frml <- as.formula(paste0(target, " ~ ", paste0(map(frml_string, 
      function(x) x$frml), collapse = " + ")))
    nfeats <- sum(map_dbl(frml_string, function(x) length(x$tfs)))
    gene_tfs <- purrr::reduce(map(frml_string, function(x) x$tfs), 
      union)
    gene_x <- gene_data[gene_groups, union(g, gene_tfs), drop = FALSE]
    model_mat <- as.data.frame(cbind(gene_x, peak_x))
    if (scale) 
      model_mat <- as.data.frame(scale(as.matrix(model_mat)))
    colnames(model_mat) <- str_replace_all(colnames(model_mat), 
      "-", "_")
    log_message("Fitting model with ", nfeats, " variables for ", 
      g, verbose = verbose == 2)
    result <- try(fit_model(model_frml, data = model_mat, method = method, 
      ...), silent = TRUE)
    if (any(class(result) == "try-error")) {
      log_message("Warning: Fitting model failed for ", g, 
        verbose = verbose)
      log_message(result, verbose = verbose == 2)
      return()
    }
    else {
      result$gof$nvariables <- nfeats
      result$corr <- tf_g_corr_df
      return(result)
    }
  }, verbose = verbose, parallel = parallel)
4: fit_grn_models.GRNData(object = object, genes = genes, network_name = network_name, 
    peak_to_gene_method = peak_to_gene_method, upstream = upstream, 
    downstream = downstream, extend = extend, only_tss = only_tss, 
    parallel = parallel, tf_cor = tf_cor, peak_cor = peak_cor, 
    aggregate_rna_col = aggregate_rna_col, aggregate_peaks_col = aggregate_peaks_col, 
    method = method, alpha = alpha, family = family, interaction_term = interaction_term, 
    adjust_method = adjust_method, scale = scale, verbose = verbose, 
    ...)
3: fit_grn_models(object = object, genes = genes, network_name = network_name, 
    peak_to_gene_method = peak_to_gene_method, upstream = upstream, 
    downstream = downstream, extend = extend, only_tss = only_tss, 
    parallel = parallel, tf_cor = tf_cor, peak_cor = peak_cor, 
    aggregate_rna_col = aggregate_rna_col, aggregate_peaks_col = aggregate_peaks_col, 
    method = method, alpha = alpha, family = family, interaction_term = interaction_term, 
    adjust_method = adjust_method, scale = scale, verbose = verbose, 
    ...)
2: infer_grn.GRNData(whole_so_use, model = "bagging_ridge", peak_to_gene_method = "Signac", 
    genes = expanded_vf_no_hs, upstream = 1e+05, downstream = 1e+05, 
    parallel = F, tf_cor = 0.05)
1: infer_grn(whole_so_use, model = "bagging_ridge", peak_to_gene_method = "Signac", 
    genes = expanded_vf_no_hs, upstream = 1e+05, downstream = 1e+05, 
    parallel = F, tf_cor = 0.05)

Is this because of human-to-mouse genome incompatibility, or potentially some other issue? And do you have any other suggestions on how I could try to capture more of the gene-TF pairs in my network within predicted modules for downstream analysis?

Thank you so much for your help with this and for providing this tool!

quadbio / Pando

Gene dropout from modules and errors for alternative infer_grn methods #66