quadbio / Pando

Multiome GRN inference.
https://quadbio.github.io/Pando/
MIT License
106 stars 21 forks source link

Fitting model failed for all genes #1

Closed smorabit closed 2 years ago

smorabit commented 2 years ago

Hi,

I am very interested in using Pando to aid in the analysis of my paired snATAC-seq & snRNA-seq of the same samples. I am trying to follow along with the tutorial using my own data. However I am running into an issue when I run the infer_grn function where the model fails to fit on all genes.

Briefly, I have processed my snATAC-seq data using Signac, my snRNA-seq data using LIGER + Seurat. I integrated the data using Seurat's integration platform, and used the anchors to impute gene expression for the variable genes in the snATAC-seq data, so I have a Seurat object containing a peaks accessibility matrix & gene expression, which I believe is a suitable pre-requisite for running Pando.

Before running Pando, I subset my seurat object to contain only one major cell lineage that I am interested in investigating further (31,557 cells total). I ran the initiate_grn function, and then the find_motifs function as suggested.

I am wondering if you might have any idea why the model would fail for all genes? Below is some relevant code.


library(Pando)
library(BSgenome.Mmusculus.UCSC.mm10)
data(motifs)

seurat_celltype <- initiate_grn(
  seurat_celltype,
  peak_assay = 'peaks',
  rna_assay = 'RNA',
  exclude_exons = FALSE
)

seurat_celltype <- find_motifs(
    seurat_celltype ,
    pfm = motifs,
    genome = BSgenome.Mmusculus.UCSC.mm10
)

# this one fails for all genes!!
seurat_celltype <- infer_grn(
    seurat_celltype,
    peak_to_gene_method = 'Signac',
    method = 'glm',
    prallel=TRUE
)

Output:

Selecting candidate regulatory regions near genes
Preparing model input
Fitting models for 3087 target genes
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=12m 33s
Warning: Fitting model failed for all genes.

I am using exclude_exons = FALSE as it gives an error if I don't do that. However I think it is a problem separate from Pando, because a small handful of my peaks actually overlap each other, the length of "regions" and "peaks" end up different after the initiate_grn step.

I would greatly appreciate your help in trying to solve this issue!

Sam

joschif commented 2 years ago

Dear Sam,

sorry for the late reply. If you set verbose=2 in infer_grn(), it should print the actual errors of the fitting functions. It seems like I should make this the default. Maybe try this and see what you get.

Also, not quite sure if this is the cause of your problem, but there is a typo in you arguments to infer_grn(): prallel should be parallel.

To your second point, what is the error you see when setting exclude_exons = FALSE?

Cheers, Jonas

smorabit commented 2 years ago

It looks like when I run the verbose option I am getting a whole lot of these messages. Not sure why this would be happening?

Warning: No correlating TFs found for Olfr273
Warning: No correlating TFs found for Abca1
Warning: No correlating TFs found for Zfp462
Warning: No correlating TFs found for Klf4
Warning: No correlating TFs found for Frrs1l
Warning: No correlating TFs found for Txn1
Warning: No correlating TFs found for Svep1
Warning: No correlating TFs found for Musk
Warning: No correlating peaks found for Mup3
Warning: No correlating TFs found for Tnfsf15
Warning: No correlating TFs found for Tnfsf8
Warning: No correlating TFs found for Tnc
Warning: No correlating TFs found for Pappa
Warning: No correlating TFs found for Astn2
Warning: No correlating TFs found for Brinp1

Warning: No correlating TFs found for Rasef
Warning: No correlating TFs found for Frmd3
Warning: No correlating peaks found for Gm5860
Warning: No correlating TFs found for Frem1
Warning: No correlating TFs found for Bnc2
Warning: No correlating TFs found for Sh3gl2
Warning: No correlating TFs found for Plin2
Warning: No correlating TFs found for Acer2
Warning: No correlating TFs found for Elavl2
Warning: No correlating TFs found for Tek
Warning: No correlating TFs found for Cyp2j12
Warning: No correlating TFs found for Kank4
Warning: No correlating TFs found for Dock7
Warning: No correlating TFs found for Foxd3
Warning: No correlating TFs found for Ror1
Warning: No correlating TFs found for Cachd1
Warning: No correlating TFs found for Lepr
Warning: No correlating TFs found for Dab1
Warning: No correlating TFs found for Tmem61
Warning: No correlating TFs found for Acot11
Warning: No correlating TFs found for Glis1
Warning: No correlating TFs found for Elavl4
Warning: No correlating TFs found for Agbl4
Warning: No correlating TFs found for Trabd2b
Warning: No correlating TFs found for Foxd2os

For the other issue, I am actually seeing an error when exclude_exons = TRUE so I was doing exclude_exons = FALSE instead. Essentially I have called peaks on my data using the ArchR workflow, and I think that a small handful of peaks are actually overlapping one another, so when IRanges::intersect is called within initiate_grn on line 50 inregions.R, the dimensions don't match from the resulting ranges don't match the peaks attributes. I am not sure if Pando should necessarily account for this since it seems like an edge case that might not pop up very often?

regions <- NetworkRegions(seurat_celltype)
regions@ranges %>% length 
regions@peaks %>% length
joschif commented 2 years ago

Hi Sam, thanks for the update.

Concerning the first issue, infer_grn() has an argument tf_cor, where the default is 0.1 to prefilter TFs with some correlation to the target genes expression. You could set that to 0 and see what you get. If the messages persist, we might have to do some troubleshooting.

As for the second issue, it's quite possible that your diagnosis is right. I did not really have overlapping peaks in mind when I wrote it, but I'll look into it :)

smorabit commented 2 years ago

Okay I tried using tf_cor = 0, but I am still getting this at the end: Warning: Fitting model failed for all genes. So looking at line 202 of grn.R makes me think that all of the correlations are 0 somehow?

Edit:

I noticed that one of the first things that is done in the infer_grn function is extracting some information about the motifs via NetworkTFs(seurat_object). For my data this object is of shape 1590 by 0. Could this be the problem? In the Pando tutorial the find_motifs function is called the following way:

seurat_object <- find_motifs(
    seurat_object,
    pfm = motifs,
    genome = BSgenome.Hsapiens.UCSC.hg38
)

I am working with mouse data so all I did was this:


seurat_celltype <- find_motifs(
    seurat_object ,
    pfm = motifs,
    genome = BSgenome.Mmusculus.UCSC.mm10
)

Perhaps there was some issue here where the data should not be of shape 1590 motifs by 0?

joschif commented 2 years ago

Ah thanks for the info, I think I have a suspicion what could have gone wrong. The function find_motifs() takes an optional argument motif_tfs where one can provide a dataframe mapping motifs to their respective TFs. If left empty, the function simply defaults to a manually curated one for human TFs that is provided with Pando. However since your dataset is mouse, none of the TFs will be found in the dataset and the resulting matrix will be empty. This then leads to the Warning: No correlating TFs found for ... warning, because I did not consider that the matrix could already be empty before filtering by correlation. I added a new error to check for this earlier.

To fix it, you can try to get a motif and TF set for mouse, for instance from JASPAR. find_motifs() takes as pfm a PFMatrixList as returned e.g. by the TFBSTools function getMatrixSet(). Additionally, you would need to supply a dataframe in motif_tfs with the first column being the motif name and the second being the TF name.

I hope this helps, let me know how it goes :)

smorabit commented 2 years ago

JASPAR seems to have much fewer motifs for mouse, so what I ended up doing was use the human list but only keep genes with a one to one ortholog across mouse & human. Then I made the motif_tfs dataframe as you suggested and ran the find_motifs function.

pfm <- getMatrixSet(
  x = JASPAR2020,
  opts = list(species = 9606, all_versions = FALSE)
)
seurat_celltype <- find_motifs(
    seurat_celltype ,
    pfm = pfm,
    genome = BSgenome.Mmusculus.UCSC.mm10,
    motif_tfs = motif_tfs
)

This seems to work, the code takes a lot longer to run and I am not getting the No correlating TFs error. However I did run into a subsequent error after the code had already been running for a very long time.

Error in { : task 34 failed - "<text>:1:8: unexpected symbol
1: 2610316D01Rik
           ^"

There seems to be an issue with a certain gene. The gene is present in the data but maybe it somehow doesn't like the way it's formatted? No idea. Thank you so much for all of your help so far.

Edit:

I removed all of the "Rik" genes from my dataset, and infer_grn was able to run successfully! I went on to run find_modules which worked as well. Thank you for helping me resolve these issues.

joschif commented 2 years ago

Ok great, then at least we solved that problem. I'm not quite sure what the issue with the gene is 🤔 . Do you know at which stage it fails? The function should print what is does like Adding TF info, Building motif matrix etc. Otherwise, glad it worked out in the end :) Can I close the issue?

ejs18 commented 2 years ago

Hello,

I am trying to run Pando on Drosophila data and am running into a similar problem above with the "unexpected symbol" error when running infer_grn(). The error message I get comes after the "Fitting models for x genes" message and says, for example,

Error in str2lang(x) : :1:11: unexpected symbol 1: Myo81F ~ 3R_1268406_1269303 ^

If I specify certain genes in infer_grn(), some work and some don't, with no discernable pattern as to characters in gene name, location of the gene in the genome, whether the gene is a variable feature, or anything else I think of. For example, Myc, gfzf, msps, and Osi24 do not return this error, but twi, stumps, sna, and Doc1 do.

Did you ever solve the problem of this error, or have any ideas what I could look into to resolve it?

Thank you!

zhiyhu commented 1 year ago

I encountered the same error. In my case it is:

Error in str2lang(x) : <text>:1:11: unexpected input
1: myl10 ~ 10_
                        ^

After going through the infer_grn() function line by line, this error comes out from the following lines from function fit_grn_models.SeuratPlus.

    model_frml <- as.formula(
      paste0(target, ' ~  ', paste0(map(frml_string, function(x) x$frml),  collapse=' + '))
    )

The cause is likely to be due to the fact that the peak names start with integer but not character. Renaming these peak names at the very beginning (they cannot be changed in Seurat object at the moment) will probably resolve this

GouQiao commented 12 months ago

Hi , I am also using mouse data. COuld you explain how to generate motif_tfs dataframe ?