Open Sandman-1 opened 1 year ago
Some details regarding my code include adapting the provided motif2tf data frame in Pando to decapitalize the letters of each transcription factor name after the first letter to make it compatible with BSgenome.Mmusculus.UCSC.mm10 and then subsetting the JASPAR2022 pfm object to only include transcription factors that are contained within the updated motif2tf data frame. I can edit my own motif2tf data frame object to keep transcription factors in singular fashion in the future so that I don't have to subset the pfm object in order to avoid the downstream colMaxs error, but I just wanted to keep things simple for this first iteration. Also, that last part of my code (tf network for Tcf7l2) does not even work because it's not listed as a transcription factor in the network graph (as shown in the screenshot). All this is really odd because my project revolves around showing the importance of this transcription factor in driving a lot of zonated hepatocyte activity.
Hi @Sandman-1, unfortunately I'm also not fully sure what the cause of your issues is. I'm aware that there are a number of possible pitfalls when adapting the builtin resources to other genomes. Also I would recommend constraining the input peaks and features somehow as this results in much better results in my experience. Cheers, J
@Sandman-1
Curious as to why the targets do not have gene names and have nomenclature like C6.... etc? I encountered the same problem.
Yeah that's quite odd, I haven't seen that before... could you maybe give me access to some example dataset where that happens? Could imagine that there is a problem with the motif to tf matching but not sure.
@joschif Can you help with a script that conforms to Homer or Jaspar or Cisbp motif datasets? chromVARmotifs::homer_pwms,chromVARmotifs::mouse_pwms_v2,chromVARmotifs::encode_pwms
I can try. Here's some code I used to convert JASPAR motifs to what is currently used by Pando, maybe its of use to you:
opts[['species']] <- 9606
opts[['all_versions']] <- T
collections <- c('CORE', 'CNE', 'PHYLOFACTS', 'SPLICE', 'POLII', 'FAM', 'PBM', 'PBM_HOMEO', 'PBM_HLH', 'UNVALIDATED')
names(collections) <- collections
motif_collection_list <- map(collections, function(c){
opts[['collection']] <- c
PFMatrixList <- getMatrixSet(JASPAR2020, opts)
motif2tf <- PFMatrixList@listData %>%
map_dfr(function(x){
tibble(
name = x@name,
collection = c,
tf = unlist(str_split(str_replace(x@name, '(.+)\\(.+\\)', '\\1'), '::')),
strand = x@strand,
symbol = ifelse(is.null(x@tags$symbol), '-', x@tags$symbol),
family = ifelse(is.null(x@tags$family), '-', x@tags$family)
)
}, .id='motif')
return(list(pfm=PFMatrixList, df=motif2tf))
})
motifs <- map(motif_collection_list, function(x) x$pfm) %>%
purrr::reduce(c)
motif2tf <- map_dfr(motif_collection_list, function(x) x$df, .id='collection')
@joschif
When running infer_grn() what can I enter for gene category because the target is still not showing a gene name when I run coef(muo_data).
target
should be the right column, if there are no correct gene names in there then I would assume that something went wrong prior to running infer_grn()
. I would need an example dataset that reproduces the problem to really figure out what's wrong though.
@joschif How can I share it? Can I upload rds file here?
I ran into the same issue with analyzing the mouse data and got very few TFs (40-50) with the same target gene named "F3". The analysis was pretty much the same as what @sylestiel did. and I prepared the motifs and motif2tf data as @joschif suggested above by changing the Taxonomy ID to 10090. Also tried the motifs from cisBP and got the same error. I am not sure whether this has anything to do with the inbuilt data sets such as the "EnsDb.Hsapiens.v93.annot.UCSC.hg38.RData" as I once got an error saying that genome(x) doesn't match genome(y) with one mm10 and one GRh38. But the error was somehow resolved after restarting R session. Any suggestions? Thanks a lot!
I rerun the infer_grn with
sobj_metacell <- infer_grn(sobj_metacell, peak_to_gene_method = 'GREAT', verbose = 2)
and got the following error message
Selecting candidate regulatory regions near genes
Preparing model input
Fitting models for 3783 target genes
...
Warning: Sorcs1 not found in EnsDb
Warning: Add3 not found in EnsDb
Warning: Rbm20 not found in EnsDb
Warning: Pdcd4 not found in EnsDb
...
only target genes like "F3", "C2", etc were successfully run.
I was not able to find the source code for infer_grn, but the "find_peaks_near_genes" function has the same log message
log_message('Selecting candidate regulatory regions near genes', verbose=verbose)
in the source code of "find_peaks_near_genes" function, there the inbuilt data "EnsDb.Hsapiens.v93.annot.UCSC.hg38" was used
else if (method == "GREAT") {
utils::data(EnsDb.Hsapiens.v93.annot.UCSC.hg38, envir = environment())
gene_annot_use <- EnsDb.Hsapiens.v93.annot.UCSC.hg38[which(EnsDb.Hsapiens.v93.annot.UCSC.hg38$gene_name %in%
genes$gene_name), ]
I am not sure whether this is also used in the infer_grn function. But as the genes with only one letter were successfully run, and those with more than one letter failed, (eg, F3 and F3 are the same in mouse and human, while Sorcs1 and SORCS1 are different in mouse and human), do you think this might be the problem? How would you suggest fixing it?
Thanks a lot!
Ah thanks, I think that might be the source of the problem. Have you tried running with peak_to_gene_method='Signac'
? The option 'GREAT' currently loads a custom version of the human EnsDB which makes it incompatible with other genomes. You can try to modify the function so that it accepts any ensdb as an argument, otherwise I would recommend just using the "Signac" method, which automatically uses the gene annotations provided in your Seurat object.
@joschif Thank you for the reply! I used the "Signac" method, but always got the following error
Selecting candidate regulatory regions near genes Preparing model input Fitting models for 3783 target genes Error in str2lang(x) :
:1:8: unexpected symbol 1: 1110015O18Rik ^
Do you have any suggestions on what might be the problem?
Yeah that's a known problem. The lm fromulas don't work with feature names starting with numbers. I haven't gotten around to fixing it, but I would suggest to filter these genes from the list of input features/TFs as they should be the minority and usually are not protein-coding genes anyway
Thanks a lot! After removing the genes starting with numbers, the code was successfully run and I got a lot of TFs and targeting genes!
I would still like to try the "GREAT" method. I created the mouse data as below
edb_m <- EnsDb.Mmusculus.v79
EnsDb.Mmusculus.v79.mm10 <- genes(edb_m,
columns = c(listColumns(edb_m, "tx"),
"gene_name", "gene_biotype"))
seqlevelsStyle(EnsDb.Mmusculus.v79.mm10) <- "UCSC"
and then changed the names of the elementmetadata as what you have in the human data.
did you also get the "EnsDb.Hsapiens.v93.annot.UCSC.hg38" data this way? If not, could you let me know how you did it? Thanks a lot!
@joschif @Sandman-1
How can I remove the Rik genes?
Wow, I missed a lot! Trying GRN Analysis using the Signac method then. Eager to see new results! :)))
@Sandman-1 @changwenwang09
Have either of you gotten it to work yet? If so, I could use some help here.
Thanks!
@sylestiel mine was done with this
params <- Params(sobj)
genes <- VariableFeatures(sobj, assay = params$rna_assay)
genes <- genes[!grepl("Rik", genes)]
hope it also works with your data :)
@changwenwang09
That was very helpful. It worked! Thank you very much!!!!
@changwenwang09 @Sandman-1
Did either of you get GREAT to work with mouse data?
Hello everybody, Thank you so much for your helpful information with your codes. @Sandman-1 @sylestiel, I am not very well in R, so Could you please provide me with the data('phastConsElements20Mammals.UCSC.mm10')? The Pando tool provided for the human, and when I tried to use: regions = StringToGRanges(Links(muo_data[['peaks']])$peak); it gave me a few linkPleaks:
I am using multiome data generated from mouse, so I tried a lot to solve this problem, but no success, so It would be helpful if you could provide me the mouse genome region to enable inferring the GRN.
I checked this website: http://hgdownload.cse.ucsc.edu/goldenpath/mm10/phastCons60way/ but failed to download through the linux command.
Thanks for your helpful with the forementioned codes in your comments.
Hi @Usamahussein551980
I had the exact same issues as you. Finally, I ended up with the following approach.
library(EnsDb.Mmusculus.v79)
DefaultAssay(muo_data) <- "ATAC"
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79)
seqlevelsStyle(annotations) <- 'UCSC'
Annotation(muo_data) <- annotations
Hope it helps!
Hi @sylestiel Thank you very much for your information you provided, it helped me a lot. I successfully got there, so I got a lot of TFs.
I am still curious to use 'GREAT' for peak_to_gene_metod.
Good afternoon! I have been sporadically tinkering with my GRN analysis for some time, and I am running into various problems. For starters, the number of unique target genes detected by Pando is less than 10 in my dataset, even though I have adjusted all parameters to allow for 0 correlation and a p value as insignificant as 1 for all steps. When I look at the modules, fit, and coefs slots in the SeuratPlus object, I am also seeing some transcription factors written as pure numbers without any names. Please see my code below and the attached screenshots. I don't know what's wrong, and I also don't know how to explain these problems clearly. However, I would appreciate any support and am willing to provide additional information.
`pfm <- getMatrixSet( x = JASPAR2022, opts = list(collection = "CORE", tax_group = 'vertebrates', all_versions = FALSE))
x <- character() for(i in 1:length(pfm@listData)){ x[i] <- pfm@listData[[i]]@name }
motif_2_tf <- data.frame(motif = names(pfm@listData), tf = x, origin = "JASPAR", gene_id = NA, name = NA, family = NA, symbol = NA, motif_tf = NA) data("motif2tf") motif2tf <- subset(motif2tf, motif %in% motif_2_tf$motif)
firstup <- function(x) { substr(x, 1, 1) <- toupper(substr(x, 1, 1)) substr(x, 2, nchar(x)) <- tolower(substr(x, 2, nchar(x))) x }
motif2tf$tf <- firstup(motif2tf$tf)
TKO_int_cont_hep <- readRDS("~/TKO Multiome/Final Seurat Objects/TKO Control Hepatocytes Filtered and Integrated.rds") TKO_int_cont_hep <- FindVariableFeatures(TKO_int_cont_hep, assay = "RNA", nfeatures = nrow(TKO_int_cont_hep@assays$RNA)) %>% FindTopFeatures(assay = "Peaks", min.cutoff = 'q0') %>% RunTFIDF(assay = "Peaks")
TKO_int_cont_hep <- initiate_grn(TKO_int_cont_hep, peak_assay = "Peaks", rna_assay = "RNA", exclude_exons = FALSE) %>% find_motifs(pfm = pfm, motif_tfs = motif2tf, genome = BSgenome.Mmusculus.UCSC.mm10) x <- TKO_int_cont_hep@grn@regions@motifs x <- x[,unique(motif2tf$motif)] TKO_int_cont_hep@grn@regions@motifs <- x TKO_int_cont_hep <- infer_grn(TKO_int_cont_hep, peak_to_gene_method = "GREAT", upstream = 1e+06, downstream = 1e+06, method = "xgb", tf_cor = 0, scale = TRUE, interaction_term = "*") %>% find_modules(p_thresh = 1, nvar_thresh = 1, min_genes_per_module = 1, rsq_thresh = 0) %>% get_network_graph(graph_name = "full_graph", umap_method = "none") %>% get_tf_network(graph = "full_graph", tf = "Tcf7l2")`