quadbio / Pando

Multiome GRN inference.
https://quadbio.github.io/Pando/
MIT License
106 stars 21 forks source link

Error "attempt to select less than one element in integerOneIndex" #18

Closed lingfeiwang closed 1 year ago

lingfeiwang commented 1 year ago

Hi there. Thank you for the nice tool.

I tried to run version 1.0.1 on an independent dataset following your tutorial, but got the error below. How can I solve that please?

> # Combining https://stuartlab.org/signac/articles/pbmc_multiomic.html and https://quadbiolab.github.io/Pando/index.html
> library(Signac)
> library(Seurat)
Attaching SeuratObject
> library(EnsDb.Hsapiens.v86)
Loading required package: ensembldb
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: AnnotationFilter

Attaching package: 'ensembldb'

The following object is masked from 'package:stats':

    filter

> library(BSgenome.Hsapiens.UCSC.hg38)
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector

Attaching package: 'Biostrings'

The following object is masked from 'package:base':

    strsplit

Loading required package: rtracklayer
> library(Pando)
> 
> set.seed(1234)
> counts <- Read10X_h5("pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")
Genome matrix has multiple modalities, returning a list of matrices for this genome
> counts$Peaks=counts$Peaks[startsWith(rownames(counts$Peaks),"chr"),]
> fragpath <- "pbmc_granulocyte_sorted_10k_atac_fragments_filtered.tsv.gz"
> annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
There were 24 warnings (use warnings() to see them)
> pbmc <- CreateSeuratObject(
+   counts = counts$`Gene Expression`,
+   assay = "RNA"
+ )
> atac=CreateChromatinAssay(
+   counts = counts$Peaks,
+   sep = c(":", "-"),
+   fragments = fragpath,
+   annotation = annotation
+ )
Computing hash
Checking for 11909 cell barcodes
> pbmc[["peaks"]]=atac
> s=initiate_grn(pbmc)
Warning message:
In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)
> s=find_motifs(s,pfm=motifs,genome=BSgenome.Hsapiens.UCSC.hg38)
Adding TF info

Building motif matrix
Finding motif positions
Creating Motif object
Warning messages:
1: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrM, chr1_GL383518v1_alt, chr1_GL383519v1_alt, chr1_GL383520v2_alt, chr1_KI270759v1_alt, chr1_KI270760v1_alt, chr1_KI270761v1_alt, chr1_KI270762v1_alt, chr1_KI270763v1_alt, chr1_KI270764v1_alt, chr1_KI270765v1_alt, chr1_KI270766v1_alt, chr1_KI270892v1_alt, chr2_GL383521v1_alt, chr2_GL383522v1_alt, chr2_GL582966v2_alt, chr2_KI270767v1_alt, chr2_KI270768v1_alt, chr2_KI270769v1_alt, chr2_KI270770v1_alt, chr2_KI270771v1_alt, chr2_KI270772v1_alt, chr2_KI270773v1_alt, chr2_KI270774v1_alt, chr2_KI270775v1_alt, chr2_KI270776v1_alt, chr2_KI270893v1_alt, chr2_KI270894v1_alt, chr3_GL383526v1_alt, chr3_JH636055v2_alt, chr3_KI270777v1_alt, chr3_KI270778v1_alt, chr3_KI270779v1_alt, chr3_KI270780v1_alt, chr3_KI270781v1_alt, chr3_KI270782v1_alt, chr3_KI270783v1_alt, chr3_KI270784v1_alt, chr3_KI270895v1_alt, chr3_KI270924v1_alt, chr3_KI270934v1_alt, chr3_KI270935v1_alt, chr3_KI270936v1_alt, chr3_KI270937v1_alt, chr4_GL000 [... truncated]
2: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrM, chr1_GL383518v1_alt, chr1_GL383519v1_alt, chr1_GL383520v2_alt, chr1_KI270759v1_alt, chr1_KI270760v1_alt, chr1_KI270761v1_alt, chr1_KI270762v1_alt, chr1_KI270763v1_alt, chr1_KI270764v1_alt, chr1_KI270765v1_alt, chr1_KI270766v1_alt, chr1_KI270892v1_alt, chr2_GL383521v1_alt, chr2_GL383522v1_alt, chr2_GL582966v2_alt, chr2_KI270767v1_alt, chr2_KI270768v1_alt, chr2_KI270769v1_alt, chr2_KI270770v1_alt, chr2_KI270771v1_alt, chr2_KI270772v1_alt, chr2_KI270773v1_alt, chr2_KI270774v1_alt, chr2_KI270775v1_alt, chr2_KI270776v1_alt, chr2_KI270893v1_alt, chr2_KI270894v1_alt, chr3_GL383526v1_alt, chr3_JH636055v2_alt, chr3_KI270777v1_alt, chr3_KI270778v1_alt, chr3_KI270779v1_alt, chr3_KI270780v1_alt, chr3_KI270781v1_alt, chr3_KI270782v1_alt, chr3_KI270783v1_alt, chr3_KI270784v1_alt, chr3_KI270895v1_alt, chr3_KI270924v1_alt, chr3_KI270934v1_alt, chr3_KI270935v1_alt, chr3_KI270936v1_alt, chr3_KI270937v1_alt, chr4_GL000 [... truncated]
> s=infer_grn(s,peak_to_gene_method='Signac',method='glm')
Selecting candidate regulatory regions near genes
Error in .subset2(x, i, exact = exact) : 
  attempt to select less than one element in integerOneIndex
Calls: infer_grn ... as.factor -> is.factor -> [[ -> [[.data.frame -> <Anonymous>
Execution halted
lingfeiwang commented 1 year ago

Also, how do I infer stage- or branch-specific gene regulatory networks as in your publication? I did not find this information in your tutorial.

joschif commented 1 year ago

The PBMC data you are using should be publically accessible somewhere, right? Could you point me to it then I can try to replicate the error.

lingfeiwang commented 1 year ago

Sure. It's in https://stuartlab.org/signac/articles/pbmc_multiomic.html .

wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi
joschif commented 1 year ago

Hi @lingfeiwang, I believe the error you are seeing stems from the fact that your peaks and the gene annotations have different sequence annotations. You can solve this by setting

seqlevelsStyle(annotation) <- 'UCSC'

before creating the chromatin assay.

lingfeiwang commented 1 year ago

Thanks, joschif!

The warning messages are gone but the error is still there:

R version 4.1.3 (2022-03-10) -- "One Push-Up"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-conda-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> # Combining https://stuartlab.org/signac/articles/pbmc_multiomic.html and https://quadbiolab.github.io/Pando/index.html
> library(Signac)
> library(Seurat)
Attaching SeuratObject
> library(EnsDb.Hsapiens.v86)
Loading required package: ensembldb
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: AnnotationFilter

Attaching package: 'ensembldb'

The following object is masked from 'package:stats':

    filter

> library(BSgenome.Hsapiens.UCSC.hg38)
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector

Attaching package: 'Biostrings'

The following object is masked from 'package:base':

    strsplit

Loading required package: rtracklayer
> library(Pando)
> 
> set.seed(1234)
> counts <- Read10X_h5("pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")
Genome matrix has multiple modalities, returning a list of matrices for this genome
> counts$Peaks=counts$Peaks[startsWith(rownames(counts$Peaks),"chr"),]
> fragpath <- "pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz"
> annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
There were 24 warnings (use warnings() to see them)
> seqlevelsStyle(annotation) <- 'UCSC'
> pbmc <- CreateSeuratObject(
+   counts = counts$`Gene Expression`,
+   assay = "RNA"
+ )
> atac=CreateChromatinAssay(
+   counts = counts$Peaks,
+   sep = c(":", "-"),
+   fragments = fragpath,
+   annotation = annotation
+ )
Computing hash
Checking for 11909 cell barcodes
> pbmc[["peaks"]]=atac
> s=initiate_grn(pbmc)
> s=find_motifs(s,pfm=motifs,genome=BSgenome.Hsapiens.UCSC.hg38)
Adding TF info

Building motif matrix
Finding motif positions
Creating Motif object
> s=infer_grn(s,peak_to_gene_method='Signac',method='glm')
Selecting candidate regulatory regions near genes
Error in .subset2(x, i, exact = exact) : 
  attempt to select less than one element in integerOneIndex
Calls: infer_grn ... as.factor -> is.factor -> [[ -> [[.data.frame -> <Anonymous>
Execution halted
joschif commented 1 year ago

Hm, for me it runs fine on the same dataset. Could you maybe send the full traceback (traceback()) of the error?

lingfeiwang commented 1 year ago

Sure! Here it is.

> traceback()
13: (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, 
        i, exact = exact))(x, ..., exact = exact)
12: `[[.data.frame`(args, i)
11: args[[i]]
10: is.factor(x)
9: as.factor(args[[i]])
8: interaction(groupings2, sep = "_")
7: data.frame(interaction(groupings2, sep = "_"))
6: fast_aggregate(x = x, groupings = groups, fun = fun)
5: aggregate_matrix(t(peaks_near_gene), groups = colnames(peaks_near_gene), 
       fun = "sum")
4: fit_grn_models.SeuratPlus(object = object, genes = genes, network_name = network_name, 
       peak_to_gene_method = peak_to_gene_method, upstream = upstream, 
       downstream = downstream, extend = extend, only_tss = only_tss, 
       parallel = parallel, tf_cor = tf_cor, peak_cor = peak_cor, 
       aggregate_rna_col = aggregate_rna_col, aggregate_peaks_col = aggregate_peaks_col, 
       method = method, alpha = alpha, family = family, interaction_term = interaction_term, 
       adjust_method = adjust_method, scale = scale, verbose = verbose, 
       ...)
3: fit_grn_models(object = object, genes = genes, network_name = network_name, 
       peak_to_gene_method = peak_to_gene_method, upstream = upstream, 
       downstream = downstream, extend = extend, only_tss = only_tss, 
       parallel = parallel, tf_cor = tf_cor, peak_cor = peak_cor, 
       aggregate_rna_col = aggregate_rna_col, aggregate_peaks_col = aggregate_peaks_col, 
       method = method, alpha = alpha, family = family, interaction_term = interaction_term, 
       adjust_method = adjust_method, scale = scale, verbose = verbose, 
       ...)
2: infer_grn.SeuratPlus(s, peak_to_gene_method = "Signac", method = "glm")
1: infer_grn(s, peak_to_gene_method = "Signac", method = "glm")
joschif commented 1 year ago

I believe the issue could be that you did not give any genes for GRN inference and also did not run FindVariableFeatures() before, which would be the default gene set Pando uses. Can you run this code and tell me if it works for you (for me it does):

counts <- Read10X_h5('../data/pbmc/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5')
counts$Peaks=counts$Peaks[startsWith(rownames(counts$Peaks),'chr'),]

fragpath <- '../data/pbmc/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz'
annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotation) <- 'UCSC'

pbmc <- CreateSeuratObject(counts = counts$`Gene Expression`,assay = 'RNA')
atac = CreateChromatinAssay(
    counts = counts$Peaks,
    sep = c(':', '-'),
    fragments = fragpath,
    annotation = annotation
)
pbmc[['peaks']] = atac

pbmc_grn = initiate_grn(pbmc)
pbmc_grn = find_motifs(pbmc_grn, pfm=motifs, genome=BSgenome.Hsapiens.UCSC.hg38)

pbmc_grn <- FindVariableFeatures(pbmc_grn, nfeatures=100)

pbmc_grn = inferd_grn(pbmc_grn, genes=VariableFeatures(pbmc_grn), parallel=F, verbose = 2)
pbmc_grn = find_modules(pbmc_grn)
lingfeiwang commented 1 year ago

Thanks. You code above indeed runs without error.

However, I'm trying to obtain a network for all target genes instead of the top 100. That's why I removed the genes parameter in the first place. Based your code here, I had another version but it came with the following error. Does pando support all target genes and how would you achieve that?

> s=Seurat::FindVariableFeatures(s,assay='RNA',nfeatures=99999)
Calculating gene variances
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> s=infer_grn(s,peak_to_gene_method='Signac',method='glm', genes=VariableFeatures(s))
Selecting candidate regulatory regions near genes
Preparing model input
Fitting models for 18140 target genes
  |+++++                                             | 10% ~01d 00h 57m 01sError in .local(x, i, j, ..., value) : 
  not-yet-implemented 'Matrix[<-' method
> traceback()
12: stop("not-yet-implemented 'Matrix[<-' method")
11: .local(x, i, j, ..., value)
10: `[<-`(`*tmp*`, is.na(tf_g_cor), value = 0)
9: `[<-`(`*tmp*`, is.na(tf_g_cor), value = 0)
8: FUN(X[[i]], ...)
7: lapply(X[Split[[i]]], FUN, ...)
6: pbapply::pblapply(X = x, FUN = fun)
5: map_par(features, function(g) {
       if (!g %in% rownames(peaks2gene)) {
           log_message("Warning: ", g, " not found in EnsDb", verbose = verbose == 
               2)
           return()
       }
       gene_peaks <- as.logical(peaks2gene[g, ])
       if (sum(gene_peaks) == 0) {
           log_message("Warning: No peaks found near ", g, verbose = verbose == 
               2)
           return()
       }
       g_x <- gene_data[gene_groups, g, drop = FALSE]
       peak_x <- peak_data[peak_groups, gene_peaks, drop = FALSE]
       peak_g_cor <- as(sparse_cor(peak_x, g_x), "generalMatrix")
       peak_g_cor[is.na(peak_g_cor)] <- 0
       peaks_use <- rownames(peak_g_cor)[abs(peak_g_cor[, 1]) > 
           peak_cor]
       if (length(peaks_use) == 0) {
           log_message("Warning: No correlating peaks found for ", 
               g, verbose = verbose == 2)
           return()
       }
       peak_x <- peak_x[, peaks_use, drop = FALSE]
       peak_motifs <- peaks2motif[gene_peaks, , drop = FALSE][peaks_use, 
           , drop = FALSE]
       gene_peak_tfs <- map(rownames(peak_motifs), function(p) {
           x <- as.logical(peak_motifs[p, ])
           peak_tfs <- colMaxs(motif2tf[x, , drop = FALSE])
           peak_tfs <- colnames(motif2tf)[as.logical(peak_tfs)]
           peak_tfs <- setdiff(peak_tfs, g)
           return(peak_tfs)
       })
       names(gene_peak_tfs) <- rownames(peak_motifs)
       gene_tfs <- purrr::reduce(gene_peak_tfs, union)
       tf_x <- gene_data[gene_groups, gene_tfs, drop = FALSE]
       tf_g_cor <- sparse_cor(tf_x, g_x)
       tf_g_cor[is.na(tf_g_cor)] <- 0
       tfs_use <- rownames(tf_g_cor)[abs(tf_g_cor[, 1]) > tf_cor]
       if (length(tfs_use) == 0) {
           log_message("Warning: No correlating TFs found for ", 
               g, verbose = verbose == 2)
           return()
       }
       tf_g_corr_df <- as_tibble(tf_g_cor[unique(tfs_use), , drop = F], 
           rownames = "tf", .name_repair = "check_unique") %>% rename(tf = 1, 
           corr = 2)
       frml_string <- map(names(gene_peak_tfs), function(p) {
           peak_tfs <- gene_peak_tfs[[p]]
           peak_tfs <- peak_tfs[peak_tfs %in% tfs_use]
           if (length(peak_tfs) == 0) {
               return()
           }
           peak_name <- str_replace_all(p, "-", "_")
           tf_name <- str_replace_all(peak_tfs, "-", "_")
           formula_str <- paste(paste(peak_name, interaction_term, 
               tf_name, sep = " "), collapse = " + ")
           return(list(tfs = peak_tfs, frml = formula_str))
       })
       frml_string <- frml_string[!map_lgl(frml_string, is.null)]
       if (length(frml_string) == 0) {
           log_message("Warning: No valid peak:TF pairs found for ", 
               g, verbose = verbose == 2)
           return()
       }
       target <- str_replace_all(g, "-", "_")
       model_frml <- as.formula(paste0(target, " ~ ", paste0(map(frml_string, 
           function(x) x$frml), collapse = " + ")))
       nfeats <- sum(map_dbl(frml_string, function(x) length(x$tfs)))
       gene_tfs <- purrr::reduce(map(frml_string, function(x) x$tfs), 
           union)
       gene_x <- gene_data[gene_groups, union(g, gene_tfs), drop = FALSE]
       model_mat <- as.data.frame(cbind(gene_x, peak_x))
       if (scale) 
           model_mat <- as.data.frame(scale(as.matrix(model_mat)))
       colnames(model_mat) <- str_replace_all(colnames(model_mat), 
           "-", "_")
       log_message("Fitting model with ", nfeats, " variables for ", 
           g, verbose = verbose == 2)
       result <- try(fit_model(model_frml, data = model_mat, method = method, 
           ...), silent = TRUE)
       if (any(class(result) == "try-error")) {
           log_message("Warning: Fitting model failed for ", g, 
               verbose = verbose)
           log_message(result, verbose = verbose == 2)
           return()
       }
       else {
           result$gof$nvariables <- nfeats
           result$corr <- tf_g_corr_df
           return(result)
       }
   }, verbose = verbose, parallel = parallel)
4: fit_grn_models.SeuratPlus(object = object, genes = genes, network_name = network_name, 
       peak_to_gene_method = peak_to_gene_method, upstream = upstream, 
       downstream = downstream, extend = extend, only_tss = only_tss, 
       parallel = parallel, tf_cor = tf_cor, peak_cor = peak_cor, 
       aggregate_rna_col = aggregate_rna_col, aggregate_peaks_col = aggregate_peaks_col, 
       method = method, alpha = alpha, family = family, interaction_term = interaction_term, 
       adjust_method = adjust_method, scale = scale, verbose = verbose, 
       ...)
3: fit_grn_models(object = object, genes = genes, network_name = network_name, 
       peak_to_gene_method = peak_to_gene_method, upstream = upstream, 
       downstream = downstream, extend = extend, only_tss = only_tss, 
       parallel = parallel, tf_cor = tf_cor, peak_cor = peak_cor, 
       aggregate_rna_col = aggregate_rna_col, aggregate_peaks_col = aggregate_peaks_col, 
       method = method, alpha = alpha, family = family, interaction_term = interaction_term, 
       adjust_method = adjust_method, scale = scale, verbose = verbose, 
       ...)
2: infer_grn.SeuratPlus(s, peak_to_gene_method = "Signac", method = "glm", 
       genes = VariableFeatures(s))
1: infer_grn(s, peak_to_gene_method = "Signac", method = "glm", 
       genes = VariableFeatures(s))
joschif commented 1 year ago

With this error, I guessing you might be running an older version of Pando and it could help to upgrade. To control the set of genes used for GRN inference you can use the genes argument to infer_grn()

lingfeiwang commented 1 year ago

A followup question: Pando takes a long time to run for all target genes. We are already using 128 cores. Do you have any suggestions on how accelerate the computation?

joschif commented 1 year ago

Yeah if you are using all genes then the runtimes can be extremely long. I guess my recommendation would be to use fewer genes or fewer peak regions in the input, or both

lingfeiwang commented 1 year ago

I see. Thanks!