quadbio / Pando

Multiome GRN inference.
https://quadbio.github.io/Pando/
MIT License
106 stars 21 forks source link

infer_grn error: invalid dimnames given for "dgCMatrix" object #5

Closed SeppeDeWinter closed 8 months ago

SeppeDeWinter commented 2 years ago

Hi

I was running the infer_grn function with a seurat object containing scRNA-seq and scATAC-seq data, both matrices are stored as spare matrices.

While running the function:

seurat_object_all_peak <- infer_grn(
                                seurat_object_all_peak,
                                peak_to_gene_method = 'Signac',
                                method = 'glm',
                                upstream = 150000,
                                downstream = 150000,
                                parallel = TRUE)

I get following error:

Selecting candidate regulatory regions near genes
Error in dimnamesGets(x, value) :
  invalid dimnames given for "dgCMatrix" object

This exception is thrown by this line of code: https://github.com/quadbiolab/Pando/blob/380a8eabea2314d90fa194d927deb8517ebc5655/R/grn.R#L131

SeppeDeWinter commented 2 years ago

Looking a bit further into the code, the dimension do indeed not match as the error suggested:

length(colnames(peak_data))
[1] 654389

length(rownames(regions@motifs@data))
[1] 653836
SeppeDeWinter commented 2 years ago

Digging even further into it we found out that we don't get the error when exclude_exons = FALSE and regions = NULL, in initiate_grn.

Basically when there is a one-to-one correspondence between cand_ranges and peak_ranges in: https://github.com/quadbiolab/Pando/blob/503c20299ec719c6bac79325deb2b8fc55da984b/R/regions.R#L59

SeppeDeWinter commented 2 years ago

I eventually solved this issue in the following way,

regions <- NetworkRegions(seurat_obj)
cand_ranges <- regions@ranges
peak_ranges <- StringToGRanges(rownames(GetAssay(seurat_obj, assay='ATAC')))
peak_overlaps <- findOverlaps(cand_ranges, peak_ranges)

peak_overlaps_df <- data.frame(peak_overlaps)

peak_overlaps_df <- peak_overlaps_df[!duplicated(peak_overlaps_df[['queryHits']]), ]
peak_overlaps_df <- peak_overlaps_df[!duplicated(peak_overlaps_df[['subjectHits']]), ]

new_motif_data <- regions@motifs@data[peak_overlaps_df[['queryHits']], ]
new_regions_peaks <- peak_overlaps_df[['subjectHits']]
new_ranges_data <- regions@ranges[peak_overlaps_df[['queryHits']]]

seurat_obj@grn@regions@motifs@data <- new_motif_data
seurat_obj@grn@regions@peaks <- new_regions_peaks
seurat_obj@grn@regions@ranges <- new_ranges_data

However this might cause potentially important regions to be lost. Any feedback on this?

joschif commented 2 years ago

Hi @SeppeDeWinter, sorry for the late response, I was on holiday last week. Thanks for your detailed description of the problem! I'll look into it and get back to you.

SeppeDeWinter commented 2 years ago

@joschif

Thank you for looking into it.

Do you have any estimate on the running time of Pando? The infer_grn function has now been running for 9 days already...

joschif commented 2 years ago

Hi @SeppeDeWinter, I looked into it a bit but unfortunately I'm not able to reproduce the error. For me it works even when excluding exons or subsetting to condidate regions. How did you call the peaks in your dataset? I think some callers give back overlapping peaks, which have produced a similar error in the past if I remember correctly. Merging overlapping peaks before running Pando solved the issue in the past.

Otherwise, could you maybe provide me with a minimal dataset that reproduces the error?

As for running time, which model are you using and what is roughly the size (ncells, npeaks, npeaks considered by Pando) of your dataset? I think for the basic models (glm, glmnet) our running times were on the order of hours (1-6) most of the time, even for reasonably large (~100k cells) datasets (using somewhere between 20-50 cores). However we've noticed that e.g. the brms models take a lot longer to fit, making them basically unfeasible unless you are radically constraining the considered feature set.

joschif commented 2 years ago

E.g. here overlapping peaks returned from ArchR turned out to be an issue.

This is hard to fix from within Pando, since it makes the accessibility of some sites ambiguous. I would therefore suggest merging the peaks before running Pando.

SeppeDeWinter commented 2 years ago

Hi @joschif

Thanks for looking into it.

We use a custom peak caller which already merges overlapping peaks. To confirm this I ran the following on the input data.

overlaps <- findOverlaps(seurat_obj@assays$ATAC@ranges, seurat_obj@assays$ATAC@ranges)
> all(data.frame(overlaps)[['queryHits']] == data.frame(overlaps)[['subjectHits']])
[1] TRUE

As you can see, our input peaks do not overlap (given that all regions returned by the findOverlaps function have a one to one correspondence).

The multi-mapping regions in fact occur due to the intersections that are made with the screen regions (SCREEN.ccRE.UCSC.hg38) within the initiate_grn.Seurat function of the package. i.e. one region in our input set may overlap with multiple screen regions and/or one region in the screen database might overlap with multiple regions in our dataset. I illustrate this below with code from the initiate_grn.Seurat function.

#set minimal arguments
object <- seurat_obj
regions <- SCREEN.ccRE.UCSC.hg38
peak_assay <- 'ATAC'
rna_assay <- 'RNA'

gene_annot <- Signac::Annotation(object[[peak_assay]])
peak_ranges <- StringToGRanges(rownames(GetAssay(object, assay=peak_assay)))

#regions is not null
cand_ranges <- IRanges::intersect(regions, peak_ranges)

#exclude_exons is set to TRUE
exon_ranges <- gene_annot[gene_annot$type=='exon', ]
names(exon_ranges@ranges) <- NULL
exon_ranges <- IRanges::intersect(exon_ranges, exon_ranges)
exon_ranges <- GenomicRanges::GRanges(
    seqnames = exon_ranges@seqnames,
    ranges = exon_ranges@ranges
)
cand_ranges <- IRanges::setdiff(cand_ranges, exon_ranges, ignore.strand=TRUE)

peak_overlaps <- findOverlaps(cand_ranges, peak_ranges)

all(data.frame(peak_overlaps)[['queryHits']] == data.frame(peak_overlaps)[['subjectHits']])
[1] FALSE

In this case there is not a one-to-one correspondence for each regions (hence the test returns FALSE). See below for an example of this:

> data.frame(peak_overlaps)[data.frame(peak_overlaps)$queryHits == 1325, ]
     queryHits subjectHits
1325      1325      149003
1326      1325      493706

It is these multi-mapping regions which cause the downstream issues as I mentioned above (https://github.com/quadbiolab/Pando/issues/5#issuecomment-1013832624).

As for the size of our dataset.

> seurat_obj@assays$RNA
Assay data with 56305 features for 4000 cells
Top 10 variable features:
 IGHM, AHSG, AFP, TF, COL3A1, DCN, SERPINA1, GPC3, ITIH2, GREM1

> seurat_obj@assays$ATAC
ChromatinAssay data with 642128 features for 4000 cells
Variable features: 635854

We are using the glm method.

joschif commented 2 years ago

Hi @SeppeDeWinter, thanks for the detailed example! The fact that there is not a 1-to-1 correspondence between peaks and candidate regions (should be 1-to-many) is something Pando should account for, or at least it is not unexpected behaviour. In my use cases this is almost always the case and so far it has not lead to a problem. Could you maybe provide the output of the following dims of an object where the error occurs when running infer_grn(), after running initiate_grn() and find_motifs():

regions <- NetworkRegions(seurat_obj)
length(regions@peaks)
length(unique(regions@peaks))
dim(regions@motifs@data)
dim(seurat_obj@assays$ATAC)

and

peak_ranges <- StringToGRanges(rownames(GetAssay(seurat_obj, assay='ATAC')))
cand_ranges <- regions@ranges
peak_overlaps <- findOverlaps(cand_ranges, peak_ranges)
length(intersect(peak_ranges, peak_ranges))
length(queryHits(peak_overlaps))
length(unique(queryHits(peak_overlaps)))
SeppeDeWinter commented 2 years ago

Hi,

Thank you for the quick response.

See output below.

> regions <- NetworkRegions(seurat_obj)
> length(regions@peaks)
[1] 476023
> length(unique(regions@peaks))
[1] 351073
> dim(regions@motifs@data)
[1] 475879   1590
> dim(seurat_obj@assays$ATAC)
[1] 642128   4000

and

> peak_ranges <- StringToGRanges(rownames(GetAssay(seurat_obj, assay='ATAC')))
> cand_ranges <- regions@ranges
> peak_overlaps <- findOverlaps(cand_ranges, peak_ranges)
> length(intersect(peak_ranges, peak_ranges))
[1] 641498
> length(queryHits(peak_overlaps))
[1] 476023
> length(unique(queryHits(peak_overlaps)))
[1] 475879

When I run

> seurat_obj <- infer_grn(
                    seurat_obj,
                    peak_to_gene_method = 'Signac',
                    method = 'glm',
                    upstream = 150000,
                    downstream = 150000,
                    parallel = TRUE)
Selecting candidate regulatory regions near genes
Error in dimnamesGets(x, value) :
  invalid dimnames given for "dgCMatrix" object

see: https://github.com/quadbiolab/Pando/issues/5#issue-1104986752

joschif commented 2 years ago

Good Morning, thanks for the Info! From the fact that length(intersect(peak_ranges, peak_ranges)) < dim(seurat_obj@assays$ATAC)[1] I would guess that even though the peaks in your dataset are not overlapping, some are directly adjacent, leading to a merge upon intersection. This then leads to the downstream issue, where dim(regions@motifs@data)[1] != length(regions@peaks). I will fix this behaviour as soon as I find some time.

SeppeDeWinter commented 2 years ago

Good morning @joschif.

Thank you!

Best,

Seppe

SeppeDeWinter commented 1 year ago

Hi @joschif

Could it be that this still is an issue? Yesterday I reinstalled pando and got the following error:


> muo_data <- infer_grn(
    muo_data,
    peak_to_gene_method = 'GREAT',
    parallel = T)
Error in dimnamesGets(x, value) :
  length of Dimnames[[2]] (340204) is not equal to Dim[2] (340354)

This time this error is thrown by this line of code:

https://github.com/quadbiolab/Pando/blob/945d977e8f719e1939454fb8979bdb05f592c8b8/R/grn.R#L225

Again I had to solve this by ensuring a one-to-one mapping between cand_regions and peak_ranges.

regions <- NetworkRegions(muo_data)
cand_ranges <- regions@ranges
peak_ranges <- StringToGRanges(rownames(GetAssay(muo_data, assay='peaks')))
peak_overlaps <- findOverlaps(cand_ranges, peak_ranges)

peak_overlaps_df <- data.frame(peak_overlaps)

peak_overlaps_df <- peak_overlaps_df[!duplicated(peak_overlaps_df[['queryHits']]), ]
peak_overlaps_df <- peak_overlaps_df[!duplicated(peak_overlaps_df[['subjectHits']]), ]

new_motif_data <- regions@motifs@data[peak_overlaps_df[['queryHits']], ]
new_regions_peaks <- peak_overlaps_df[['subjectHits']]
new_ranges_data <- regions@ranges[peak_overlaps_df[['queryHits']]]

muo_data@grn@regions@motifs@data <- new_motif_data
muo_data@grn@regions@peaks <- new_regions_peaks
muo_data@grn@regions@ranges <- new_ranges_data
joschif commented 1 year ago

I believe I should have fixed the issue with directly adjacent regions, but overlapping peak regions resulting in ambiguous region accessibility are still a problem. Can you maybe check if your peak regions are adjacent or overlapping, i.e. if length(intersect(peak_ranges, peak_ranges)) < dim(seurat_obj@assays$ATAC)[1]?

SeppeDeWinter commented 1 year ago
> length(intersect(peak_ranges, peak_ranges)) < dim(muo_data@assays$peaks)[1]
[1] TRUE
joschif commented 1 year ago

Ok cool, and as you mentioned before, there should not be overlapping peaks in there, right?

SeppeDeWinter commented 1 year ago

There should not be any which overlap, do you have a snippet of code to make sure (I'm not too familiar with GRanges).

I Guess this, indicates that there are none.

> length(peak_ranges) == length(queryHits(findOverlaps(peak_ranges, minoverlap = 2)))
[1] TRUE

With minoverlap = 2 to discount adjacent regions i.e.

> findOverlaps(GRanges(c('chr1:100-200', 'chr1:200-300')), minoverlap = 2)
SelfHits object with 2 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         2           2
  -------
  queryLength: 2 / subjectLength: 2
joschif commented 1 year ago

I guess minoverlap=1 should be enough to exclude adjacent ranges. The ranges chr1:100-200 and chr1:200-300 would be overlapping at position 200 if I'm not mistaken. Maybe run this again to check if there are true overlaps. You could also extend the ranges by one in one direction end(peak_ranges) <- end(peak_ranges) + 1 and run this again to check if there are adjacent regions

SeppeDeWinter commented 1 year ago

Ah, yes you're right. My mistake.

But indeed, even with minoverlap = 1 there are no overlapping ones:

> length(peak_ranges) == length(queryHits(findOverlaps(peak_ranges, minoverlap = 1)))
[1] TRUE

but there are directly adjacent regions:

> peak_ranges_extend <- peak_ranges
> end(peak_ranges_extend) <- end(peak_ranges_extend) + 1
> length(peak_ranges_extend) == length(queryHits(findOverlaps(peak_ranges_extend, minoverlap = 1)))
[1] FALSE
joschif commented 1 year ago

Ok great, thanks, this means I did not properly fix it at the time 😅 I'll look into it

SeppeDeWinter commented 1 year ago

Okido!

Thanks for helping with the troubleshooting!

Have a great evening.

Best,

Seppe

vkartha commented 1 year ago

Hi there - was reading the comments on this issue as I am running into the same error, re: dim names and different number of regions, in the infer_grn step. I can confirm that my input peak ranges are also non-overlapping (isDisjoint(peakgranges) returns TRUE).

Error in dimnamesGets(x, value) : 
  length of Dimnames[[2]] (305744) is not equal to Dim[2] (305753)

Any help on this would be greatly appreciated!

joschif commented 1 year ago

I pushed a new version just now (v1.0.2) which should fix this issue, please let me know if it works for you

vkartha commented 1 year ago

Hi there, thanks for the quick turnaround on the new version. For some reason when I try to re-install, it still defaults to v1.0.1

devtools::install_github('quadbiolab/Pando')

  checking for file ‘/tmp/RtmpVm5H6t/remotes3fb5a05bb1d70b/quadbiolab-Pando-a878ee5/DESCRIPTION’ (706ms)
─  preparing ‘Pando’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘Pando_1.0.1.tar.gz’
<finishes_install, but leads to same error as above since version is older>

and still even when I explicitly specify v1.0.2:

devtools::install_github('quadbiolab/Pando@v1.0.2')

Downloading GitHub repo quadbiolab/Pando@v1.0.2
✔  checking for file ‘/tmp/RtmpVm5H6t/remotes3fb5a06db2d8d2/quadbiolab-Pando-d4a6084/DESCRIPTION’ (598ms)
─  preparing ‘Pando’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘Pando_1.0.1.tar.gz’

* installing *source* package ‘Pando’ ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** byte-compile and prepare package for lazy loading
Note: wrong number of arguments to 'sign' 
** help
*** installing help indices
*** copying figures
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (Pando)

it says 1.0.1 is the latest on there using remotes/devtools, but I do see a v1.0.2 tag on github? This is all under the main branch, it appears?

Thanks again for all the help in advance in helping run our data through this workflow!

joschif commented 1 year ago

Ah yes, that's because I tagged the version on github but did not change the version in the R DESCRIPTION file. It should still be the right version though.

vkartha commented 1 year ago

Hi @joschif , still hitting the same issue even with 1.0.2 (sorry I wasn't clear in my last post, I meant despite the tag I was running into the same error, so I wasn't sure if that was because the code fix wasn't actually pushed or if I was still getting the error for a different reason using the newer tag). Appreciate your help in figuring this out:

Error in dimnamesGets(x, value) : length of Dimnames[[2]] (305744) is not equal to Dim[2] (305753)

joschif commented 1 year ago

Ah ok, that's unfortunate. Can you give me the output of the following after running find_motifs()?

regions <- NetworkRegions(seurat_obj)
length(regions@peaks)
length(unique(regions@peaks))
dim(regions@motifs@data)
dim(seurat_obj@assays$ATAC)

and

peak_ranges <- StringToGRanges(rownames(GetAssay(seurat_obj, assay='ATAC')))
cand_ranges <- regions@ranges
peak_overlaps <- findOverlaps(cand_ranges, peak_ranges)
length(intersect(peak_ranges, peak_ranges))
length(queryHits(peak_overlaps))
length(unique(queryHits(peak_overlaps)))
vkartha commented 1 year ago
> regions <- NetworkRegions(rna.seu)
> length(regions@peaks)
[1] 305753
> length(unique(regions@peaks))
[1] 172648
> dim(regions@motifs@data)
[1] 305744   1590
> dim(rna.seu@assays$peaks)
[1] 468932  22297

and

> peak_ranges <- StringToGRanges(rownames(GetAssay(rna.seu, assay='peaks')))
> cand_ranges <- regions@ranges
> peak_overlaps <- findOverlaps(cand_ranges, peak_ranges)
> length(intersect(peak_ranges, peak_ranges))
[1] 468784
> length(queryHits(peak_overlaps))
[1] 305753
> length(unique(queryHits(peak_overlaps)))
[1] 305744
joschif commented 1 year ago

Yes so it appears that also here there are some directly adjacent peaks and this could be caused by the same problem... Does it work for you if you input candidate ranges that match the peaks 1:1 as mentioned earlier?

frankRuehle commented 1 year ago

Hi there, I am also running into this problem when I have adjacent peak regions and try to initiate the RegulatoryNetwork object with initiate_grn(object, regions = NULL, exclude_exons = TRUE). As far as I understand it, this problem comes from a property of the setdiff function used to remove the exon regions in the initiate_grn function:

if (exclude_exons) {
    exon_ranges <- gene_annot[gene_annot$type == "exon", 
    ]
    names(exon_ranges@ranges) <- NULL
    exon_ranges <- IRanges::intersect(exon_ranges, exon_ranges)
    exon_ranges <- GenomicRanges::GRanges(seqnames = exon_ranges@seqnames, 
                                          ranges = exon_ranges@ranges)
    cand_ranges <- IRanges::setdiff(cand_ranges, exon_ranges, 
                                    ignore.strand = TRUE)    ### adjacent output regions are merged!
  }
  peak_overlaps <- findOverlaps(cand_ranges, peak_ranges) ### Hits object with duplicated queryHits entries for adjacent regions!!

setdiff automatically performs a reduce operation on the output regions. That means that 1 cand_ranges (after setdiff exon removal) can cover 2 adjacent peak_ranges and therefore maps to both peak_ranges in the subsequent findOverlaps call. This leads to the duplicated queryHits in peak_overlaps whenever we have directly adjacent regions in peak_ranges and this causes the error later. This setdiff behaviour is nicely described here, alongside with potential options for fixing.

joschif commented 1 year ago

I've just pushed a new version that might fix the issue outlined by @frankRuehle with a solution suggested on biostars.