Open eggrandio opened 9 months ago
For reference, I have written a function using basic GRanges that does what I expected (find which genes are downstream or overlapping a set of regions).
If the observed behaviour of join_nearest_downstream() is expected, I would modify the documentation and/or include a warning message when it is used with unstranded GRanges.
library(GenomicRanges)
library(plyranges)

find_preceding <- function(peak_regions, gene_granges) {
  # First, find the genes whose coding sequence overlaps a region.
  overlapping_hits <- findOverlaps(peak_regions, gene_granges)
  overlapping_regions <- peak_regions[queryHits(overlapping_hits)]
  overlapping_genes <- gene_granges[subjectHits(overlapping_hits)]
  overlapping_gene_ids <- overlapping_genes$gene_id
  overlapping_granges <- overlapping_regions %>%
    mutate(gene_id = overlapping_gene_ids,
           dist = distance(overlapping_regions, overlapping_genes)) %>%
    mutate(name = paste0(name, "_", gene_id))

  # Then, drop the overlapping regions and search for regions preceding genes.
  # setdiff() is used instead of negative indexing, which would return an
  # empty object when there are no overlaps at all.
  keep <- setdiff(seq_along(peak_regions), queryHits(overlapping_hits))
  non_overlapping_regions <- peak_regions[keep]
  preceding_hits <- precede(non_overlapping_regions, gene_granges)

  # precede() returns NA when a region precedes no gene; drop those regions
  # so that the subsetting below does not fail.
  has_gene <- !is.na(preceding_hits)
  non_overlapping_regions <- non_overlapping_regions[has_gene]
  preceding_genes <- gene_granges[preceding_hits[has_gene]]
  preceding_gene_ids <- preceding_genes$gene_id
  preceding_granges <- non_overlapping_regions %>%
    mutate(gene_id = preceding_gene_ids,
           dist = distance(non_overlapping_regions, preceding_genes)) %>%
    mutate(name = paste0(name, "_", gene_id))

  # Finally, merge the overlapping and preceding ranges and sort them.
  final_granges <- bind_ranges(overlapping_granges, preceding_granges) %>%
    sortSeqlevels() %>%
    sort()
  return(final_granges)
}
I am not sure if this is intended or if I am misunderstanding join_nearest_downstream().
I have a set of regions that are unstranded (they come from ChIPseq data). I am trying to find the genes that have any of these regions upstream of their transcription start site (or overlap with their coding region). These regions are unstranded but I would like to obtain only the nearest gene downstream of them (meaning that if there is a nearer gene, but the region is downstream of that gene, this gene is ignored).
I use join_nearest_downstream() because it should take into account the strandedness of the gene regions, and I assumed it ignores the strandedness of the "x" region set, as per the documentation: "method will find arbitrary nearest neighbour ranges on x that are upstream of those on y". However, this does not seem to be the case. I should find at least one gene for each of the regions in the dataset (these genes might be duplicated if they have several regions upstream of them, but I can deal with that).
Is this the correct way of doing it?
The issue is that for some of these regions, no downstream gene is found. When I visualize them or manually check whether there is a gene nearby, I can find one. I can also use join_nearest() to find them, but that has some limitations (see below).
I am working with Arabidopsis data, so as genes I am using these data:
Then I try to find the nearest gene that is downstream of my regions:
This works as expected for many genes, but in some cases it doesn't. I checked some genes that should have a peak upstream, but they do not show up in the "joined" dataset:
Both regions are 120 nt apart:
If I use join_nearest() instead, they are joined. However, join_nearest() will also join some genes that have these regions downstream of them, and I only want the genes that have these regions upstream.
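To make the intended behaviour concrete, here is a minimal sketch of one way to get "nearest gene that has the region upstream of it" with plain GenomicRanges. It is not how join_nearest_downstream() works internally. The coordinates, gene IDs and the helper name nearest_downstream_gene() are made up for illustration, and regions that overlap a gene are not handled here (findOverlaps() could cover those). The idea is to split the genes by strand and ignore strand in the search itself: a "+" gene has the region upstream when the gene lies to the right of it, and a "-" gene when the gene lies to the left of it.

```r
library(GenomicRanges)

# Toy data with hypothetical coordinates and gene IDs.
genes <- GRanges(
  seqnames = "Chr1",
  ranges = IRanges(start = c(1000, 3000), end = c(2000, 4000)),
  strand = c("+", "-"),
  gene_id = c("geneA", "geneB")
)
# Unstranded regions, as obtained from ChIP-seq peaks.
peaks <- GRanges(
  seqnames = "Chr1",
  ranges = IRanges(start = c(800, 2100, 4100), end = c(880, 2200, 4200))
)

# Hypothetical helper: for each region, return the gene_id of the nearest
# gene that has the region upstream of it, or NA if there is none.
nearest_downstream_gene <- function(regions, genes) {
  plus  <- genes[as.character(strand(genes)) == "+"]
  minus <- genes[as.character(strand(genes)) == "-"]

  # With ignore.strand = TRUE everything is treated as "+", so precede()
  # finds the nearest gene to the right of a region and follow() the
  # nearest gene to the left of it.
  idx_plus  <- precede(regions, plus, ignore.strand = TRUE)
  idx_minus <- follow(regions, minus, ignore.strand = TRUE)

  # Distance to each candidate; Inf marks "no candidate on that strand".
  d_plus  <- rep(Inf, length(regions))
  d_minus <- rep(Inf, length(regions))
  ok_p <- !is.na(idx_plus)
  ok_m <- !is.na(idx_minus)
  d_plus[ok_p] <- distance(regions[ok_p], plus[idx_plus[ok_p]],
                           ignore.strand = TRUE)
  d_minus[ok_m] <- distance(regions[ok_m], minus[idx_minus[ok_m]],
                            ignore.strand = TRUE)

  # Keep the closer of the two candidates (NA when neither exists).
  ifelse(d_plus <= d_minus,
         plus$gene_id[idx_plus],
         minus$gene_id[idx_minus])
}

res <- nearest_downstream_gene(peaks, genes)
print(res)
```

In this toy set the second region sits between the two genes, on the 3' side of both, so it gets NA here, whereas a plain nearest-neighbour search would assign it to the closer of the two genes regardless of orientation.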