tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/
137 stars 19 forks source link

[BUG] Inconsistent behaviour of join_nearest_downstream() with one unstranded range set? #106

Open eggrandio opened 9 months ago

eggrandio commented 9 months ago

I am not sure if this is intented or if I am missunderstanding join_nearest_downstream().

I have a set of regions that are unstranded (they come from ChIPseq data). I am trying to find the genes that have any of these regions upstream of their transcription start site (or overlap with their coding region). These regions are unstranded but I would like to obtain only the nearest gene downstream of them (meaning that if there is a nearer gene, but the region is downstream of that gene, this gene is ignored).

I use join_nearest_downstream() as it should take into account the strandness of the gene regions and I assume it ignores the strandness of the "x" region set as per the documentation: "method will find arbitrary nearest neighbour ranges on x that are upstream of those on y", but it seems this is not the case. I should find at least one gene for each of the regions in the dataset (these genes might be duplicated if they have several regions in their upstream region, but I can deal with that).

Is this the correct way of doing it ?

The issue is that for some of these regions, no downstream gene is found. When I visualize them or manually check if there is a gene close, I can find it. I can also use join_nearest() to find them, but that has some limitations (see below).

I am working with Arabidopsis data, so as genes I am using these data:

ath_coding_genes <- read_gff("./Arabidopsis_thaliana.TAIR10.57.gff3.gz") %>% 
  filter(type == "gene")

Then I try to find the nearest gene that is downstream of my regions:

my_regions <- dget("original_data.txt") #these regions are unstranded
genes_downstream <- plyranges::join_nearest_downstream(my_regions, ath_coding_genes, distance = TRUE)

This works as expected for many genes, but in some cases it doesnt. I checked some genes that shuold have a peak upstream but they do not show in the "joined" dataset:

test_peak <- "Merged-1-19698653-1"
test_gene <- "AT1G52890"

> my_regions %>% filter(name == test_peak)
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |                name     score
         <Rle>         <IRanges>  <Rle> |         <character> <numeric>
  [1]        1 19698603-19698703      * | Merged-1-19698653-1         1
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
> ath_coding_genes %>% filter(gene_id == test_gene) %>% plyranges::select(gene_id)
GRanges object with 1 range and 1 metadata column:
      seqnames            ranges strand |     gene_id
         <Rle>         <IRanges>  <Rle> | <character>
  [1]        1 19696923-19698482      - |   AT1G52890
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name     gene_id
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name     gene_id
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths

Both regions are 120 nt apart:

> reg1 <- my_regions %>% filter(name == test_peak)
> reg2 <- ath_coding_genes %>% filter(gene_id == test_gene)
> distance(reg1, reg2)
[1] 120

If I use join_nearest() instead, they are joined. However, using nearest will also join some genes that have these regions downstream, and I only want the genes that have these regions upstream.

> nearest_genes <- plyranges::join_nearest(my_regions, ath_coding_genes, distance = TRUE)
> nearest_genes %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |                name     gene_id
         <Rle>         <IRanges>  <Rle> |         <character> <character>
  [1]        1 19698603-19698703      * | Merged-1-19698653-1   AT1G52890
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
> nearest_genes %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |                name     gene_id
         <Rle>         <IRanges>  <Rle> |         <character> <character>
  [1]        1 19698603-19698703      * | Merged-1-19698653-1   AT1G52890
  [2]        1 19698483-19698583      * | Merged-1-19698533-1   AT1G52890
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
eggrandio commented 9 months ago

For reference, I have written a function using basic GRanges that does what I expected (find which genes are downstream or overlapping a set of regions).

If the observerd behaviour of join_nearest_downstream() is expected, I would modify the documentation and/or include a warning message if using unstranded GRanges.

find_preceding = function(peak_regions, gene_granges) {
  # First we find the genes that have an overlapping region in their coding sequence:
  overlapping_hits <- findOverlaps(peak_regions, gene_granges)

  overlapping_regions <- peak_regions[queryHits(overlapping_hits)]
  overlapping_genes <- gene_granges[subjectHits(overlapping_hits)]
  overlapping_gene_ids <- overlapping_genes$gene_id

  overlapping_granges <- overlapping_regions %>% 
    mutate(gene_id = overlapping_gene_ids,
           dist = distance(overlapping_regions, overlapping_genes)) %>% 
    mutate(name = paste0(name, "_", gene_id))

  # Then we remove the overlapping regions and search for regions preceding genes
  non_overlapping_regions <- peak_regions[-queryHits(overlapping_hits)]

  preceding_hits <- precede(non_overlapping_regions, gene_granges)
  preceding_genes <- gene_granges[preceding_hits]
  preceding_gene_ids <- preceding_genes$gene_id

  preceding_granges <- non_overlapping_regions %>% 
    mutate(gene_id = preceding_gene_ids,
           dist = distance(non_overlapping_regions, preceding_genes)) %>% 
    mutate(name = paste0(name, "_", gene_id))

  # Finally, we merge the overlapping and preceding granges
  final_granges <- bind_ranges(overlapping_granges, preceding_granges) %>% 
    sortSeqlevels() %>% sort()
  return(final_granges)
}