tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/
137 stars 19 forks source link

Strange behaviour of join_nearest* #76

Closed smped closed 4 years ago

smped commented 4 years ago

Hi Stuart,

Love the package (obviously), but I'm having some dramas with join_nearest*(). In my following example, I need to join feature1 with gene1. The gene is upstream, on the opposite strand to my feature.

library(plyranges)
a <- GRanges("1:11-12:+")
mcols(a)$name <- "feature1"
a
GRanges object with 1 range and 1 metadata column:
      seqnames    ranges strand |        name
         <Rle> <IRanges>  <Rle> | <character>
  [1]        1     11-12      + |    feature1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
b <- GRanges(c("1:1-2:-", "1:21-22:+"))
mcols(b)$gene <- c("gene1", "gene2")
b
GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand |        gene
         <Rle> <IRanges>  <Rle> | <character>
  [1]        1       1-2      - |       gene1
  [2]        1     21-22      + |       gene2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Given that gene1 is on the opposite strand to feature 1, join_nearest_upstream() and join_nearest_downstream() behave exactly how I expect.

join_nearest_upstream(a, b)
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name        gene
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
join_nearest_downstream(a, b)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |        name        gene
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]        1     11-12      + |    feature1       gene2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

However, if strand information is ignored by join_nearest_left() and join_nearest_right(), I cannot see how the following is able to be explained.

join_nearest_left(a, b)
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name        gene
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
join_nearest_right(a, b)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |        name        gene
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]        1     11-12      + |    feature1       gene2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Is the strand information only ignored for one of the ranges (x = a, y = b)?. If I manually remove the strand information, I still cannot join these ranges.

join_nearest_left(unstrand(a), unstrand(b))
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name        gene
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
join_nearest_right(unstrand(a), unstrand(b))
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |        name        gene
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]        1     11-12      * |    feature1       gene2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

It seems that feature1 & gene1 are unable to be joined by any method here. Should I be using something else?

Thanks,

Steve

sa-lee commented 4 years ago

Hi Steve,

That does seem odd! I’ll take a look at this in more detail next week. In general the docs for these functions could use a bit of work.

Have you tried -join_follow() and friends they might be useful too?

Thanks Stuart

smped commented 4 years ago

Thanks Stuart.

join_follow/join_follow_left() do work, so that might be a possible fix. We're trying to migrate a bedtools workflow to plyranges, so I'll compare results across the two methods once my collaborator is back in with the data from the original run.

Cheers,

Steve

sa-lee commented 4 years ago

Hi Steve,

Glad to hear that follow works. Looking further I'm not so sure if this is a bug, but maybe the documentation is confusing.

plyranges uses IRanges::nearest() under the hood here, with select = "all" and ignore.strand = TRUE used by default. From the nearest docs:

The conventional nearest neighbor finder. Returns an integer vector containing the index of the nearest neighbor range in subject for each range in x. If there is no nearest neighbor (if subject is empty), NA's are returned.

In your example, running nearest(a,b) will always return a vector or hits object parallel to the input (i.e. length 1) and in your example the distance of both ranges in b to those in a is 8, so nearest selects the one with the largest index value.

IRanges::nearest(a,b, ignore.strand = TRUE, select = "all")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           2
  -------
  queryLength: 1 / subjectLength: 2

IRanges::nearest(a, b, ignore.strand = FALSE, select = "all")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           2
  -------
  queryLength: 1 / subjectLength: 2

Hence the identical results for both upstream/right. I realise here that the docs are probably not clear on this matter but hopefully that helps!

ShanSabri commented 4 years ago

Hi Stuart -- Thanks for the great package. I was wondering if it was possible to find the k-nearest neighbors as opposed to the single nearest. For example, I'm interested in tagging ATAC peaks with the 5 nearest genes. I've opened up an issue on GenomicRanges() regarding an unexported KNN function and was wondering if you had any insight.

The functions below seem to work perfectly for k=1 nearest neighbor, but I'd like to extend this to k>1.

>   IRanges::nearest(peaks, tss, ignore.strand = FALSE, select = "all") # k = 1; nearest peak to loci
Hits object with 295913 hits and 0 metadata columns:
           queryHits subjectHits
           <integer>   <integer>
       [1]         1       15215
       [2]         2       15215
       [3]         3       15215
       [4]         4       15215
       [5]         5       15215
       ...       ...         ...
  [295909]    295640       16535
  [295910]    295641       16535
  [295911]    295642       16535
  [295912]    295643       16535
  [295913]    295644       16535
  -------
  queryLength: 295644 / subjectLength: 18436

>   GenomicRanges::distanceToNearest(peaks, tss, select = "all"))# k = 1; nearest peak to loci
Hits object with 295913 hits and 1 metadata column:
           queryHits subjectHits |  distance
           <integer>   <integer> | <integer>
       [1]         1       15215 |    107265
       [2]         2       15215 |    107065
       [3]         3       15215 |    106865
       [4]         4       15215 |    106665
       [5]         5       15215 |    106465
       ...       ...         ... .       ...
  [295909]    295640       16535 |     42858
  [295910]    295641       16535 |     43058
  [295911]    295642       16535 |     43258
  [295912]    295643       16535 |     43458
  [295913]    295644       16535 |     43658
  -------
  queryLength: 295644 / subjectLength: 18436

Any help would be much appreciated!

BTW - It's Shan. We interned together at Genentech. I hope things are going well for you!

sa-lee commented 4 years ago

Hi Shan,

Nice to hear from you, hope you're doing well! One strategy you could try in this particular case is to expand on the transcription start site, then use the left overlap join + an aggregation. Take a look a here for an example of this.

also @shansabri - if you like could you please make another issue for this? that way i can get to implementing it at some point (but also PRs welcome ;P)

ShanSabri commented 4 years ago

expand on the transcription start site, then use the left overlap join + an aggregation

Hmm this is still unclear to me even after looking at the example. I should clarify that I'm able to compute the overlap over a range, but that's not really what I'm aiming to do here.

From the example you posted:

For any gene that has no overlaps, the DA peak columns will have NA’s

I'd like to take the, say 3, nearest genes to a given position (even if they are very far) while also returning their corresponding distances.

Let's continue discussion over here.

Thanks for getting back!

sa-lee commented 4 years ago

No problem, I think the example is going the other way around from what you want as well.

sa-lee commented 4 years ago

@steveped I'm assuming you managed to resolve this, so closing the issue now.

smped commented 4 years ago

Thanks @sa-lee. Passed the conversation to my confused collaborator & they took it from here. Really appreciated the help too.