tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/

Computing KNN between two granges objects #77

Open ShanSabri opened 4 years ago

ShanSabri commented 4 years ago

Hi Stuart,

Thanks for the great package.

I was wondering whether it is possible to find the k nearest neighbors rather than only the single nearest. For example, I'm interested in tagging ATAC peaks with their 5 nearest genes. I've opened an issue on GenomicRanges regarding its unexported findKNN() function and was wondering if you had any insight.

The functions below seem to work perfectly for k=1 nearest neighbor, but I'd like to extend this to k>1, while also retaining the corresponding distances:

>   IRanges::nearest(peaks, tss, ignore.strand = FALSE, select = "all") # k = 1; nearest peak to loci
Hits object with 295913 hits and 0 metadata columns:
           queryHits subjectHits
           <integer>   <integer>
       [1]         1       15215
       [2]         2       15215
       [3]         3       15215
       [4]         4       15215
       [5]         5       15215
       ...       ...         ...
  [295909]    295640       16535
  [295910]    295641       16535
  [295911]    295642       16535
  [295912]    295643       16535
  [295913]    295644       16535
  -------
  queryLength: 295644 / subjectLength: 18436

>   GenomicRanges::distanceToNearest(peaks, tss, select = "all") # k = 1; nearest peak to loci
Hits object with 295913 hits and 1 metadata column:
           queryHits subjectHits |  distance
           <integer>   <integer> | <integer>
       [1]         1       15215 |    107265
       [2]         2       15215 |    107065
       [3]         3       15215 |    106865
       [4]         4       15215 |    106665
       [5]         5       15215 |    106465
       ...       ...         ... .       ...
  [295909]    295640       16535 |     42858
  [295910]    295641       16535 |     43058
  [295911]    295642       16535 |     43258
  [295912]    295643       16535 |     43458
  [295913]    295644       16535 |     43658
  -------
  queryLength: 295644 / subjectLength: 18436

Any help would be much appreciated!

EDIT: I should mention that reproducible data and examples are posted on the GenomicRanges issue I opened.

ShanSabri commented 4 years ago

I managed to work up a solution that seems to work for my case.
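
For readers looking for a starting point, here is a minimal sketch of one way to get k > 1 neighbors together with their distances. It is not necessarily the approach referred to above. The helper name `k_nearest()` and its `window` parameter are hypothetical and not part of plyranges or GenomicRanges. The sketch assumes that all relevant neighbors lie within `window` base pairs of each query, so more distant subjects are never considered.

```r
library(GenomicRanges)

# Hypothetical helper (an assumption for illustration, not a package API):
# return up to k nearest subject ranges per query, with distances.
# Only subjects within `window` bp of a query are candidates.
k_nearest <- function(query, subject, k = 5L, window = 1e6L,
                      ignore.strand = FALSE) {
  # All query/subject pairs separated by a gap of at most `window` bp
  hits <- findOverlaps(query, subject, maxgap = window,
                       ignore.strand = ignore.strand)
  q <- queryHits(hits)
  s <- subjectHits(hits)

  # Gap width between each paired range (0 when overlapping or adjacent)
  d <- distance(query[q], subject[s], ignore.strand = ignore.strand)

  # Sort by query index, then by distance, and rank within each query
  ord <- order(q, d)
  q <- q[ord]
  s <- s[ord]
  d <- d[ord]
  rank <- ave(d, q, FUN = seq_along)

  # Keep the k closest candidates per query
  keep <- rank <= k
  data.frame(queryHits = q[keep], subjectHits = s[keep],
             distance = d[keep], rank = rank[keep])
}

# Toy data standing in for `peaks` and `tss`
peaks <- GRanges("chr1", IRanges(start = c(100, 5000), width = 10))
tss <- GRanges("chr1", IRanges(start = c(50, 200, 4000, 9000), width = 1))
res <- k_nearest(peaks, tss, k = 2L)
```

Queries with fewer than k candidates inside the window simply return fewer rows, so a larger `window` trades memory for completeness.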

sa-lee commented 4 years ago

Glad you managed to get something working for your needs. When I have more time, I will try to implement a family of join_nearest_neighbor_*() functions based on your use case. I'd be happy to add you as a contributor if you would like to have a go at a PR. cc @lawremi