tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/

distance to nearest helper function? #79

Closed snystrom closed 4 years ago

snystrom commented 4 years ago

Often I want to compute the distance between the ranges of two GRanges objects. I usually solve this with the GenomicRanges function distanceToNearest, then append the distance mcol as a new column on the subject hits. This is annoying to do inside a plyranges::mutate call because, as far as I'm aware, it requires two steps.

It would be nice to add a helper function to facilitate this, perhaps as below:

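distance_to_nearest <- function(query, subject, ...){
  # Hits object carrying a `distance` metadata column, one row per matched query
  hits <- GenomicRanges::distanceToNearest(query, subject, ...)
  # Return just the distances
  S4Vectors::mcols(hits)$distance
}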

I'm happy to implement this and do a PR, but some thoughts on implementation or input on whether I'm forgetting an edge case or something would be nice first.
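One edge case worth sketching: distanceToNearest drops query ranges that have no neighbour (for example, when the subject has no ranges on the same sequence), so the returned distances can be shorter than the query. Below is a minimal sketch of a padded variant, assuming NA is an acceptable placeholder. The name distance_to_nearest_padded and the toy ranges are hypothetical, chosen only for illustration.

```r
library(GenomicRanges)

# Hypothetical variant of the helper above: returns one value per query range,
# with NA where distanceToNearest() finds no neighbour.
distance_to_nearest_padded <- function(query, subject, ...) {
  hits <- distanceToNearest(query, subject, ...)
  distance <- rep(NA_integer_, length(query))
  # Fill in distances only for the queries that appear in the Hits object
  distance[queryHits(hits)] <- mcols(hits)$distance
  distance
}

# Toy data: the chr2 range has no possible neighbour in `subject`
query <- GRanges(c("chr1:10-20", "chr1:100-110", "chr2:5-15"))
subject <- GRanges(c("chr1:30-40", "chr1:200-210"))

# Safe to assign directly because the result has length(query) elements
query$distance <- distance_to_nearest_padded(query, subject)
```

With the unpadded helper, the same assignment would receive two values for three ranges, which is why the padding matters if the result is meant to be used as a new column.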

snystrom commented 4 years ago

Alternatively, it just occurred to me that the ideal solution would be for the join_nearest_* family of functions to have a distance flag that adds the distance from the anchor point to the joined range.
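As a rough illustration of what such a flag could look like, here is a sketch written directly against GenomicRanges. It is not the plyranges implementation: the function name join_nearest_sketch, its distance argument, and the toy data are all assumptions, and it ignores strand to mirror the undirected join.

```r
library(GenomicRanges)

# Hypothetical helper for illustration only; not the plyranges implementation.
join_nearest_sketch <- function(x, y, distance = FALSE) {
  # One hit per range in x that has a neighbour in y, strand ignored
  hits <- distanceToNearest(x, y, ignore.strand = TRUE)
  joined <- x[queryHits(hits)]
  # Carry over the metadata columns of the matched ranges in y
  mcols(joined) <- cbind(mcols(joined), mcols(y[subjectHits(hits)]))
  if (distance) {
    # The Hits object already holds the distance for each matched pair
    joined$distance <- mcols(hits)$distance
  }
  joined
}

# Toy data with one metadata column on each side
x <- GRanges(c("chr1:10-20", "chr1:100-110"))
x$peak <- c("p1", "p2")
y <- GRanges(c("chr1:30-40", "chr1:200-210"))
y$gene <- c("a", "b")

joined <- join_nearest_sketch(x, y, distance = TRUE)
```

The appeal of this design is that the distance comes for free from the same Hits object used to build the join, so no second pass over the ranges is needed.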

Edit: I am also comfortable trying to implement this approach, but feedback before I start would probably be a good thing.

sa-lee commented 4 years ago

Hi!

I agree that this would be a cool feature to have! Two thoughts on how to proceed, I think both would be useful.

  1. Add an argument to join_nearest that includes a distance column in the output, with FALSE as the default. If you look at the current implementation, I think this would just mean swapping out the function that generates the Hits, as I'm fairly sure distanceToNearest and nearest are equivalent.

  2. Add a function called add_nearest_distance that just adds the distance as a metadata column (mcols) on the query. It would be similar in design to add_count from dplyr. This reminds me that add_overlap_count would be useful as well.
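The equivalence assumed in point 1 can be checked on a toy example. This is only a sketch on made-up ranges, not a proof: it compares the indices returned by nearest with the hits returned by distanceToNearest, both with strand ignored.

```r
library(GenomicRanges)

# Toy data: the chr2 range has no neighbour in y
x <- GRanges(c("chr1:10-20", "chr1:100-110", "chr2:5-15"))
y <- GRanges(c("chr1:30-40", "chr1:200-210"))

# nearest() returns an integer index into y, NA when there is no neighbour
idx <- nearest(x, y, ignore.strand = TRUE)
# distanceToNearest() returns a Hits object and omits queries without a neighbour
hits <- distanceToNearest(x, y, ignore.strand = TRUE)

has_hit <- !is.na(idx)
same_query <- identical(which(has_hit), queryHits(hits))
same_subject <- identical(idx[has_hit], subjectHits(hits))
```

If both comparisons are TRUE, the two functions agree on which ranges are paired, and the only extra information in the Hits object is the distance column.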

Happy to review any PR if you're keen to have a go!

snystrom commented 4 years ago

Sounds good. I've got something working and will PR soon. Quick question about defaults: currently, my implementation of add_nearest_distance uses the default behavior of distanceToNearest, which has ignore.strand = FALSE as its default. I wonder if that might cause confusion, since join_nearest() uses ignore.strand = TRUE. Maybe this is just handled by good documentation, but if you have thoughts on what would integrate best with the rest of the stack, I'm open to suggestions.

sa-lee commented 4 years ago

Thanks for the PR!

Our default is always ignore.strand = TRUE, but we don't expose this as an argument to a function. Instead, we add functions that take strand into account with the directed suffix, so I would usually split this up so there would be an add_nearest_distance_directed and an add_nearest_distance. I'll try and take a look at this over the next couple of days :D
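A minimal sketch of that split, assuming a shared internal worker: the undirected function fixes ignore.strand = TRUE and the directed one fixes ignore.strand = FALSE. The worker name .nearest_distance, the name argument, and the NA padding for ranges without a neighbour are assumptions made for illustration, not the code from the PR.

```r
library(GenomicRanges)

# Hypothetical shared worker: adds a distance column to x, NA where no neighbour exists
.nearest_distance <- function(x, y, name, ignore.strand) {
  hits <- distanceToNearest(x, y, ignore.strand = ignore.strand)
  distance <- rep(NA_integer_, length(x))
  distance[queryHits(hits)] <- mcols(hits)$distance
  mcols(x)[[name]] <- distance
  x
}

# Undirected version: strand is ignored and never exposed as an argument
add_nearest_distance <- function(x, y, name = "distance") {
  .nearest_distance(x, y, name, ignore.strand = TRUE)
}

# Directed version: only strand-compatible ranges are considered
add_nearest_distance_directed <- function(x, y, name = "distance") {
  .nearest_distance(x, y, name, ignore.strand = FALSE)
}

# Toy data: the closest range is on the opposite strand
x <- GRanges("chr1:100-110:+")
y <- GRanges(c("chr1:120-130:-", "chr1:300-310:+"))

undirected <- add_nearest_distance(x, y)$distance
directed <- add_nearest_distance_directed(x, y)$distance
```

On this toy data the two calls pick different neighbours, which shows why keeping strand handling in the function name rather than in an argument makes the behaviour visible at the call site.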

snystrom commented 4 years ago

Oh, duh. I'll add those into the PR.