tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/
140 stars 18 forks source link

Feature requeste for full join of GRanges #68

Open alexyfyf opened 5 years ago

alexyfyf commented 5 years ago

Hi All,

I've been working on combing GRanges from several datasets, and wish to have a function from your package that works similar to the tidyverse full_join which includes any extry from at least one Granges object. Also, metadata full_joined.

I have written a simple function to use by myself and worked on my dataset (GRanges with meta data). But I found it hard to incorporate to your packages because I think yours works on a range of *Ranges objects. Anyway, I would like to help and share it if anyone is interested to develop such a function.

join_overlap_full <- function(x,y){
  options(stringsAsFactors = F)
  hit <- findOverlaps(x,y)
  colnames(mcols(x)) <- paste0(colnames(mcols(x)),".x")
  colnames(mcols(y)) <- paste0(colnames(mcols(y)),".y")
  cns.x <- mcols(x) %>% colnames()
  cns.y <- mcols(y) %>% colnames()

  z <- x[queryHits(hit),]
  hitz <- cbind(mcols(y)[subjectHits(hit),],
                mcols(x)[queryHits(hit),])
  colnames(hitz) <- c(cns.y,cns.x)
  mcols(z) <- hitz
  uniqex <- x[-queryHits(hit),]
  uniqey <- y[-subjectHits(hit),]
  mcols(uniqex)[, (length(cns.x)+1):(length(cns.x)+length(cns.y))] <- NA
  colnames(mcols(uniqex)) <- c(cns.x,cns.y)
  mcols(uniqex) <- mcols(uniqex)[,c((length(cns.x)+1):(length(cns.x)+length(cns.y)),1:(length(cns.x)))]
  mcols(uniqey)[, (length(cns.y)+1):(length(cns.x)+length(cns.y))] <- NA
  colnames(mcols(uniqey)) <- c(cns.y,cns.x)
  res <- c(z, uniqex,uniqey)
  mcols(res) <- mcols(res) %>% data.frame() %>% mutate_all(as.character())
  return(res)
}
sa-lee commented 5 years ago

Hi there, I'm not sure exactly what a full join would look like? Could you provide a simple example? If it makes sense you are more than welcome to make a pull request. Look at the code for https://github.com/sa-lee/plyranges/blob/master/R/ranges-overlap-joins-outer.R and https://github.com/sa-lee/plyranges/blob/master/R/ranges-overlap-find.R to see how I've constructed join_overlap_left()

alexyfyf commented 5 years ago

Hi there, I'm not sure exactly what a full join would look like? Could you provide a simple example? If it makes sense you are more than welcome to make a pull request. Look at the code for https://github.com/sa-lee/plyranges/blob/master/R/ranges-overlap-joins-outer.R and https://github.com/sa-lee/plyranges/blob/master/R/ranges-overlap-find.R to see how I've constructed join_overlap_left()

Thank you for your reply. I'll have a look and try to make a pull request. The example would like this. If there's any function from your package could do this, please let me know.

gr1 <- GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 3, 4)),
               IRanges(1:10, width=5),
               strand='-',
               score=101:110, GC=runif(10))
gr2 <- GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 4, 3)),
               IRanges(6:15, width=10),
               strand='-',
               score=21:30)
join_overlap_full(gr1,gr2)

The output will be like: gr1

GRanges object with 10 ranges and 2 metadata columns:
       seqnames    ranges strand |     score                 GC
          <Rle> <IRanges>  <Rle> | <integer>          <numeric>
   [1]     chr1       1-5      - |       101  0.930739698465914
   [2]     chr1       2-6      - |       102  0.566913234768435
   [3]     chr1       3-7      - |       103  0.682077450212091
   [4]     chr2       4-8      - |       104  0.889251226792112
   [5]     chr2       5-9      - |       105 0.0200071565341204
   [6]     chr2      6-10      - |       106  0.208122601732612
   [7]     chr3      7-11      - |       107 0.0573086701333523
   [8]     chr3      8-12      - |       108  0.578096957644448
   [9]     chr3      9-13      - |       109  0.802684629801661
  [10]     chr3     10-14      - |       110   0.71569463587366
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

gr2

GRanges object with 10 ranges and 1 metadata column:
       seqnames    ranges strand |     score
          <Rle> <IRanges>  <Rle> | <integer>
   [1]     chr1      6-15      - |        21
   [2]     chr1      7-16      - |        22
   [3]     chr1      8-17      - |        23
   [4]     chr2      9-18      - |        24
   [5]     chr2     10-19      - |        25
   [6]     chr2     11-20      - |        26
   [7]     chr2     12-21      - |        27
   [8]     chr3     13-22      - |        28
   [9]     chr3     14-23      - |        29
  [10]     chr3     15-24      - |        30
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

The full_joined

GRanges object with 17 ranges and 3 metadata columns:
       seqnames    ranges strand |   score.y   score.x               GC.x
          <Rle> <IRanges>  <Rle> | <integer> <integer>          <numeric>
   [1]     chr1       2-6      - |        21       102  0.566913234768435
   [2]     chr1       3-7      - |        21       103  0.682077450212091
   [3]     chr1       3-7      - |        22       103  0.682077450212091
   [4]     chr2       5-9      - |        24       105 0.0200071565341204
   [5]     chr2      6-10      - |        24       106  0.208122601732612
   ...      ...       ...    ... .       ...       ...                ...
  [13]     chr3      8-12      - |      <NA>       108  0.578096957644448
  [14]     chr1      8-17      - |        23      <NA>               <NA>
  [15]     chr2     11-20      - |        26      <NA>               <NA>
  [16]     chr2     12-21      - |        27      <NA>               <NA>
  [17]     chr3     15-24      - |        30      <NA>               <NA>
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
jacel commented 4 years ago

Hi, I was wondering if you've made progress on this? I'd really like to be able to do full joins in plyranges!

sa-lee commented 4 years ago

Hi, no sorry! I'm pretty busy with other projects at the moment, but will see if I have time over the next couple of months to implement it.

jacel commented 4 years ago

No problem! Thanks so much for creating plyranges in the first place. It's an awesome package, I use it all the time.

alexyfyf commented 4 years ago

I use this function to full_join GRanges, might not fit in everyone's case (or maybe everyone has a different definition full_join). I'm sure someone can make it better, so I would like to share here.

join_overlap_full <- function(x,y){
  options(stringsAsFactors = F)
  hit <- findOverlaps(x,y)
  colnames(mcols(x)) <- paste0(colnames(mcols(x)),".x")
  colnames(mcols(y)) <- paste0(colnames(mcols(y)),".y")
  cns.x <- mcols(x) %>% colnames()
  cns.y <- mcols(y) %>% colnames()

  z <- x[queryHits(hit),]
  hitz <- cbind(mcols(y)[subjectHits(hit),],
                mcols(x)[queryHits(hit),])
  colnames(hitz) <- c(cns.y,cns.x)
  mcols(z) <- hitz
  uniqex <- x[-queryHits(hit),]
  uniqey <- y[-subjectHits(hit),]
  mcols(uniqex)[, (length(cns.x)+1):(length(cns.x)+length(cns.y))] <- NA
  colnames(mcols(uniqex)) <- c(cns.x,cns.y)
  mcols(uniqex) <- mcols(uniqex)[,c((length(cns.x)+1):(length(cns.x)+length(cns.y)),1:(length(cns.x)))]
  mcols(uniqey)[, (length(cns.y)+1):(length(cns.x)+length(cns.y))] <- NA
  colnames(mcols(uniqey)) <- c(cns.y,cns.x)
  res <- c(z, uniqex,uniqey)
  mcols(res) <- mcols(res) %>% data.frame() %>% mutate_all(as.character())
  return(res)
}
sa-lee commented 4 years ago

This looks like it's on the right track! Take a look at https://github.com/sa-lee/plyranges/blob/ce998131510d1da7b3c9e47deba5946b5c33ab1d/R/ranges-overlap-joins-outer.R#L55-L86 which provides the template for doing the left outer join. You're basically adding an extra step to pull in the right hand side, i.e. only_right[subjectHits(hits)] <- FALSE. Happy for you to be added as a ctb if you want to put this in as PR with some unit tests :D

alexyfyf commented 4 years ago

Hi @sa-lee , I will have a try. So this overlaop_joins_outer is actually the function join_overlap_left? I didn't see any other function in this wrapper. Do you think I should add my join_overlap_full on top of this? Sounds a bit confusing.

egenomics commented 3 years ago

@sa-lee Have you resolved this issue? I don't understand if the function provided by @alexyfyf is valid or not.

alexyfyf commented 1 year ago

At least it is valid in the test case, again, I am not sure if everyone has the same definition of full_join GRanges.

markphillippebworth commented 1 year ago

This is a fantastic package, and it's one of the critical dependencies for my own package in CRAN. Having a full_join function would be good addition, and round out the 'join' operations.

The previous function doesn't sort out the metadata right, tacking on .x and .y labels, even if the metadata column name is unique to one group.

The other way you could do a full join for objects GR1 and GR2 is as follows: uniqueGR2 <- filter_by_non_overlaps(GR2, GR1) overlapGR<- join_overlap_left(GR1, GR2) bind_ranges(overlapGR, uniqueGR2)

This works once I've essentially pre-figured out which metadata columns I want to keep from each GRanges. Another way would be to add the ability to merge metadata if the column names are the same (i.e. sum, list, paste, etc..), or just add the .x and .y. Thanks!