tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/

[FEATURE REQUEST] Join overlap by ranges as well as metadata #107

Open shalvichirmade opened 7 months ago

shalvichirmade commented 7 months ago

Feature request: add an argument to the join_overlap_intersect function that additionally restricts the overlap join to rows whose metadata values match.

For some context, here is a hypothetical example: I have two GRanges objects, one for introns and one for transcripts.

## intron GRanges
intron
# GRanges object with 1 range and 2 metadata columns:
#     seqnames              ranges strand |     type     transcript_id
#        <Rle>           <IRanges>  <Rle> | <factor>       <character>
#            1 100149098-100152384      - |   intron ENST00000370137.6

## transcript GRanges
trans
# GRanges object with 2 ranges and 2 metadata columns:
#     seqnames              ranges strand |   transcript_name   gene_name
#        <Rle>           <IRanges>  <Rle> |       <character> <character>
#            1 100148448-100178256      - | ENST00000370137.6      LRRC39
#            1 100133163-100150496      - | ENST00000370141.8      TRMT13

I want to join these GRanges objects so I can annotate the intron GRanges with gene_name metadata.

However, when I use join_overlap_left, the intron range overlaps both rows of trans, so the result contains two rows.

intron <- join_overlap_left(intron, trans)
intron
# GRanges object with 2 ranges and 4 metadata columns:
#     seqnames              ranges strand |     type     transcript_id   transcript_name   gene_name
#        <Rle>           <IRanges>  <Rle> | <factor>       <character>       <character> <character>
#            1 100149098-100152384      - |   intron ENST00000370137.6 ENST00000370137.6      LRRC39
#            1 100149098-100152384      - |   intron ENST00000370137.6 ENST00000370141.8      TRMT13

The desired output would keep only the trans row where trans$transcript_name == "ENST00000370137.6", i.e. the row whose transcript_name matches the intron's transcript_id.

Here, the overlap should be based on the ranges as well as on the metadata columns (intron$transcript_id matching trans$transcript_name).
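Until such an argument exists, one possible workaround is to filter the joined result on the metadata columns. The sketch below rebuilds the two example objects from the values shown above; the data.frame construction and the object name annotated are illustrative and not part of the original report. It assumes that matching transcript_id to transcript_name is the intended rule.

```r
library(plyranges)

# Rebuild the hypothetical example objects from the printed values.
intron <- data.frame(
  seqnames = "1", start = 100149098, end = 100152384, strand = "-",
  type = factor("intron"), transcript_id = "ENST00000370137.6"
) |> as_granges()

trans <- data.frame(
  seqnames = "1",
  start = c(100148448, 100133163),
  end = c(100178256, 100150496),
  strand = "-",
  transcript_name = c("ENST00000370137.6", "ENST00000370141.8"),
  gene_name = c("LRRC39", "TRMT13")
) |> as_granges()

# Join on range overlap first, then keep only the rows whose
# metadata also agree (assumed rule: transcript_id == transcript_name).
annotated <- intron |>
  join_overlap_left(trans) |>
  filter(transcript_id == transcript_name)

annotated
```

Note that the filter step also drops any intron whose overlapping transcripts all fail the metadata match (or that has no overlap at all), so this behaves more like an inner join than a left join. A built-in argument could preserve those unmatched rows.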

R session information


options(width = 120)
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Sonoma 14.3
 system   x86_64, darwin20
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Toronto
 date     2024-02-08
 rstudio  2023.06.1+524 Mountain Hydrangea (desktop)
 pandoc   NA