tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/
137 stars 19 forks source link

joins (migrated from query) #5

Open sa-lee opened 7 years ago

sa-lee commented 7 years ago

Implement on ranges objects.


Some relevant literature from database theory: GenAp a distributed SQL for genomics using spark

review paper on interval joins

book on temporal databases - in general it would be useful to learn more database theory for this project

slide deck on temporal databases from theory of databases class at Warwick. The algebra covers most use cases encountered in GenomicRanges:

image

And some new verbs that could be useful such as collapse, expand, pack/unpack

image

image


dplyr-like API for joins . not sure the operator ojoin is clearest syntax here. Limiting directions to left and inner for overlap joins. Generally we think that the overlapping start/end/equal are not particularly useful.

join type description targets
left_join, right_join full_join, inner_join etc. usual SQL like joins in dplyr acts only on mcols slot of Ranges
find_overlaps findOverlaps but acts as user expects with strand, takes into account grouping GenomicRanges/Ranges
group_by_overlaps groups matches by the query hit GenomicRanges/Ranges
count_overlaps convenience function since this is a common operations GenomicRanges/Ranges
ojoin_self findOverlapPairs on a single ranges object GenomicRanges/Ranges
ojoin_inner /ojoin_inner overlaps operator described above. Implemented via findOverlapsPairs(type = "any") + pairs operator such as union/intersect. Can match on an additional key which makes it slightly different to find_overlaps. GenomicRanges/Ranges
*_within same functions as above but with type = "within" GenomicRanges/Ranges
*_includes inverse of within functions GenomicRanges/Ranges
join_nearest/join_nearest_left/right/join_nearest_up/downstream nearest analogue GenomicRanges/Ranges
join_precede/join_precede_left/right/join_precede_up/downstream precede analogue GenomicRanges/Ranges
join_follow/join_follow_left/right/join_follow_up/downstream follow analogue GenomicRanges/Ranges
sa-lee commented 6 years ago

I have renamed ojoin to be join_overlap_* so it's consistent with the join_nearest/follow/precede family and is differentiated from regular SQL join syntax.

Most of these have now been implemented except for

For find_overlaps and the join_overlap functions I think it'd be worth adding a directed suffix if the users wants the overlap to be strand aware. At the moment I have called this anchored but I think directed is clearer.

lawremi commented 6 years ago

Great. I agree with everything. I was just thinking today that the join_ prefix works with auto-complete and thus makes things more discoverable. Across the API, we should make sure that the operation category always comes first.

sa-lee commented 6 years ago

An update on these - I still need to implement the _includes and _directed versions of these functions. Should the joins act differently on GroupedRanges?

lawremi commented 6 years ago

Are GroupedRanges just the result of group_by()? If so, I would do whatever dplyr does. I think the behavior is the same, grouped or ungrouped, since there is no aggregation.

lawremi commented 6 years ago

I was just looking at these in preparation for the talk, and I'm a bit confused. Here is what I expected:

join_overlap_inner(): the equivalent of current join_overlap_left() except it keeps the other ranges as a GRanges, instead of coercing to a data.frame. It could expand its mcols() though, so just the bare GRanges is stored as the "granges.y" column or something.

join_overlap_left(): this should be a left-outer join, i.e., the equivalent of bedtools intersect -loj.

The equivalent of the plain bedtools intersect requires two calls, one to the proposed join_overlap_inner() and then something like pintersect(). So maybe:

join_overlap_inner(x, y) %>% intersect(granges.y)

A couple of bothersome things:

  1. Using mangled, auto-generated column names is ugly. This just comes with joining though.
  2. It is weird how intersect() would quote its second argument and take it as a column name in the first argument.

It may be that intersect() would only be used in the situation of joining, which adds an auxiliary GRanges as column. We could have an intersect_right() that just does that as a convenience:

join_overlap_inner(x, y) %>% intersect_right()

but intersect_right() would need to somehow know the suffix of the join to find the "right" column.

lawremi commented 6 years ago

Actually we should probably review GMQL: http://www.bioinformatics.deib.polimi.it/genomic_computing/GMQL/doc/GMQL_Complete_Documentation.pdf

In section 3.4.3, they layout a full-featured genomic join language. Some observations:

  1. Seems like we should support the naming of additional metadata columns to use in the join predicate. In this way, the overlap joins are just extensions of existing (dplyr) joins.
  2. They introduce the notion of a "region constructor", which in the simple case can be either left, right, intersect, or union (I've renamed some of them). This directly informs this discussion.
  3. The PROJECT_ constructors are a bit strange, because they do not carry over any metadata from the other (just the assay data, I guess). I think the group_by_overlaps() functionality might cover this case, because typically if you want the same set of left (or right) ranges, there's a need to aggregate multiple hits somehow.
  4. The _DISTINCT constructors seem like they could happen downstream of the join.
  5. It sounds like if there is no constructor, they just create a table, or equivalently a data.frame (or I guess tibble) containing two GRanges columns, and all of the metadata.
  6. The genomic predicates basically correspond to overlap, nearest, precedes and follows, but they allow for a window (e.g., DISTANCE < 200), which suggests the need for a maxgap parameter or similar.

So I guess this motivates a join_overlap_intersect() that does what join_overlap_inner() does now.

sa-lee commented 6 years ago

Just to clarify:

I'll get around to updating join_overlaps_left later today.

lawremi commented 6 years ago

The only downside to adding the other GRanges is that outer joins will require some sort of NA value. For GRanges, I think we could encode this as a zero-width range [0,-1], because there can't be negative chromosome coordinates.

Most use cases just want to get the metadata from the other GRanges. Some want to combine the ranges in some way, like the intersect case. Other combinations might be extent (min start to max end), union, setdiff, maybe others. But except for extent those could yield multiple ranges when ranges don't overlap, like with maxgap > 0, or always for setdiff.

Overlap joining is weird, because the key is not an identical match. In the typical join, the key column is unified. The question is, given that the key column is not unified, is the "other key" useful, or are we just interested in the other columns? Maybe we could leave off the other ranges for outer joins and see where things break down.

sa-lee commented 6 years ago

I've had a go of updating this now. See the latest intro vignette.

lawremi commented 6 years ago

Looks great, thanks for taking care of that so quickly.

The lack of support for NA in GRanges and IRanges may cause more trouble than it is worth. The case could be made that the ranges are the key column, and the key column should be unified somehow, whether by intersect or just picking one of them. That appears to be what GMQL does. But bedtools intersect -wa -wb is a counter-example. They introduce a new format, BEDPE, to represent that.

We may need to make the joins generate a single set of ranges, with none in the mcols(), based on the unified key principle. Then introduce a new verb, maybe pair_ that includes the other ranges, with "inner" behavior, like pair_overlap_left(): Behaves like join_overlap_inner() now. pair_overlap_right(): Switches left and right.

That would get us everything except for bedtools intersect -loj but we are limited by the infrastructure, and I'm not sure there's a good use case for generating NA ranges like that.

sa-lee commented 6 years ago

To summarise our chat today:

  1. All joins should just return the metadata from the right hand side ranges, no ranges column.
  2. pair method should return a left and right ranges column as a DataFrame with no metadata. Note that 1 and 2 also encapsulate the 'nearest/follow/precedes' verbs.
  3. optionally implement pair_(verb)_left/right
lawremi commented 6 years ago

For 2, what do you mean by no metadata? I think the metadata on the individual GRanges should be preserved.

sa-lee commented 6 years ago

Yep that's a typo on my part

sa-lee commented 6 years ago

These are finished now, I've implemented the basic pair_ methods. I just need to write some tests for the joins, especially the outer join method.

sa-lee commented 6 years ago

Re pairing up it would be useful to have methods to allow for the following:

sa-lee commented 6 years ago

just a note to myself to add in within_directed methods