Open sa-lee opened 7 years ago
I have renamed ojoin to be join_overlap_*
so it's consistent with the join_nearest/follow/precede
family and is differentiated from regular SQL join syntax.
Most of these have now been implemented except for
For find_overlaps
and the join_overlap
functions I think it'd be worth adding a directed
suffix if the users wants the overlap to be strand aware. At the moment I have called this anchored
but I think directed
is clearer.
Great. I agree with everything. I was just thinking today that the join_
prefix works with auto-complete and thus makes things more discoverable. Across the API, we should make sure that the operation category always comes first.
An update on these - I still need to implement the _includes and _directed versions of these functions. Should the joins act differently on GroupedRanges?
Are GroupedRanges just the result of group_by()
? If so, I would do whatever dplyr does. I think the behavior is the same, grouped or ungrouped, since there is no aggregation.
I was just looking at these in preparation for the talk, and I'm a bit confused. Here is what I expected:
join_overlap_inner()
: the equivalent of current join_overlap_left()
except it keeps the other ranges as a GRanges, instead of coercing to a data.frame. It could expand its mcols()
though, so just the bare GRanges is stored as the "granges.y" column or something.
join_overlap_left()
: this should be a left-outer join, i.e., the equivalent of bedtools intersect -loj
.
The equivalent of the plain bedtools intersect
requires two calls, one to the proposed join_overlap_inner()
and then something like pintersect()
. So maybe:
join_overlap_inner(x, y) %>% intersect(granges.y)
A couple of bothersome things:
intersect()
would quote its second argument and take it as a column name in the first argument.It may be that intersect()
would only be used in the situation of joining, which adds an auxiliary GRanges as column. We could have an intersect_right()
that just does that as a convenience:
join_overlap_inner(x, y) %>% intersect_right()
but intersect_right()
would need to somehow know the suffix of the join to find the "right" column.
Actually we should probably review GMQL: http://www.bioinformatics.deib.polimi.it/genomic_computing/GMQL/doc/GMQL_Complete_Documentation.pdf
In section 3.4.3, they layout a full-featured genomic join language. Some observations:
PROJECT_
constructors are a bit strange, because they do not carry over any metadata from the other (just the assay data, I guess). I think the group_by_overlaps()
functionality might cover this case, because typically if you want the same set of left (or right) ranges, there's a need to aggregate multiple hits somehow._DISTINCT
constructors seem like they could happen downstream of the join.DISTANCE < 200
), which suggests the need for a maxgap
parameter or similar.So I guess this motivates a join_overlap_intersect()
that does what join_overlap_inner()
does now.
Just to clarify:
join
methods return auxilliary granges columns when appropriate? Currently this is how nearest/precedes/follow
methods works.join_overlap_inner
as the equivalent to bedtools intersect
but agree that just calling it join_overlap_intersect()
makes it clearer.maxgap
argument that is passed down the IRanges/GenomicRanges methods.I'll get around to updating join_overlaps_left
later today.
The only downside to adding the other GRanges is that outer joins will require some sort of NA value. For GRanges, I think we could encode this as a zero-width range [0,-1], because there can't be negative chromosome coordinates.
Most use cases just want to get the metadata from the other GRanges. Some want to combine the ranges in some way, like the intersect case. Other combinations might be extent (min start to max end), union, setdiff, maybe others. But except for extent those could yield multiple ranges when ranges don't overlap, like with maxgap > 0
, or always for setdiff.
Overlap joining is weird, because the key is not an identical match. In the typical join, the key column is unified. The question is, given that the key column is not unified, is the "other key" useful, or are we just interested in the other columns? Maybe we could leave off the other ranges for outer joins and see where things break down.
I've had a go of updating this now. See the latest intro vignette.
Looks great, thanks for taking care of that so quickly.
The lack of support for NA in GRanges and IRanges may cause more trouble than it is worth. The case could be made that the ranges are the key column, and the key column should be unified somehow, whether by intersect or just picking one of them. That appears to be what GMQL does. But bedtools intersect -wa -wb
is a counter-example. They introduce a new format, BEDPE, to represent that.
We may need to make the joins generate a single set of ranges, with none in the mcols()
, based on the unified key principle. Then introduce a new verb, maybe pair_
that includes the other ranges, with "inner" behavior, like
pair_overlap_left()
: Behaves like join_overlap_inner()
now.
pair_overlap_right()
: Switches left and right.
That would get us everything except for bedtools intersect -loj
but we are limited by the infrastructure, and I'm not sure there's a good use case for generating NA ranges like that.
To summarise our chat today:
pair
method should return a left and right ranges column as a DataFrame with no metadata. Note that 1 and 2 also encapsulate the 'nearest/follow/precedes' verbs.pair_(verb)_left/right
For 2, what do you mean by no metadata? I think the metadata on the individual GRanges should be preserved.
Yep that's a typo on my part
These are finished now, I've implemented the basic pair_
methods.
I just need to write some tests for the joins, especially the outer join method.
Re pairing up it would be useful to have methods to allow for the following:
chop_
methodsjust a note to myself to add in within_directed methods
Implement on ranges objects.
Some relevant literature from database theory: GenAp a distributed SQL for genomics using spark
review paper on interval joins
book on temporal databases - in general it would be useful to learn more database theory for this project
slide deck on temporal databases from theory of databases class at Warwick. The algebra covers most use cases encountered in GenomicRanges:
And some new verbs that could be useful such as collapse, expand, pack/unpack
dplyr-like API for joins . not sure the operator
ojoin
is clearest syntax here. Limiting directions to left and inner for overlap joins. Generally we think that the overlapping start/end/equal are not particularly useful.left_join
,right_join
full_join
,inner_join
etc.dplyr
find_overlaps
group_by_overlaps
count_overlaps
ojoin_self
ojoin_inner
/ojoin_inner
*_within
*_includes
join_nearest
/join_nearest_left/right
/join_nearest_up/downstream
join_precede
/join_precede_left/right
/join_precede_up/downstream
join_follow
/join_follow_left/right
/join_follow_up/downstream