Closed smped closed 4 years ago
Hi Steve,
That does seem odd! I’ll take a look at this in more detail next week. In general the docs for these functions could use a bit of work.
Have you tried -join_follow()
and friends they might be useful too?
Thanks Stuart
Thanks Stuart.
join_follow/join_follow_left()
do work, so that might be a possible fix. We're trying to migrate a bedtools workflow to plyranges, so I'll compare results across the two methods once my collaborator is back in with the data from the original run.
Cheers,
Steve
Hi Steve,
Glad to hear that follow
works. Looking further I'm not so sure if this is a bug, but maybe the documentation is confusing.
plyranges uses IRanges::nearest()
under the hood here, with select = "all"
and ignore.strand = TRUE used by default. From the nearest
docs:
The conventional nearest neighbor finder. Returns an integer vector containing the index of the nearest neighbor range in subject for each range in x. If there is no nearest neighbor (if subject is empty), NA's are returned.
In your example, running nearest(a,b)
will always return a vector or hits object parallel to the input (i.e. length 1) and in your example the distance of both ranges in b to those in a is 8, so nearest selects the one with the largest index value.
IRanges::nearest(a,b, ignore.strand = TRUE, select = "all")
Hits object with 1 hit and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 2
-------
queryLength: 1 / subjectLength: 2
IRanges::nearest(a, b, ignore.strand = FALSE, select = "all")
Hits object with 1 hit and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 2
-------
queryLength: 1 / subjectLength: 2
Hence the identical results for both upstream/right. I realise here that the docs are probably not clear on this matter but hopefully that helps!
Hi Stuart -- Thanks for the great package. I was wondering if it was possible to find the k-nearest neighbors as opposed to the single nearest. For example, I'm interested in tagging ATAC peaks with the 5 nearest genes. I've opened up an issue on GenomicRanges()
regarding an unexported KNN function and was wondering if you had any insight.
The functions below seem to work perfectly for k=1 nearest neighbor, but I'd like to extend this to k>1.
> IRanges::nearest(peaks, tss, ignore.strand = FALSE, select = "all") # k = 1; nearest peak to loci
Hits object with 295913 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 15215
[2] 2 15215
[3] 3 15215
[4] 4 15215
[5] 5 15215
... ... ...
[295909] 295640 16535
[295910] 295641 16535
[295911] 295642 16535
[295912] 295643 16535
[295913] 295644 16535
-------
queryLength: 295644 / subjectLength: 18436
> GenomicRanges::distanceToNearest(peaks, tss, select = "all"))# k = 1; nearest peak to loci
Hits object with 295913 hits and 1 metadata column:
queryHits subjectHits | distance
<integer> <integer> | <integer>
[1] 1 15215 | 107265
[2] 2 15215 | 107065
[3] 3 15215 | 106865
[4] 4 15215 | 106665
[5] 5 15215 | 106465
... ... ... . ...
[295909] 295640 16535 | 42858
[295910] 295641 16535 | 43058
[295911] 295642 16535 | 43258
[295912] 295643 16535 | 43458
[295913] 295644 16535 | 43658
-------
queryLength: 295644 / subjectLength: 18436
Any help would be much appreciated!
BTW - It's Shan. We interned together at Genentech. I hope things are going well for you!
Hi Shan,
Nice to hear from you, hope you're doing well! One strategy you could try in this particular case is to expand on the transcription start site, then use the left overlap join + an aggregation. Take a look a here for an example of this.
also @shansabri - if you like could you please make another issue for this? that way i can get to implementing it at some point (but also PRs welcome ;P)
expand on the transcription start site, then use the left overlap join + an aggregation
Hmm this is still unclear to me even after looking at the example. I should clarify that I'm able to compute the overlap over a range, but that's not really what I'm aiming to do here.
From the example you posted:
For any gene that has no overlaps, the DA peak columns will have NA’s
I'd like to take the, say 3, nearest genes to a given position (even if they are very far) while also returning their corresponding distances.
Let's continue discussion over here.
Thanks for getting back!
No problem, I think the example is going the other way around from what you want as well.
@steveped I'm assuming you managed to resolve this, so closing the issue now.
Thanks @sa-lee. Passed the conversation to my confused collaborator & they took it from here. Really appreciated the help too.
Hi Stuart,
Love the package (obviously), but I'm having some dramas with
join_nearest*()
. In my following example, I need to join feature1 with gene1. The gene is upstream, on the opposite strand to my feature.Given that gene1 is on the opposite strand to feature 1,
join_nearest_upstream()
andjoin_nearest_downstream()
behave exactly how I expect.However, if strand information is ignored by
join_nearest_left()
andjoin_nearest_right()
, I cannot see how the following is able to be explained.Is the strand information only ignored for one of the ranges (x = a, y = b)?. If I manually remove the strand information, I still cannot join these ranges.
It seems that feature1 & gene1 are unable to be joined by any method here. Should I be using something else?
Thanks,
Steve