pyranges / ncls

The Nested Containment List for Python. Basically a static interval-tree that is silly fast for both construction and lookups.
BSD 3-Clause "New" or "Revised" License
213 stars 20 forks source link

Inconsistencies in overlap behavior? #14

Open rvanheusden opened 5 years ago

rvanheusden commented 5 years ago

Hi there,

I was doing some testing and it seems that there are some inconsistencies in how NCLS defines overlaps?

Given the following code snippet...

import numpy as np
from ncls import NCLS

starts = np.array([0, 20])
ends = np.array([10, 30])
ids = np.array([0, 1])

tree = NCLS(starts, ends, ids)

...the following results seem inconsistent:

tree.has_overlap(-1, 0)
>>> True

tree.first_overlap_both(np.array([-1]), np.array([0]), np.array([0]))
>>> (array([], dtype=int64), array([], dtype=int64)) # no results

As shown above, though -1, 0 is reported as having an overlap (using has_overlap), the same query interval fails to find an overlapping interval in the tree (using first_overlap_both). There also seems to be some inconsistencies regarding, perhaps, the treatment of intervals containing 0?

Given this snippet...

tree.has_overlap(-1, 0)
>>> True

tree.has_overlap(19, 20)
>>> False

...it seems odd that -1, 0 matches 0, 10 (presumably) while 19, 20 doesn't match 20, 30.

endrebak commented 5 years ago

Thanks for reporting. NCLS is not meant to be used with negative positions though. What is your use-case for that?

I guess I should add a check to see that no negative numbers are used. 19, 20 does not match 20, 30 because ends are non-inclusive in Python :)

rvanheusden commented 5 years ago

Yeah, it seems that -1, 0 shouldn't be matching any intervals that start with 0.

I don't actually have a use-case for negative numbers; I'm using NCLS for biological purposes, as it seems it was somewhat intended for. However, this inconsistency simply came up in my testing. I reported it because I thought it was rather strange.

I would suggest that regardless of whether negative numbers are supported or not, has_overlap and any other overlap methods should have consistent determination of what is an overlap and what isn't.

endrebak commented 5 years ago

I completely agree and I am happy you reported it. I actually only use the vectorized functions because they are so much faster.

rvanheusden commented 5 years ago

Huh, I guess maybe I should be using first_overlap_both and then just checking for results in the interest of speed. Although, that doesn't lend itself too well to readability.

endrebak commented 5 years ago

Yeah, if you have lots of queries represented with starts and ends in a vector, that is faster :)

On Wednesday, July 3, 2019, Rohan Vanheusden notifications@github.com wrote:

Huh, I guess maybe I should be using first_overlap_both and then just checking for results in the interest of speed. Although, that doesn't lend itself too well to readability.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hunt-genes/ncls/issues/14?email_source=notifications&email_token=AEHURUSRMSGJNKRZRGNPTZDP5TPWPA5CNFSM4H4V22X2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZFFPWY#issuecomment-508188635, or mute the thread https://github.com/notifications/unsubscribe-auth/AEHURUTMKASSQUOTU5MCCLTP5TPWPANCNFSM4H4V22XQ .