invenia / Intervals.jl

Non-iterable ranges
MIT License
35 stars 18 forks source link

Multi-interval set operators #179

Closed haberdashPI closed 2 years ago

haberdashPI commented 3 years ago

This defines a set of functions for easily computing intersections, unions etc... of a set of intervals (passed as an array).

cc: @omus, @ericphanson, @kleinschmidt

This is based off of a closed PR to TimeSpans (https://github.com/beacon-biosignals/TimeSpans.jl/pull/11)

omus commented 2 years ago

Closing/re-opening to trigger CI

haberdashPI commented 2 years ago

I'll try to dive deep into this tomorrow. I'll mention the tests for this package a pretty extensive so be sure to lean on them

I'm still working out a few bits with the larger unit test (outside the tests for sets). Might make sense to wait to review until that happens. I'll mark it as draft again and mark it ready for review when that's resolved.

omus commented 2 years ago

I'm still working out a few bits with the larger unit test (outside the tests for sets). Might make sense to wait to review until that happens. I'll mark it as draft again and mark it ready for review when that's resolved.

Thanks for letting me know. I'll try to keep an eye on this as I think some early course correction could save a bunch of work.

codecov[bot] commented 2 years ago

Codecov Report

Merging #179 (525d3ae) into master (bb193ee) will increase coverage by 2.90%. The diff coverage is 95.76%.

@@            Coverage Diff             @@
##           master     #179      +/-   ##
==========================================
+ Coverage   81.73%   84.63%   +2.90%     
==========================================
  Files          11       12       +1     
  Lines         624      794     +170     
==========================================
+ Hits          510      672     +162     
- Misses        114      122       +8     
Impacted Files Coverage Δ
src/Intervals.jl 100.00% <ø> (ø)
src/endpoint.jl 98.11% <ø> (ø)
src/interval.jl 96.01% <ø> (-0.35%) :arrow_down:
src/interval_sets.jl 95.76% <95.76%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update bb193ee...525d3ae. Read the comment docs.

haberdashPI commented 2 years ago

Looks like there are a few bits I need to add test coverage for, but it should be in good shape to review.

haberdashPI commented 2 years ago

@omus: Can you help me understand the remaining errors I'm getting in "Julia 1" (I think that is 1.7). It looks like there's something broken about printing that is unrelated to the changes I've made.

In other news, I think this is otherwise ready for review.

omus commented 2 years ago

@omus: Can you help me understand the remaining errors I'm getting in "Julia 1" (I think that is 1.7). It looks like there's something broken about printing that is unrelated to the changes I've made.

There was an unrelated CI failure for "Julia 1" so I restarted the CI jobs

In other news, I think this is otherwise ready for review.

Thanks for letting me know. I don't quite have time to review such a big PR this week but I'll make time next week.

haberdashPI commented 2 years ago

I don't quite have time to review such a big PR this week but I'll make time next week.

Thanks! I understand, it's a bit large. I don't think there's any great urgency on my end; I'm making use of the branch directly where I need it in some analyses

haberdashPI commented 2 years ago

Just a small bump here: just wanted to check in about when a review would be possible and/or if I should do something to make reviewing this easier (break it up?).

omus commented 2 years ago

@haberdashPI breaking up a PR would make it easier to review but it may not be necessary. I've been rather swamped lately but I did have this on the agenda for next week. Sorry about the delay on review

haberdashPI commented 2 years ago

Here's another round of changes!

I ended up not removing the code to handle arbitrary endpoints. If you think I should really remove it, I can make that change. The infrastructure for EndpointTracking turned out to be super handy in resolving how exactly how to use LeftEndpoint and RightEndpoint rather than SimpleEndpoint and made determining return types easier (so at least some of the EndpointTracking is useful regardless).

haberdashPI commented 2 years ago

@omus: I'm hoping to get this merged soon, any chance you can take a look at this in the coming week?

haberdashPI commented 2 years ago

Thanks for the feedback! I'll see how CI does with those changes and then fix any issues it finds.

The one unresolved point is about Base.in. I'm not strongly for it, but what do you think of my point about it being consistent with the other methods? If you still think it should be removed, I'll delete it; otherwise I can document the behavior.

omus commented 2 years ago

The one unresolved point is about Base.in. I'm not strongly for it, but what do you think of my point about it being consistent with the other methods? If you still think it should be removed, I'll delete it; otherwise I can document the behavior.

We've resolve this. The end result is to use issubset for this behaviour.

I spent a little time with this PR over the weekend and came to a conclusions:

  1. The tests added for sets here do not cover the common corner cases for intervals especially with bounds. Extending the comparison tests should be most of what we need to do.
  2. Set methods returning non-booleans should all return vectors of intervals. We need to gracefully deprecate the current behaviour (#189) and have this PR not make things worse

Further detail on point 1:

I've implemented the comparison tests in https://github.com/invenia/Intervals.jl/pull/190. Doing so has revealed that the set code implemented here doesn't work correctly when dealing with interval endpoints using the same values. One such example:

julia> union([Interval{Closed,Closed}(1, 3)], [Interval{Open, Open}(3, 5)])
2-element Vector{Interval{Int64}}:
 Interval{Int64, Closed, Closed}(1, 3)
 Interval{Int64, Open, Open}(3, 5)

Other failures can be found by looking for @test_broken in tests/comparisons.jl

Some more detail on point 2:

The original set methods added to Intervals.jl were treating interval instances as sets themselves. Specifically the intersect(::Interval, ::Interval) method would return an interval. In this case the logic is sound but the other set methods can return multiple intervals under certain conditions. Because the set functions should be consistent in their return arguments advocate for the following:

@haberdashPI to wrap up this PR we need to:

haberdashPI commented 2 years ago

Thanks again @omus for all your help with this. I'm now looking into dealing with the case where the intervals touch at the endpoints.

Placing these here, since I can't edit your comments:

haberdashPI commented 2 years ago

Thinking a little more deeply on these corner cases, I have a bit of a philosophical conundrum:

If we treat the returned value as a representation of the set of all points that would be included in the implemented set operation, there are actually an infinite number of such representations. That is, in pseudo code we have

issetequal({[0,1), [1, 2]}, {[0, 2]})

(I'm using {} to represent arrays above, and [ and ( to represent the open/closed distinction)

You can divide any interval into an array of as many subintervals and they remain setequal, according to this representation. The question then becomes, which setequal value is most reasonable? An obvious choice would be the smallest such representation.

However, your tests assume that union([0,1), [2, 3]) returns [0, 3], while symdiff([0,1), [2,3]) returns {[0, 1), [2,3]}, even though these two results are setequal. Arguably the right answer in both cases is [0, 3]

I think your test is the expected/more useful behavior. But it is a somewhat conceptually odd choice...

I could just special case this one situation, but I think it gets weirder when you start thinking about multiple intervals as the input: when should you and when should you not reduce two abutting intervals into one interval across different multi-interval set operations?

I'm currently trying to think through whether there is a systematic reason why intuition for these two examples differs, and how/whether that makes sense to implement.

haberdashPI commented 2 years ago

Thinking through this more:

I think it is reasonable to be replace the x == y test for the return values of symdiff with the following

issetequal(x, y) && length(x) <= length(y)

Because the two outputs symdiff could return (two vs. one interval representation) are setequal, they will produce the same outcome for many computations (just with a smaller N). Any operation that expects the length to be 2 sounds pretty fragile to me, since in general one can't rely on a particular cardinality of the returned value.

haberdashPI commented 2 years ago

I think this test on line 594 of comparisions.jl is not quite right:

            expected_xor = [Interval{Closed, Open}(first(a), first(a))]

That interval isempty but, e.g. !isempty(setdiff([1, 5) , (1, 5))) (which is right, since the singular value at 1 should be included).

There are a number of similar issues in subsequent lines. I've changed these to always use closed bounds on both sides, where appropriate.

haberdashPI commented 2 years ago

Alright! I think that addresses the outstanding issues.

The one open question is whether you buy my arguments above @omus for switching some of the tests to use issetequal

omus commented 2 years ago

However, your tests assume that union([0,1), [2, 3]) returns [0, 3]

Which tests assume this? This would definitely be incorrect. The code at the moment only should combine these intervals if you do a union of [0, 2) and [2, 3]. In your particular example 1 is not included so these wouldn't be combined. Even in the case where we union do [0, 1] and [2, 3] these would not merge these intervals into one as although no integer exists between 1 and 2 detecting this for various types (e.g. Float64) is challenging so we treat these as disjoint.

I think this test on line 594 of comparisions.jl is not quite right... There are a number of similar issues in subsequent lines. I've changed these to always use closed bounds on both sides, where appropriate.

You are definitely correct in I made a mistake for this case. I'll review the other corrections

haberdashPI commented 2 years ago

However, your tests assume that union([0,1), [2, 3]) returns [0, 3]

Which tests assume this? This would definitely be incorrect

My bad: I meant something different

However, your tests assume that union([0,1), [1, 2]) returns [0, 2], while symdiff([0,1), [1, 2]) returns {[0, 1), [1,2]}, even though these two results are setequal. Arguably the right answer in both cases is [0, 2]

omus commented 2 years ago

However, your tests assume that union([0,1), [1, 2]) returns [0, 2], while symdiff([0,1), [1, 2]) returns {[0, 1), [1,2]}, even though these two results are setequal. Arguably the right answer in both cases is [0, 2]

Thanks for the clarification. I agree with you that both answers are valid but I think for our implementation we should ideally return an interval set where all intervals in the returned vector are disjoint.

In the future I can see us making an IntervalSet collection which works like Set but makes use of the special properties intervals provide. For this collection I'd want it to only include disjoint intervals.

As the current state of this PR returns the ideal interval set I'm inclined to update the comparison tests to use == rather than issetequal as this implicitly captures the length aspect of the tests.

haberdashPI commented 2 years ago

Alright, placed some unions on the appropriate expected_xor's and that seems to have cleared up the remaining failed tests.

omus commented 2 years ago

I'll wait until Monday to squash and merge this PR. This should allow anyone who was waiting for the dust to settle on this PR the chance to do a final review before we merge this and release version 1.7.

haberdashPI commented 2 years ago

I do think there are some internal improvements to be made here yet

Agreed! I have had a few thoughts in that respect, that I've been holding back on until this is merged.

omus commented 2 years ago

Registration PR: https://github.com/JuliaRegistries/General/pull/61415

rofinn commented 2 years ago

NOTE: This release actively breaks our code base which relies heavily on the current union and intersect behaviour. Since this PR intentionally changes the behaviour of those methods operating on existing types (ie: vector of intervals) I'm rolling this change back and it should be made as a breaking change. @omus If you're okay with that I'll plan to just yank this release from the registry while try to move these changes into a 2.0 release.

rofinn commented 2 years ago

To clarify the specific change that's breaking is that you could previously perform a normal Base.intersect over a vector of intervals, like so.

julia> intersect([1..2, 2..3, 3..4, 4..5], [2..3, 3..4])
2-element Vector{Interval{Int64, Closed, Closed}}:
 Interval{Int64, Closed, Closed}(2, 3)
 Interval{Int64, Closed, Closed}(3, 4)

and this would behave as expected

Now when you make the same call you get

julia> intersect([1..2, 2..3, 3..4, 4..5], [2..3, 3..4])
1-element Vector{Interval{Int64, Closed, Closed}}:
 Interval{Int64, Closed, Closed}(2, 4)

To clarify, the latter is probably more correct, but since we don't distinguish an IntervalSet from a Vector{<:Interval} there is no way for us to distinguish this ambiguous case. This become particularly problematic when you start dealing with vectors of HourEnding intervals which have very different behaviour depending on whether we're using a regular date or interval type.

Assuming we don't want to dedicate a type for it then I think we should make these separate functionBase.intersect and Intervals.intersect. This allows the end user to decide what behaviour they want.

haberdashPI commented 2 years ago

Hi @rofinn,

First off, sounds like this was pretty disruptive; so sorry for that. A few thoughts on how to move forward:

but since we don't distinguish an IntervalSet from a Vector{<:Interval} there is no way for us to distinguish this ambiguous case

I think this is not quite right. You can pass the objects as Sets to get the old behavior: e.g.

julia> intersect(Set([1..2, 2..3, 3..4, 4..5]), Set([2..3, 3..4]))
Set{Interval{Int64, Closed, Closed}} with 2 elements:
  Interval{Int64, Closed, Closed}(2, 3)
  Interval{Int64, Closed, Closed}(3, 4)

So perhaps, in concept, what is needed here is a deprecation like this?

@deprecate intersect(a::AbstractArray, b::AbstractArray) intersect(Set(a), Set(b))

I guess a potential issue there would be that there might be a performance hit for first converting to Set objects?

In the meantime, perhaps the new behavior can be made available by creating a simple version of the IntervalSet object and dispatching Base.intersect (and friends) over that.

rofinn commented 2 years ago

Yeah, I think at minimum we should either define a minimal IntervalSet or just use the package namespace to distinguish the behaviour. We can always talk about adding a deprecation in a future minor release, but would need to leave making this the default behaviour with the base function until a major release. I'll revert this MR and tag to minimize impact to the ecosystem. I'll then start a new MR with your changes where we can flush out how to make it non-breaking, but still usable for your use cases.

haberdashPI commented 2 years ago

Also, mostly as a note to myself: would be good to add some tests in here for your use case as well.

omus commented 2 years ago

@omus If you're okay with that I'll plan to just yank this release from the registry while try to move these changes into a 2.0 release.

Very reasonable. I apologize for missing this breaking behaviour change. Sounds like we have some additional tests that need to be added to the Intervals.jl test to ensure we don't break expected this behaviour again. Is there a better way to communicate when large changes like this are going to be released? Seems like the typical people-who-care-are-watching isn't working

omus commented 2 years ago

Yeah, I think at minimum we should either define a minimal IntervalSet or just use the package namespace to distinguish the behaviour.

I'm in favour of making a minimal IntervalSet type

rofinn commented 2 years ago

would be good to add some tests in here for your use case as well

Yeah, this is a bit of a weird edge case where we'd need to test for a behaviour that wasn't defined here. I've become rather skittish of extending functions if there's concern that the new method doesn't align with the original concept/definition in Base.

Seems like the typical people-who-care-are-watching isn't working

Yeah, I don't think anyone was ever assigned to keep an eye on this repo after you left. I guess for now it'd probably be best to just assign myself or @fchorney to larger reviews for now as we're the next biggest contributors. I'll raise this maintenance concern more broadly to see if there's a more sustainable option.

I'm in favour of making a minimal IntervalSet type

I was hoping that namespacing would be the simpler solution, but given how many things we already extend in base I think you're probably correct. Moving things from the Base to Intervals namespaces would likely be more breaking.