Closed haberdashPI closed 2 years ago
Closing/re-opening to trigger CI
I'll try to dive deep into this tomorrow. I'll mention the tests for this package a pretty extensive so be sure to lean on them
I'm still working out a few bits with the larger unit test (outside the tests for sets). Might make sense to wait to review until that happens. I'll mark it as draft again and mark it ready for review when that's resolved.
I'm still working out a few bits with the larger unit test (outside the tests for sets). Might make sense to wait to review until that happens. I'll mark it as draft again and mark it ready for review when that's resolved.
Thanks for letting me know. I'll try to keep an eye on this as I think some early course correction could save a bunch of work.
Merging #179 (525d3ae) into master (bb193ee) will increase coverage by
2.90%
. The diff coverage is95.76%
.
@@ Coverage Diff @@
## master #179 +/- ##
==========================================
+ Coverage 81.73% 84.63% +2.90%
==========================================
Files 11 12 +1
Lines 624 794 +170
==========================================
+ Hits 510 672 +162
- Misses 114 122 +8
Impacted Files | Coverage Δ | |
---|---|---|
src/Intervals.jl | 100.00% <ø> (ø) |
|
src/endpoint.jl | 98.11% <ø> (ø) |
|
src/interval.jl | 96.01% <ø> (-0.35%) |
:arrow_down: |
src/interval_sets.jl | 95.76% <95.76%> (ø) |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update bb193ee...525d3ae. Read the comment docs.
Looks like there are a few bits I need to add test coverage for, but it should be in good shape to review.
@omus: Can you help me understand the remaining errors I'm getting in "Julia 1" (I think that is 1.7). It looks like there's something broken about printing that is unrelated to the changes I've made.
In other news, I think this is otherwise ready for review.
@omus: Can you help me understand the remaining errors I'm getting in "Julia 1" (I think that is 1.7). It looks like there's something broken about printing that is unrelated to the changes I've made.
There was an unrelated CI failure for "Julia 1" so I restarted the CI jobs
In other news, I think this is otherwise ready for review.
Thanks for letting me know. I don't quite have time to review such a big PR this week but I'll make time next week.
I don't quite have time to review such a big PR this week but I'll make time next week.
Thanks! I understand, it's a bit large. I don't think there's any great urgency on my end; I'm making use of the branch directly where I need it in some analyses
Just a small bump here: just wanted to check in about when a review would be possible and/or if I should do something to make reviewing this easier (break it up?).
@haberdashPI breaking up a PR would make it easier to review but it may not be necessary. I've been rather swamped lately but I did have this on the agenda for next week. Sorry about the delay on review
Here's another round of changes!
I ended up not removing the code to handle arbitrary endpoints. If you think I should really remove it, I can make that change. The infrastructure for EndpointTracking
turned out to be super handy in resolving how exactly how to use LeftEndpoint
and RightEndpoint
rather than SimpleEndpoint
and made determining return types easier (so at least some of the EndpointTracking is useful regardless).
@omus: I'm hoping to get this merged soon, any chance you can take a look at this in the coming week?
Thanks for the feedback! I'll see how CI does with those changes and then fix any issues it finds.
The one unresolved point is about Base.in
. I'm not strongly for it, but what do you think of my point about it being consistent with the other methods? If you still think it should be removed, I'll delete it; otherwise I can document the behavior.
The one unresolved point is about
Base.in
. I'm not strongly for it, but what do you think of my point about it being consistent with the other methods? If you still think it should be removed, I'll delete it; otherwise I can document the behavior.
We've resolve this. The end result is to use issubset
for this behaviour.
I spent a little time with this PR over the weekend and came to a conclusions:
Further detail on point 1:
I've implemented the comparison tests in https://github.com/invenia/Intervals.jl/pull/190. Doing so has revealed that the set code implemented here doesn't work correctly when dealing with interval endpoints using the same values. One such example:
julia> union([Interval{Closed,Closed}(1, 3)], [Interval{Open, Open}(3, 5)])
2-element Vector{Interval{Int64}}:
Interval{Int64, Closed, Closed}(1, 3)
Interval{Int64, Open, Open}(3, 5)
Other failures can be found by looking for @test_broken
in tests/comparisons.jl
Some more detail on point 2:
The original set methods added to Intervals.jl were treating interval instances as sets themselves. Specifically the intersect(::Interval, ::Interval)
method would return an interval. In this case the logic is sound but the other set methods can return multiple intervals under certain conditions. Because the set functions should be consistent in their return arguments advocate for the following:
f(::Interval, ::Interval) -> Vector{Interval}
@haberdashPI to wrap up this PR we need to:
Thanks again @omus for all your help with this. I'm now looking into dealing with the case where the intervals touch at the endpoints.
Placing these here, since I can't edit your comments:
Thinking a little more deeply on these corner cases, I have a bit of a philosophical conundrum:
If we treat the returned value as a representation of the set of all points that would be included in the implemented set operation, there are actually an infinite number of such representations. That is, in pseudo code we have
issetequal({[0,1), [1, 2]}, {[0, 2]})
(I'm using {}
to represent arrays above, and [
and (
to represent the open/closed distinction)
You can divide any interval into an array of as many subintervals and they remain setequal, according to this representation. The question then becomes, which setequal value is most reasonable? An obvious choice would be the smallest such representation.
However, your tests assume that union([0,1), [2, 3])
returns [0, 3]
, while symdiff([0,1), [2,3])
returns {[0, 1), [2,3]}
, even though these two results are setequal
. Arguably the right answer in both cases is [0, 3]
I think your test is the expected/more useful behavior. But it is a somewhat conceptually odd choice...
I could just special case this one situation, but I think it gets weirder when you start thinking about multiple intervals as the input: when should you and when should you not reduce two abutting intervals into one interval across different multi-interval set operations?
I'm currently trying to think through whether there is a systematic reason why intuition for these two examples differs, and how/whether that makes sense to implement.
Thinking through this more:
I think it is reasonable to be replace the x == y
test for the return values of symdiff with the following
issetequal(x, y) && length(x) <= length(y)
Because the two outputs symdiff could return (two vs. one interval representation) are setequal, they will produce the same outcome for many computations (just with a smaller N). Any operation that expects the length to be 2 sounds pretty fragile to me, since in general one can't rely on a particular cardinality of the returned value.
I think this test on line 594 of comparisions.jl is not quite right:
expected_xor = [Interval{Closed, Open}(first(a), first(a))]
That interval isempty
but, e.g. !isempty(setdiff([1, 5) , (1, 5)))
(which is right, since the singular value at 1 should be included).
There are a number of similar issues in subsequent lines. I've changed these to always use closed bounds on both sides, where appropriate.
Alright! I think that addresses the outstanding issues.
The one open question is whether you buy my arguments above @omus for switching some of the tests to use issetequal
However, your tests assume that
union([0,1), [2, 3])
returns[0, 3]
Which tests assume this? This would definitely be incorrect. The code at the moment only should combine these intervals if you do a union of [0, 2)
and [2, 3]
. In your particular example 1
is not included so these wouldn't be combined. Even in the case where we union do [0, 1]
and [2, 3]
these would not merge these intervals into one as although no integer exists between 1
and 2
detecting this for various types (e.g. Float64
) is challenging so we treat these as disjoint.
I think this test on line 594 of comparisions.jl is not quite right... There are a number of similar issues in subsequent lines. I've changed these to always use closed bounds on both sides, where appropriate.
You are definitely correct in I made a mistake for this case. I'll review the other corrections
However, your tests assume that
union([0,1), [2, 3])
returns[0, 3]
Which tests assume this? This would definitely be incorrect
My bad: I meant something different
However, your tests assume that
union([0,1), [1, 2])
returns[0, 2]
, whilesymdiff([0,1), [1, 2])
returns{[0, 1), [1,2]}
, even though these two results aresetequal
. Arguably the right answer in both cases is[0, 2]
However, your tests assume that
union([0,1), [1, 2])
returns[0, 2]
, whilesymdiff([0,1), [1, 2])
returns{[0, 1), [1,2]}
, even though these two results aresetequal
. Arguably the right answer in both cases is[0, 2]
Thanks for the clarification. I agree with you that both answers are valid but I think for our implementation we should ideally return an interval set where all intervals in the returned vector are disjoint.
In the future I can see us making an IntervalSet
collection which works like Set
but makes use of the special properties intervals provide. For this collection I'd want it to only include disjoint intervals.
As the current state of this PR returns the ideal interval set I'm inclined to update the comparison tests to use ==
rather than issetequal
as this implicitly captures the length
aspect of the tests.
Alright, placed some unions on the appropriate expected_xor's and that seems to have cleared up the remaining failed tests.
I'll wait until Monday to squash and merge this PR. This should allow anyone who was waiting for the dust to settle on this PR the chance to do a final review before we merge this and release version 1.7.
I do think there are some internal improvements to be made here yet
Agreed! I have had a few thoughts in that respect, that I've been holding back on until this is merged.
Registration PR: https://github.com/JuliaRegistries/General/pull/61415
NOTE: This release actively breaks our code base which relies heavily on the current union
and intersect
behaviour. Since this PR intentionally changes the behaviour of those methods operating on existing types (ie: vector of intervals) I'm rolling this change back and it should be made as a breaking change. @omus If you're okay with that I'll plan to just yank this release from the registry while try to move these changes into a 2.0 release.
To clarify the specific change that's breaking is that you could previously perform a normal Base.intersect
over a vector of intervals, like so.
julia> intersect([1..2, 2..3, 3..4, 4..5], [2..3, 3..4])
2-element Vector{Interval{Int64, Closed, Closed}}:
Interval{Int64, Closed, Closed}(2, 3)
Interval{Int64, Closed, Closed}(3, 4)
and this would behave as expected
Now when you make the same call you get
julia> intersect([1..2, 2..3, 3..4, 4..5], [2..3, 3..4])
1-element Vector{Interval{Int64, Closed, Closed}}:
Interval{Int64, Closed, Closed}(2, 4)
To clarify, the latter is probably more correct, but since we don't distinguish an IntervalSet
from a Vector{<:Interval}
there is no way for us to distinguish this ambiguous case. This become particularly problematic when you start dealing with vectors of HourEnding
intervals which have very different behaviour depending on whether we're using a regular date or interval type.
Assuming we don't want to dedicate a type for it then I think we should make these separate functionBase.intersect
and Intervals.intersect
. This allows the end user to decide what behaviour they want.
Hi @rofinn,
First off, sounds like this was pretty disruptive; so sorry for that. A few thoughts on how to move forward:
but since we don't distinguish an IntervalSet from a Vector{<:Interval} there is no way for us to distinguish this ambiguous case
I think this is not quite right. You can pass the objects as Set
s to get the old behavior: e.g.
julia> intersect(Set([1..2, 2..3, 3..4, 4..5]), Set([2..3, 3..4]))
Set{Interval{Int64, Closed, Closed}} with 2 elements:
Interval{Int64, Closed, Closed}(2, 3)
Interval{Int64, Closed, Closed}(3, 4)
So perhaps, in concept, what is needed here is a deprecation like this?
@deprecate intersect(a::AbstractArray, b::AbstractArray) intersect(Set(a), Set(b))
I guess a potential issue there would be that there might be a performance hit for first converting to Set
objects?
In the meantime, perhaps the new behavior can be made available by creating a simple version of the IntervalSet
object and dispatching Base.intersect
(and friends) over that.
Yeah, I think at minimum we should either define a minimal IntervalSet
or just use the package namespace to distinguish the behaviour. We can always talk about adding a deprecation in a future minor release, but would need to leave making this the default behaviour with the base function until a major release. I'll revert this MR and tag to minimize impact to the ecosystem. I'll then start a new MR with your changes where we can flush out how to make it non-breaking, but still usable for your use cases.
Also, mostly as a note to myself: would be good to add some tests in here for your use case as well.
@omus If you're okay with that I'll plan to just yank this release from the registry while try to move these changes into a 2.0 release.
Very reasonable. I apologize for missing this breaking behaviour change. Sounds like we have some additional tests that need to be added to the Intervals.jl test to ensure we don't break expected this behaviour again. Is there a better way to communicate when large changes like this are going to be released? Seems like the typical people-who-care-are-watching isn't working
Yeah, I think at minimum we should either define a minimal
IntervalSet
or just use the package namespace to distinguish the behaviour.
I'm in favour of making a minimal IntervalSet
type
would be good to add some tests in here for your use case as well
Yeah, this is a bit of a weird edge case where we'd need to test for a behaviour that wasn't defined here. I've become rather skittish of extending functions if there's concern that the new method doesn't align with the original concept/definition in Base.
Seems like the typical people-who-care-are-watching isn't working
Yeah, I don't think anyone was ever assigned to keep an eye on this repo after you left. I guess for now it'd probably be best to just assign myself or @fchorney to larger reviews for now as we're the next biggest contributors. I'll raise this maintenance concern more broadly to see if there's a more sustainable option.
I'm in favour of making a minimal IntervalSet type
I was hoping that namespacing would be the simpler solution, but given how many things we already extend in base I think you're probably correct. Moving things from the Base to Intervals namespaces would likely be more breaking.
This defines a set of functions for easily computing intersections, unions etc... of a set of intervals (passed as an array).
cc: @omus, @ericphanson, @kleinschmidt
This is based off of a closed PR to TimeSpans (https://github.com/beacon-biosignals/TimeSpans.jl/pull/11)