haskell / containers

Assorted concrete container types
https://hackage.haskell.org/package/containers
315 stars 178 forks source link

Binary search for Data.Sequence #769

Closed jamespayor closed 1 month ago

jamespayor commented 3 years ago

Data.Sequence as a data structure supports binary search, but it looks like no methods are currently exposed for doing so.

(Well, not exactly binary search, but you know what I mean - search primitives that take O(log n) time given a monotone predicate.)

In particular, I'd love to have something like this in the API:

-- Splits the sequence at a point where the predicate becomes true.
-- When the predicate is monotone, the split occurs at the first such point.
splitL, splitR :: (a -> Bool) -> Seq a -> (Seq a, Seq a)

... or perhaps something more like:

-- Returns the item in the sequence at which the predicate becomes true.
-- The predicate is passed the item, and the numbers of elements on either side of it in the sequence.
searchL, searchR :: (Int -> a -> Int -> Bool) -> Seq a -> Maybe (Seq a, a, Seq a)

An obvious use case is for querying sorted data, stored in a Seq. But there are also more cunning things you can do, whenever the accumulated data is monotone.

(I know I can go and use Data.FingerTree for this sort of thing, but its constants are way worse than Seq, and I nevertheless do keep wanting to be able to search Seqs.)

jwaldmann commented 2 years ago

splitL, splitR :: (a -> Bool) -> Seq a -> (Seq a, Seq a)

https://hackage.haskell.org/package/containers-0.6.5.1/docs/Data-Sequence.html#v:spanl ?

searchL, searchR :: (Int -> a -> Int -> Bool) -> Seq a -> Maybe (Seq a, a, Seq a)

in the API's naming style, this would be span(l/r)WithIndex, which does not exist right now.

Some functions have _WithIndex variants (e.g., map), some do not (e.g., filter?)

jamespayor commented 2 years ago

Those do linear scans. At the time I lodged this issue, I was interested in binary search for a Seq whose elements are in some kind of sorted order.

jwaldmann commented 2 years ago

I see. So, something like https://hackage.haskell.org/package/containers-0.6.5.1/docs/Data-Set-Internal.html#v:spanAntitone

Seq whose elements are in some kind of sorted order

then why not use Data.Set (can you give an example where that does not work)

Data.Fingertree .. constants way worse than Seq

Wait, what? Data.Sequence implementation is a finger tree.

[EDIT] The only difference (from Data.Fingertree to Data.Sequence) is that the polymorphic measure component is specialized to Int (size)? An ideal compiler should be able to generate efficient code, and not require manual specialisation? Well, then maybe this can serve as a GHC test case.

jamespayor commented 2 years ago

Fwiw I'm no longer interested in pushing this issue! You can close it if it seems an unwelcome addition to the API, or unwelcome work.

Aside re Fingertree vs Seq constants: when I benchmarked back in the day there was a large difference. I bet the specialization and unpacking of the measure Int counts for a lot in code size and indirection, but I'm not sure this explains what I saw, or if the benchmarks would still see a large discrepancy.

meooow25 commented 1 month ago

Data.Sequence as a data structure supports binary search, but it looks like no methods are currently exposed for doing so.

(Well, not exactly binary search, but you know what I mean - search primitives that take O(log n) time given a monotone predicate.)

Is there a source for this?
To me it seems infeasible to fit an O(log n) search on the structure.

(I know I can go and use Data.FingerTree for this sort of thing, but its constants are way worse than Seq, and I nevertheless do keep wanting to be able to search Seqs.)

Note that FingerTree also does not allow binary search on the elements. A FingerTree v a allows binary search by v, not a, which is a crucial difference.
A Seq is a specialized FingerTree where v is set to an Int which is the size/count.

So the equivalent of

split :: Measured v a => (v -> Bool) -> FingerTree v a -> (FingerTree v a, FingerTree v a) 

is

split :: (Int -> Bool) -> Seq a -> (Seq a, Seq a)

This does not exist today, but an equivalent effect can be achieved:
The Int -> Bool is only applied to [0 .. size-1], so the split point can be located independently, without using the Seq. Once it is located, we can use the existing splitAt :: Int -> Seq a -> (Seq a, Seq a) to get the desired result.

Coming back to searching by a, Seq supports indexing (in O(log n)), so it is straightforward to apply a binary search on top of that. This results in an O(log^2 n) search. I don't see a way to do better than this.

treeowl commented 1 month ago

Just for fun: does the big-O change any if we use splits instead of lookups? That gets something like

log n + (log n - 1) + (log n - 2) + (log n - 3) + ... + 1 (Total of about log n terms)

I don't know what that is off the top of my head.

meooow25 commented 1 month ago

That would also be O(log^2 n).

1 + 2 + ... + m = m (m+1) / 2 = O(m^2).

treeowl commented 1 month ago

That would also be O(log^2 n).

1 + 2 + ... + m = m (m+1) / 2 = O(m^2).

I believe you, but I don't understand your explanation. We have Theta(log^2 n) - Theta(log^n), which is certainly O(log^2 n), but that doesn't explain why the bound is tight.

meooow25 commented 1 month ago

I believe you, but I don't understand your explanation.

Ah, I just set m = log n.

log n + (log n - 1) + (log n - 2) + ... + 1 = 1 + 2 + ... + log n = (log n) (log n + 1) / 2 = Theta(log^2 n)

More directly, you could also apply the Master theorem to T(n) = T(n/2) + O(log n) which yields the same result.

treeowl commented 1 month ago

Ah, I see now. You reversed the order of the sum. Thanks.