rust-lang / libs-team

The home of the library team

Improved Binary Search (API Change Proposal) #81

Closed liborty closed 2 years ago

liborty commented 2 years ago

Improved Binary Search

The problem

Quoting from: primitive.slice: "If there are multiple matches, then any one of the matches could be returned. The index is chosen deterministically, but is subject to change in future versions of Rust. If the value is not found then Result::Err is returned, containing the index where a matching element could be inserted while maintaining sorted order."

Some of these issues are to some extent addressed by partition_point, though it offers different functionality.
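For illustration, a minimal sketch of the ambiguity and of what partition_point already gives today (not part of the original proposal text):

fn main() {
    let v = [1, 2, 2, 2, 3];
    // binary_search may return any of the indices 1, 2 or 3 for the value 2
    assert!(matches!(v.binary_search(&2), Ok(1..=3)));
    // partition_point pins down one end of the run: the first index whose
    // element is not less than 2
    assert_eq!(v.partition_point(|&x| x < 2), 1);
}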

Motivation, use-cases

Solution

The proposed solution has the following new benefits:

Links and related work

I have implemented the proposal as the function binary_find. It can presently be seen in my crate indxvec.

the8472 commented 2 years ago

A typical application is efficient removal of duplicates. For that, we need to find all (partially) equal items, not just one random equal item.

Can you provide more details? Like how that isn't covered by sorting (which has to be done anyway for a binary search) and then using Vec::dedup?

It works on both ascending and descending orders,

I don't see how this is necessary for deduplication.

It solves the ambiguity hinted at in the above quote from std documentation.

But so does partition_point?

liborty commented 2 years ago

Thanks for your perceptive questions.

Yes, it is true that Vec::dedup will, in principle, solve the outlined example application. In practice, my proposal will often be more efficient, for example when we want to remove only the duplicates of one specific value (possibly from lots of lists). Then it is wasteful to search all of them linearly for all the duplicates; we might not even want to remove the other duplicates at all. Sorry, I should have written 'duplicates of some value' instead of just 'duplicates', which implies all duplicates of all values, a different thing. I hear you say: "ok, so I pass a closure to dedup_by that tests for equality of only that value." Yes, but then, to my knowledge, it will not deploy binary search.
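For a concrete flavour, a minimal sketch of removing every copy of one value from an already sorted Vec, with partition_point standing in for the proposed range-returning search:

// Sketch: locate the run of `target` in O(log n) and remove it with a single
// drain, instead of scanning the whole vector with dedup-style comparisons.
fn remove_all_of(v: &mut Vec<i32>, target: i32) {
    let run = v.partition_point(|&x| x < target)..v.partition_point(|&x| x <= target);
    v.drain(run);
}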

Different orders may not be necessary for deduplication, but when you already have a sorted very long list, you do not want to have to reverse it just because your binary search can cope with only one direction. Also, there are many other applications for a properly implemented general binary search.

Yes, I acknowledge that partition_point also solves the Err returns but, then again, it has somewhat different functionality. I actually started by looking at binary_search, since it is there, and I found it inadequate for my needs. Initially, I did not even suspect that there might be another method with an unrelated name which solves a similar problem somewhat better. Yes, I should have studied the documentation more thoroughly, but I offer this as an argument for implementing binary_search properly, precisely because it is there and it is what people reach for first.

the8472 commented 2 years ago

but when you already have a sort of a very long list, you do not want to have to reverse it just because your binary search can cope with only one direction.

Or you can use Reverse:

    let slice = &[100usize, 80, 77, 4, 1];
    assert_eq!(Ok(1), slice.binary_search_by_key(&Reverse(&80), |e| Reverse(e)));
liborty commented 2 years ago

but when you already have a sort of a very long list, you do not want to have to reverse it just because your binary search can cope with only one direction.

Or you can use Reverse:

    let slice = &[100usize, 80, 77, 4, 1];
    assert_eq!(Ok(1), slice.binary_search_by_key(&Reverse(&80), |e| Reverse(e)));

Yes, or just use a complement index: slice.len()-idx. When you have several occurrences, it will also reverse your resulting range. I am not disputing that there are alternative ways of programming anything with anything :) The point I would make is that if you try this with plain binary_search, which returns a random occurrence, you will end up in quite a mess.

scottmcm commented 2 years ago

I'll cc https://github.com/rust-lang/rfcs/issues/2184, where something like this has been discussed (inspired by https://en.cppreference.com/w/cpp/algorithm/equal_range) for almost 5 years now.

I think the -> Range<usize> version of binary search makes sense to offer. I don't know how to spell its name, though.

liborty commented 2 years ago

Yes, if the existing binary_search is to remain, then we need a new short and distinguished name. How about: binary_find ?

As to the functionality, there is an argument for keeping that simple too, as per my suggestion. That is, using implied PartialOrd instead of passing in a full blown comparator closure. After all, as has been rightly pointed out, the partition_point can be bent to that purpose.

PartialOrd is already an improvement in generality on the Ord of the existing binary_search.

Regarding the above discussion about working on both ascending and descending orders: yes, that is to some extent optional, but I have included it because it is practically a zero-cost exercise (a single test of a bool).

I note that if you accidentally use the wrong order, the existing binary_search gives misleading results.

I am going to detect the order automatically, for convenience and safety.

scottmcm commented 2 years ago

I'm not on libs-api, but I'd be very surprised if a bool like that for the different orders got in.

.binary_search_range(x, true) is neither readable nor rememberable, and while .binary_search_range(x, Order::Ascending) is readable, it's also super-verbose and the extra enum doesn't really pull its weight.

See also how there's no bool on sort for which way you want it sorted.


What you might consider instead -- as a separate ACP! -- would be something to make it easier to use cmp::Reverse to handle this situation.

For example, if there was a method to go &[T] -> &[cmp::Reverse<T>], then you'd have things like

[5, 4, 2, 1].bikeshed_this().binary_search(Reverse(3)) // Err(2)
[1, 4, 2, 3].bikeshed_this_mut().sort() // [4, 3, 2, 1]
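For comparison, a sketch of how the first example can be written today without any adapter, by flipping the comparison inside binary_search_by (this is not the proposed bikeshed_this):

fn main() {
    // Descending data; the flipped comparator is still monotone across the
    // slice, so the search and the insertion index remain correct.
    let slice = [5, 4, 2, 1];
    assert_eq!(Err(2), slice.binary_search_by(|e| 3.cmp(e)));
}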
liborty commented 2 years ago

See my above edit. I propose .binary_find(x) with order detected automatically.

PS: there is a bool on my implementations of mergesort and hashsort, but that is another story. I see ascending and descending orders as essentially equivalent and see no reason to implicitly assume either of them.

shepmaster commented 2 years ago

You could go for a builder style API. Hand-wavy and without real names...

a_slice.binary_finder().first()
a_slice.binary_finder().last()
a_slice.binary_finder().any()
a_slice.binary_finder().range()
liborty commented 2 years ago

Your suggestion can be used to unite my proposal, let us call it .binary_find(), with the existing binary_search().

All of those can be trivially obtained from the range returned by my binary_find, except any, which is what the current binary_search returns.

programmerjake commented 2 years ago

imho it'd be useful to be able to binary search on non-slice things, e.g. a custom data type that is indexable but whose data is not contiguous in memory.

liborty commented 2 years ago

I am also offering my .binsearch_indexed(), which works in much the same way. The name can be changed.

liborty commented 2 years ago

It works like this (self is a slice of any type &[T]):

    /// Binary search of an index sorted slice in ascending or descending order. 
    /// Like `binsearch` but using indirection via sort index `idx`.
    fn binsearch_indexed(self, idx:&[usize], val:&T) -> Range<usize> 
        where T: PartialOrd;

Where idx is obtained by:

    /// Makes a sort index for self, using key generating closure `keyfn`
    fn keyindex(self, keyfn:fn(&T)->f64, ascending:bool) -> Vec<usize>;

The beauty of it is that T can be any arbitrarily complex user-defined type; as long as you supply the keyfn closure, the index is built using my super-fast hashsort (over f64 keys). I could subsume keyindex within binsearch_indexed, but I find that having the explicit index available is often useful for other purposes as well.

When calling keyindex, the closure can capture any additional variables it might need. I use it in the rstats crate to sort whole multidimensional vectors (as T), with keys derived as any scalar function of the vectors, such as their distance from some point; the distance then serves as the PartialOrd key for the purposes of the search.
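A rough std-only equivalent of this pattern, with hypothetical names (the crate's keyindex/binsearch_indexed differ in detail and use hashsort rather than a comparison sort):

use std::cmp::Ordering;

// Build a sort index over arbitrary items using a key closure.
fn key_index<T>(data: &[T], keyfn: impl Fn(&T) -> f64) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..data.len()).collect();
    idx.sort_by(|&a, &b| {
        keyfn(&data[a])
            .partial_cmp(&keyfn(&data[b]))
            .unwrap_or(Ordering::Equal)
    });
    idx
}

// Binary search through the indirection; the result is a position in `idx`,
// not in `data`.
fn search_indexed<T: PartialOrd>(data: &[T], idx: &[usize], val: &T) -> Result<usize, usize> {
    idx.binary_search_by(|&i| data[i].partial_cmp(val).unwrap_or(Ordering::Equal))
}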

programmerjake commented 2 years ago

It works like this (self is a slice of any type &[T]):

    /// Binary search of an index sorted slice in ascending or descending order. 
    /// Like `binsearch` but using indirection via sort index `idx`.
    fn binsearch_indexed(self, idx:&[usize], val:&T) -> Range<usize> 
        where T: PartialOrd;

that's nice, but it doesn't actually do what imho is needed, which is more like:

use core::cmp::Ordering;
use core::ops::Range;

// TODO: pick better name
// TODO: add versions with u32, u64, and u128 instead of usize -- needed
// because you might be searching on disk or over the network rather
// than in-memory and usize can be too small.
// TODO: switch to using Try trait rather than Result.

// Note how this needs no slice to be constructed in order to
// search, not everything that can be searched fits in memory or
// wants to pay the up-front cost of building a slice, e.g. if you're
// searching in something with 400 trillion entries, nearly no one
// has that much memory to construct a slice of, nor wants to
// spend hours filling memory with indexes before it can conduct
// a simple <1s search.
pub fn try_bin_search_algo<E>(mut search_in: Range<usize>, mut cmp_at: impl FnMut(usize) -> Result<Ordering, E>) -> Result<Result<usize, usize>, E> {
    // hope i got this right
    while !search_in.is_empty() {
        let mid = (search_in.end - search_in.start) / 2 + search_in.start;
        match cmp_at(mid)? {
            Ordering::Less => search_in.end = mid,
            Ordering::Greater => search_in.start = mid + 1,
            Ordering::Equal => return Ok(Ok(mid)),
        }
    }
    Ok(Err(search_in.start))
}
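A sketch of calling it over an ordinary sorted slice; Infallible marks that in-memory probes cannot fail, and the closure compares the target against the element at the probed index:

use core::convert::Infallible;

fn demo() {
    let data = [1u32, 3, 5, 7];
    let res: Result<Result<usize, usize>, Infallible> =
        try_bin_search_algo(0..data.len(), |i| Ok(5.cmp(&data[i])));
    assert_eq!(res, Ok(Ok(2)));
}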
liborty commented 2 years ago

So, you envisage combining this with random disk/internet access? I have not thought that far. Does that not warrant a separate method?

PS. I think you also need to check that the sought value is actually inside the range, i.e. be more careful about the Err value.

programmerjake commented 2 years ago

So, you envisage combining this with a random disk/internet access?

yes, or something else other than just an index-into-slice-and-compare, e.g.:

// compute sqrt(a) via binary search.
try_bin_search_algo(0..(1 << 32), |v| Ok(if let Some(v2) = v.checked_mul(v) { a.cmp(&v2) } else { Ordering::Less }))

I have not thought that far. Does that not warrant a separate method?

probably...

liborty commented 2 years ago

Yes, like I said, I can subsume the keyindex method and just pass the closure to the binary search directly. You are quite right that doing the comparisons 'lazily' with a closure probe saves the time and space of generating all the key values up front. Thank you for that!

cuviper commented 2 years ago

It seems to me that there is enough design space left that this isn't ready for the standard library yet.

programmerjake commented 2 years ago

This makes me think that Rust needs something in the standard library like C++'s <algorithm> header, where we can put binary search, sorting algorithms (which also need a version that doesn't need a slice to sort), and more.

Maybe core::algo::*?

CAD97 commented 2 years ago

I decided against passing in a comparator closure, as the same effect can be achieved, imho in a cleaner way, by implementing PartialOrd for your custom data type T

Just to note, this isn't always possible. E.g., in string interners, it's really nice to store

type Span = Range<u32>;
type Symbol = NonZeroU32;
pub struct Interner<S> {
    hasher: S,
    string_to_symbol: HashSet<Symbol, ()>,
    symbol_to_span: Vec<Span>,
    span_to_string: String,
}

with the hash/comparison then being between the &str and &span_to_string[symbol_to_span[symbol]]. If what is actually stored in the data structure you're searching is an index rather than the key itself, it cannot implement PartialOrd/Ord/Eq directly; it would need to store an actual reference back to the shared resource. And in a case like this, that's not even really possible, since the String can be appended to and reallocate, invalidating extant references.

This is just what I'm the most familiar with, but this pattern of raw_entry usage to do "dependency injection" for comparisons is a surprisingly commonly available optimization once you know how to look for it.

liborty commented 2 years ago

I struggle to understand this. Given that you are being asked to implement PartialOrd on your own custom type T, how is it not possible to include the same struct fields in T?

CAD97 commented 2 years ago

See https://github.com/CAD97/strena as an example. The HashMap raw_entry API allows you to provide dependency injected fn hash(&K) -> u64 and fn eq(&Q, &K) for lookup.

liborty commented 2 years ago

Then why not just do your binary search over the symbols?

CAD97 commented 2 years ago

Because the symbols aren't sorted intrinsically. They're indexes, sorted by the string which they reference, and to recover that string you need to provide outside data. For example:

impl Interner {
    pub fn get(&self, s: &str) -> Option<Symbol> {
        let string_to_symbol: &[Symbol] = &self.string_to_symbol;
        let symbol_to_span: &[Span] = &self.symbol_to_span;
        let span_to_string: &str = &self.span_to_string;

        let range = binary_find(0..string_to_symbol.len(), |ix| {
            span_to_string[symbol_to_span[string_to_symbol[ix]].clone()].cmp(s)
        });
        Some(range)
            .filter(|range| !range.is_empty())
            .map(|range| string_to_symbol[range.start])
    }
}
adjusted implementation:

pub fn binary_find(range: Range<usize>, cmp: impl Fn(usize) -> Ordering) -> Range<usize> {
    use Ordering::*;
    // `last` finds the exclusive end of the range of the matching items
    let last = |ix: usize| -> usize {
        (ix + 1..range.end)
            .find(|&i| cmp(i) == Greater)
            .unwrap_or(range.end)
    };
    // `first` finds the inclusive start of the range of the matching items
    let first = |ix: usize| -> usize {
        (range.start..ix)
            .rfind(|&i| cmp(i) == Less)
            .map_or(range.start, |i| i + 1)
    };
    // Checking for errors, special cases and order
    if range.is_empty() {
        return range;
    };
    match (cmp(range.start), cmp(range.end - 1)) {
        // searchee is before range.start
        (Greater, _) => return range.start..range.start,
        // searchee is beyond range.end
        (_, Less) => return range.end..range.end,
        // search range is all equal
        (Equal, Equal) => return range,
        // searchee at range.start
        (Equal, _) => return range.start..last(range.start),
        // searchee at range.end
        (_, Equal) => return first(range.end - 1)..range.end,
        _ => (),
    }
    // Clean binary search
    let mut hi = range.end - 1; // initial high index
    let mut lo = range.start; // initial low index
    loop {
        let mid = lo + (hi - lo) / 2; // binary chop here with truncation
        if mid > lo {
            // still some range left
            match cmp(mid) {
                // still too low
                Less => lo = mid,
                // still too high
                Greater => hi = mid,
                // found it!
                Equal => return first(mid)..last(mid),
            }
        } else {
            // interval is exhausted, searchee not found
            return hi..hi;
        };
    }
}
liborty commented 2 years ago

Yes, but now every cmp closure has to check for the descending order on every invocation, which is not practical. If you can solve that problem, I will be happy to adopt your proposal.

Also, we cannot return an empty range when the searchee is outside it, as the empty range is reserved for indicating the insertion position of a missing searchee within the range. However, this is not such a problem and, if you consult my latest code, you will see that I have solved it already by returning the range limit in the direction of the outlying index, as Err(limit).

I like the probe closure being explicit. I think it makes the code much easier to follow than burying it, the searchee and everything, inside the opaque captured environment of cmp. This seems just as bad as any global variables ever were. Any reason why we cannot have both probe and cmp?

CAD97 commented 2 years ago

Yes but now every cmp closure has to check for the descending order on every invocation

No, because you just prestore that information in the closure the same way you're doing currently. Or use cmp::Reverse, or whatever; Rust's sorting APIs do not assume which order you want, they require you to put in the correct ordering.

If anything, this is better, because with this version you can pass in a static order whereas your code always does a dynamic comparison ordering.

Also, we can not return an empty range when the searchee is outside it, as the empty range is reserved for indicating the insert order of a missing searchee within the range.

If I return begin..begin, that means that the insertion position is before the first element, right? Same for end..end is after the end element. If you want to know if the element is strictly contained in the range, this isn't the API that's serving that purpose.

Though that does mean the end..end return at the end is wrong. (I mostly just copied your impl, assuming it good, fixing up types.) It should be mid..mid, where mid is the insertion point to retain a sorted order.

I like the probe closure being explicit. I think it makes the code much easier to follow than burying it, the searchee and everything inside the opaque cmp captured environment. This seems just as bad as any global variables ever were. Any reason why we can not have both probe and cmp?

I'm going to turn this question around. What benefit is there to having a separate index and compare step? Having it in one step saves a (potentially) dynamic call. Rust likes its stateful closures; in fact, this should probably even take FnMut.

Having a unified compare does have benefits, in that it (potentially) lowers monomorphization pressure, but more importantly, your API requires the key to be sized, copy, 'static (because you extract the key by value fn(usize) -> T) whereas using a single cmp closure allows the indexed values to be ephemeral to the comparison closure, by reference or by value.


Also, always indexing in u128 is going to be a big problem. u128 is emulated and slow on plenty of platforms, and you physically can't have more than isize::MAX distinct items anyway.

We probably need to get the bounds by continuing the binary search as well, rather than falling back to a linear probe. Or at a minimum use some sort of exponential probe to avoid the worst case performance of [0, [1; BIG]..., 2].

CAD97 commented 2 years ago

Actually, begin..begin means that the insertion position is at the first element.

AIUI, given a sorted two-element slice [ a , b ] and binary searching for x, the correct semantics are

0..0 => [ x , a , b ].is_sorted()
1..1 => [ a , x , b ].is_sorted()
2..2 => [ a , b , x ].is_sorted()
0..1 => x == a && x < b
1..2 => a < x && x == b
0..2 => x == a && x == b

and these are all possible outcomes of a binary search on a sorted list. Where do we disagree?

I suppose, if this is a subslice, insertion at the edges may not maintain the entire slice being sorted. However, this is still a determinable output condition, and there's no reason to assume that binary searching a subslice that your item is not in will give you a position that keeps the superslice still sorted; the correct answer is to say inserting at the edge would keep the input slice range sorted.

I think the more important distinction, though, is that "insertion at an element" is meaningless. You insert between elements. The range 0..0 is an empty range which is before the first element, as is observed with Vec::splice.

When I say "insertion position is before the first element," I do mean the position directly before the first element.


I'm also going to repeat my counterquery for visibility:

What benefit is there to having a separate index and compare step?

Because the benefit of not is that a separate index step really wants a signature of fn index(&mut self, ix: usize) -> Self::Key<'_>; that is, return a key which can optionally borrow from the container and only needs to be valid until the next call to index. (Think "random access lending iterator".)

liborty commented 2 years ago

0..0 => [ x , a , b ].is_sorted() ... and these are all possible outcomes of a binary search on a sorted list. Where do we disagree?

Ah, I see what is going on! I was referring to the subscripts of the input range, as supplied, assuming that any insertion process would, on seeing 0..0, insert x in position 0 and shift the rest one up. Whereas you are assuming that this has already been done and your subscripts refer to the new slice. So, we do not disagree.

On the count of having an all-in-one comparator, I think you persuade me about the benefits.

I accept that the current functions/methods implicitly assume that everything is in ascending order, but that seems to me an unwarranted assumption. In any case, I would like to relax it. I think that having both orders automatically recognised and correctly acted upon has benefits in generality and code robustness. Presumably, this will now have to be done externally to binary_find, where an if statement similar to mine would issue two different calls to binary_find, with two different cmp closures, one involving Reverse? That seems more demanding on the user but, as you say, not everyone will necessarily need it.

Thank you for your clarifications, I find them very useful.

CAD97 commented 2 years ago

Presumably, this will now have to be done externally, issuing two different calls on the binary_find

No, one call is sufficient; just do a let cmp = if ascending { |x| key.cmp(x) } else { |x| key.cmp(x).reverse() } like indxvec currently does internally, or let cmp = |x| { let c = key.cmp(x); if ascending { c } else { c.reverse() } }. (The latter might have better inlining behavior.)
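Spelled out as a complete function, a minimal sketch of the second variant (the name make_cmp is illustrative):

use std::cmp::Ordering;

// The order flag is captured once; each probe just conditionally reverses
// the comparison result.
fn make_cmp<'a, T: Ord>(key: &'a T, ascending: bool) -> impl FnMut(&T) -> Ordering + 'a {
    move |x: &T| {
        let c = key.cmp(x);
        if ascending { c } else { c.reverse() }
    }
}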

I accept that the current functions/methods implicitly assume that everything is in ascending order but that seems to me an unwarranted assumption.

For intrinsic ordering, I might agree with you. But when injecting a comparator/ordering, it is absolutely reasonable to assume that the ordering is the same that the ordered input is ordered by.

Whereas you are assuming that this has already been done and your subscripts are referring to the new slice.

... no? Given index..index, I should do vec.insert(index, item) to insert it and maintain the sort order. Yes, that results in vec[index] being item, but this is because we inserted it at that index. The index is both the insertion index and the index after insertion, because that's just restating the same thing twice.
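A minimal sketch of that, with partition_point standing in for the range-returning search:

fn demo() {
    let mut v = vec![1, 3, 5];
    let i = v.partition_point(|&x| x < 4); // 2, i.e. the empty range 2..2
    v.insert(i, 4); // inserting at that index keeps the data sorted
    assert_eq!(v, [1, 3, 4, 5]);
}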

liborty commented 2 years ago

Thanks to your excellent suggestions we are making progress. I have now implemented all of them, apart from also replacing the linear search for matching items with a binary search. In a few extreme cases that will be much quicker, but in many usual cases, where there are only a few matching items, it will be slower. If there is a feeling that it is worth it, I can certainly do that: basically a recursive call with a comparator that looks at two adjacent items to find the edge cases.

The biggest recent improvement I made is a generic search Range, so that the user can choose whichever index type suits best: usize, u128, or anything else satisfying the generic trait bounds. Here is an example application to root finding, using f64:

use core::ops::Range;
#[test]
fn roottest() { 
    let num:f64 = 1234567890.0;
    let root:f64 = 5.3;
    let range = broot(num,root);
    println!("{} to the power of {YL}1/{}{UN}\nbinary_find:\t{} error: {}\npowf:\t\t{} error: {}", 
    num.yl(), root, range.start.gr(), (num-range.start.powf(5.3)).gr(),
    num.powf(1./root).gr(),(num-num.powf(1./root).powf(root)).gr()); 
}

///num.powf(1./root) using binary search
fn broot(num:f64,root:f64) -> Range<f64> {
    binary_find(1_f64..num,
        |&probe| { 
            if probe.powf(root) < num { Less }
            else if probe.powf(root) > num { Greater }
            else { Equal }})
}
running 1 test
1234567890 to the power of 1/5.3
binary_find:    51.92543988913025 error: -0.000000476837158203125
powf:           51.92543988913026 error: -0.0000011920928955078125
test roottest ... ok
CAD97 commented 2 years ago

In std, the range index type would probably be bound by Step[^1]. This can theoretically even be accomplished exclusively with step_unchecked and without cloning the range keys, though I don't know what the performance impact of doing so would be; std would probably want to specialize on Copy to use the simple let mid = step_halfway(lo, hi) approach.

[^1]: Step is surprisingly close to C++'s named requirement LegacyRandomAccessIterator / concept random_access_iterator, minus the dereference for item access of course. (Note C++ copy constructors are Rust Clone.)

However, I'm bowing out on design here. I have strong opinions on the proper dependency injection API, but less so on the rest of the details.

(Though I do want to note that in parallel to raw_entry, perhaps this sh/could be named raw_binary_search.)

programmerjake commented 2 years ago

more than just Step would be needed to search in 0..!0u128, since that exceeds the range of usize by a large factor. you'd probably want:

trait StepHalfway: Step {
    /// returns `(self + other) / 2` but without overflowing.
    fn halfway(&self, other: &Self) -> Self;
}
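For illustration, the overflow-free midpoint such a method would compute, sketched for one concrete type:

// (a + b) / 2 for u128 without overflowing the intermediate sum.
fn halfway_u128(a: u128, b: u128) -> u128 {
    let (lo, hi) = if a <= b { (a, b) } else { (b, a) };
    lo + (hi - lo) / 2
}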
programmerjake commented 2 years ago

also it shouldn't be limited to just integer types, e.g. binary searching a list of SHA1 or SHA256 hashes would be very useful for git stuff. (admittedly that could be done with u128 indexes searching in an ordered list on disk, but >128-bit indexes would still be useful).

other examples: searching for the 10^600-th prime by binary searching on PrimePi, this would use BigInt indexes.

CAD97 commented 2 years ago

Step eventually should be able to be stabilized and implementable for any type that fits its interface.

scottmcm commented 2 years ago

As a meta-comment, this thread feels very much like what I'd expect in a crate design discussion. That's not saying that these ideas are bad, but the super-general version of all this feels like something that belongs in a crate, not in core, since it's far from the plausible original "hey, can we just get one method on slice that's a bit better for these cases".

Especially if it's already implemented in a crate, then people can just use that crate today, as opposed to the many months (at least) it'll take before they could use it in core.

liborty commented 2 years ago

I must be missing something in this discussion of Step, because binary_find has worked just fine on u128 since several versions back. Also on f64, which is not Step, as you can see from my example above your posts. For this reason, Range indexing is avoided internally; Range is only used to determine the limits of the search. (It is too easy in Rust to overcomplicate matters unnecessarily.) I only ever need to add/subtract one and divide by two. With this, I believe I am still within the scope of: "hey, can we just get one method on slice that's a bit better for these cases".

This is the signature:

pub fn binary_find<T,F>(range: Range<T>,cmpr: F ) -> Range<T>
    where T: PartialOrd+Copy+Add<Output=T>+Sub<Output=T>+Div<Output=T>+From::<u8>,
          F: Fn(&T)->Ordering {

Also, I do not see the purpose of making the comparator closure FnMut rather than just Fn. Could anyone who thinks it is needed please give a simple example of how it might actually be useful? Would you want to change what you are looking for in the middle of the search? I think not.

programmerjake commented 2 years ago

some simple reasons to have it be FnMut: reusing an http connection, or needing to cache partial results.
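A minimal sketch of the caching case; fetch is a stand-in for whatever expensive lookup is being probed:

use std::cmp::Ordering;
use std::collections::HashMap;

// The comparator must be FnMut because it mutates its own cache between calls.
fn cached_cmp(target: u64, fetch: impl Fn(usize) -> u64) -> impl FnMut(usize) -> Ordering {
    let mut cache: HashMap<usize, u64> = HashMap::new();
    move |i| {
        let v = *cache.entry(i).or_insert_with(|| fetch(i));
        target.cmp(&v)
    }
}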

CAD97 commented 2 years ago

I only ever need to add/subtract 1 and divide by 2. With this, I believe I am still within the scope of: "hey, can we just get one method on slice that's a bit better for these cases".

This is the signature:

pub fn binary_find<K, F>(range: Range<K>, cmp: F) -> Range<K>
where
    K: PartialOrd + Copy + Add<Output=K> + Sub<Output=K> + Div<Output=K> + From::<u8>,
    F: Fn(&K) -> Ordering,

(Reformatted to be legible. Please, use rustfmt, at least if you're posting code on rust-lang repositories. Maintaining a consistent style helps code to be quickly understood.)

This signature is still vastly more complicated than any other bound in the standard library APIs. And it's not even what you say it is! You're not requiring "add one and divide by two", you're requiring From<u8> and Div! How would this make any sense for e.g. Uuid, Sha1, or other types which are Step (have predecessor/successor operations) but are not just numbers?

(x: Sha1) / Sha1::from(2) is nonsensical.

Step (or an extension of it) is the semantically correct bound here, as random-access successor/predecessor is what you need to do a binary search. If Step isn't powerful enough, then it should be generalized and/or extended to support this use case.

That said, though, binary searching over a dataset with more elements (which is sorted in some fashion!) than the address space is an extremely niche use case. And will likely want to switch strategies when moving from impossibly sparse to some reasonable scale to work on.

liborty commented 2 years ago

Your formatting has actually destroyed my signature snippet. It mismatched the trait arguments and chopped off the end of the bounds. No wonder you have difficulties reading it.

Are we talking about the same thing here? My type T is the type of the index, not of the data. The bounds may look complicated, but it is not my fault that there is no Countable trait in Rust. How do you propose to quickly find the midpoint of an uncountable index range? Step or no Step?

timvermeulen commented 2 years ago

I second the request to please use rustfmt in these discussions. I've had some (slight) difficulties reading these code snippets because I'm so used to reading Rust code that is formatted using it.

liborty commented 2 years ago

OK, here it is:

pub fn binary_find<T, F>(range: Range<T>, cmpr: &mut F) -> Range<T>
where
    T: PartialOrd + Copy + Add<Output = T> + Sub<Output = T> + Div<Output = T> + From<u8>,
    F: FnMut(&T) -> Ordering,
liborty commented 2 years ago

and here is the source listing of the latest version (Edit: made the comparator FnMut)

use core::cmp::Ordering::{self, Equal, Greater, Less};
use core::ops::{Add, Div, Range, Sub};

/// General Binary Search
/// Search within the specified Range<T>, which is always ascending.
/// The (indexing) range values can be of any generic type T satisfying the listed bounds.
/// Typically usize for searching efficiently in-memory, u128 for searching whole disks or internet,
/// or f64 for solving equations.
/// Comparator closure `cmpr` is comparing against a search item captured from its environment.
/// The sort order reflected by `cmpr` can be either ascending or descending (increasing/decreasing).
/// When item is in order before range.start, empty range range.start..range.start is returned.
/// When item is in order after range.end-1, range.end..range.end is returned.
/// Normally binary_find returns Range of all the consecutive values
/// that are PartiallyEqual to the sought item.
/// When item is not found, then the returned range will be empty and
/// its start (and end) will be the sort position where the item can be inserted.
pub fn binary_find<T, F>(range: Range<T>, cmpr: &mut F) -> Range<T>
where
    T: PartialOrd + Copy + Add<Output = T> + Sub<Output = T> + Div<Output = T> + From<u8>,
    F: FnMut(&T) -> Ordering,
{
    let one = T::from(1); // generic one
    let two = T::from(2); // generic two
    let lasti = range.end - one;

    // Closure to find the last matching item in direction up/down from idx
    // or till limit is reached. Equality is defined by `cmpr`.
    let scan = |idx: &T, limit: &T, cpr: &mut F, up: bool| -> T {
        let mut probe = *idx;
        let step = |p: &mut T| if up { *p = *p + one } else { *p = *p - one };
        step(&mut probe);
        while cpr(&probe) == Equal {
            // exits at the first non-equal item
            if probe == *limit {
                step(&mut probe);
                break;
            };
            step(&mut probe);
        }
        if up {
            probe
        } else {
            probe + one
        } // into Range limit
    };

    // Checking end cases
    if range.is_empty() {
        return range;
    };

    match cmpr(&range.start) {
        Greater => {
            return range.start..range.start;
        } // item is before the range
        Equal => {
            if cmpr(&lasti) == Equal {
                return range;
            }; // all in range match
            return range.start..scan(&range.start, &lasti, cmpr, true);
        }
        _ => (),
    };
    match cmpr(&lasti) {
        Less => {
            return range.end..range.end;
        } // item is after the range
        Equal => {
            return scan(&lasti, &range.start, cmpr, false)..range.end;
        }
        _ => (),
    };
    // Binary search
    let mut hi = lasti; // initial high index
    let mut lo = range.start; // initial low index
    loop {
        let mid = lo + (hi - lo) / two; // binary chop here with truncation
        if mid > lo {
            // still some range left
            match cmpr(&mid) {
                Less => lo = mid,
                Greater => hi = mid,
                Equal => {
                    return scan(&mid, &range.start, cmpr, false)..scan(&mid, &lasti, cmpr, true)
                }
            }
        } else {
            return hi..hi;
        }; // interval is exhausted, val not found
    }
}
programmerjake commented 2 years ago

once you've found one equal item, rather than using a linear search in scan, imho using an exponential search to find the first not-equal item and then binary searching between that and the last known equal item would be much more efficient, with worst-case runtime O(log N) rather than O(N)
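A minimal sketch of that galloping step (hypothetical helper; it assumes cmp(hit) == Equal and that end is the exclusive search limit):

use std::cmp::Ordering;

// Find the exclusive end of the run of Equal items starting at `hit` by
// galloping in steps of 1, 2, 4, ... and then binary searching inside the
// last gallop interval: O(log n) worst case instead of O(n).
fn run_end(hit: usize, end: usize, mut cmp: impl FnMut(usize) -> Ordering) -> usize {
    let mut step = 1;
    let mut lo = hit; // last index known to be Equal
    let mut hi = end; // exclusive upper bound for the run
    while lo + step < end && cmp(lo + step) == Ordering::Equal {
        lo += step;
        step *= 2;
    }
    if lo + step < end {
        hi = lo + step; // first index known not to be Equal
    }
    // ordinary binary search for the boundary in (lo, hi)
    while lo + 1 < hi {
        let mid = lo + (hi - lo) / 2;
        if cmp(mid) == Ordering::Equal { lo = mid } else { hi = mid }
    }
    hi
}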

timvermeulen commented 2 years ago

I've seen exponential search used for this, too — still O(log n) in the worst case, but clearly better (or at least fewer comparisons) than binary search for a small number of equal elements.

liborty commented 2 years ago

Great advice, thanks! Anything else you can think of?

liborty commented 2 years ago

I have just realised that I can get even better results by reusing the last bounds of the already performed binary search! Coming up in the next version.

CAD97 commented 2 years ago

The bounds may look complicated but it is not my fault that there is no Countable trait in Rust.

As I've been saying, that trait is Step.

How do you propose to find quickly a midpoint of an uncountable index range?

let distance = Step::steps_between(&start, &end)?;
let mid = Step::forward_unchecked(start, distance / 2);

With Range<impl Step>, it's trivially possible to bridge to working Range<usize> if you're okay with extra clones of the index type:

pub fn binary_find<K: Step>(range: Range<K>, cmp: impl Fn(&K) -> Ordering) -> Range<K> {
    let trans = |i: usize| Step::forward_checked(range.start.clone(), i).unwrap();
    if let Some(length) = Step::distance_between(&range.start, &range.end) {
        let out = your_binary_find(0..length, |i| cmp(&trans(i)));
        trans(out.start)..trans(out.end)
    } else {
        range
    }
}

(Sorry about the broken formatting in my last post; I did it on a phone, without checking the output, just before I went to sleep.)

liborty commented 2 years ago

That sounds good. Do you have any clearer idea of how it actually obtains the distance_between? Because if it just linearly counts the steps, that would be too slow. If not, then I will attempt to change to it next.

Here is my current version, which implements everything up to here except the Step bound. It now includes binary search for the end(s) of the matching range, restricted to the confines of the final range of the binary search that found the initial match. Which is a great idea, even if I say so myself :)

Edit: I can find only steps_between, and that is a nightly-only experimental feature.

programmerjake commented 2 years ago
pub fn binary_find<K: Step>(range: Range<K>, cmp: impl Fn(&K) -> Ordering) -> Range<K> {
    let trans = |i: usize| Step::forward_checked(range.start.clone(), i).unwrap();
    if let Some(length) = Step::distance_between(&range.start, &range.end) {
        let out = your_binary_find(0..length, |i| cmp(&trans(i)));
        trans(out.start)..trans(out.end)
    } else {
        range
    }
}

so if you want to search in the range 0..!0u128 it just always returns the input range!? that's just plain broken imho. Step by itself currently doesn't have the functionality needed to efficiently binary search ranges larger than usize.

CAD97 commented 2 years ago

I keep saying the same things.

e.g. something like this (untested):

fn split(x: i128) -> (i64, u64) {
    (((x as u128) >> 64) as u64 as i64, x as u128 as u64)
}
fn unsplit(hi: i64, lo: u64) -> i128 {
    lo as i128 | ((hi as u64 as u128) << 64) as i128
}

fn search(range: Range<i128>, cmp: impl Fn(i128) -> Ordering) -> Range<i128> {
    let cmp_hi = |&hi| cmp(unsplit(hi, 0));
    let hi = binary_find(split(range.start).0..=split(range.end).0, cmp_hi);
    let cmp_lo = |hi| |&lo| cmp(unsplit(hi, lo));
    let lo = binary_find(0..=u64::MAX, cmp_lo(hi.start)).start..binary_find(0..=u64::MAX, cmp_lo(hi.end)).end;
    unsplit(hi.start, lo.start)..unsplit(hi.end, lo.end)
}

(requires binary_find support for RangeInclusive / finite impl RangeBounds, and I remain sad that ranges being Iterator instead of just IntoIterator means RangeInclusive doesn't expose fields.)