Binary vs unary predicates

natecook1000 commented 8 years ago

With respect to Dave, I think focusing on unary predicates over binary ones is a mistake for a few different reasons:

The standard library uses a consistent binary predicate for ordering collections. You can use the exact same binary predicate when calling sort, partition, minElement, and maxElement. Requiring a different predicate would make it harder to use these new APIs and inconsistent with what's there.

Using a binary predicate makes the Comparable and predicate versions parallel. Like the existing methods described above, using a binary predicate would make these versions the same, allowing easy reversal of the direction of the sort and interoperation between methods.

var a = [5, 3, 2, 6, 4, 1]
// This is equivalent to 'a.sort(isOrderedBefore: <)':
a.sort()                               // [1, 2, 3, 4, 5, 6]
a.binarySearch(2)                      // Optional(1)

a.sort(isOrderedBefore: >)             // [6, 5, 4, 3, 2, 1]
a.binarySearch(2, isOrderedBefore: >)  // Optional(4)

Using a unary predicate requires flipping the semantics for upperBound. This is a problem, since equalRange can be implemented in terms of lowerBound and upperBound if the two methods use the same predicate, but not if one uses isOrderedBefore and one uses isOrderedAfter. The same problem is seen in the implementation of binarySearch: without a binary predicate, there's no way to test the equivalence of the lower bound.

What do you think? In my view the most appropriate method signatures would be:

// where Generator.Element: Comparable
func lowerBound(value: Generator.Element) -> Index
func upperBound(value: Generator.Element) -> Index
func equalRange(value: Generator.Element) -> Range<Index>
func binarySearch(value: Generator.Element) -> Index?

// general case
func lowerBound(
    value: Generator.Element,
    isOrderedBefore isOrderedBefore: (Generator.Element, Generator.Element) -> Bool
    ) -> Index
func upperBound(
    value: Generator.Element,
    isOrderedBefore isOrderedBefore: (Generator.Element, Generator.Element) -> Bool
    ) -> Index
func equalRange(
    value: Generator.Element,
    isOrderedBefore isOrderedBefore: (Generator.Element, Generator.Element) -> Bool
    ) -> Range<Index>
func binarySearch(
    value: Generator.Element,
    isOrderedBefore isOrderedBefore: (Generator.Element, Generator.Element) -> Bool
    ) -> Index?

lorenzoracca commented 8 years ago

I agree to using a binary predicate too as it is similar to the rest of the API. However, I still do not know how to implement the binarySearch function on it.

natecook1000 commented 8 years ago

If you have a binary predicate, you can flip the order of the parameters to check for equivalence. Let's say you're searching for value using isOrderedBefore: (Element, Element) -> Bool:

Perform a lowerBound search for value to find index i.
The element at the returned position (self[i]) is guaranteed to be not less than value. That is, self[i] >= value.
To see if you've matched value, you need to try to rule out the self[i] > value case. Then you're left with only self == value.
Because you have a binary predicate, you can test self[i] > value with isOrderedBefore(value, self[i]) (since the predicate is required to be symmetric).

j-haj commented 8 years ago

I'm not sure how to link a direct convo from the swift-evolution list but here is a quote from Dave Abrahams to my initial email:

The general form takes a closure that partitions the collection; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2001/n1313.html and http://wg21.cmeerw.net/lwg/issue270

I would suggest avoiding the API choice that requires both a comparison closure and an element against which to compare, as a unary partitioning predicate is much simpler.

It sounds like the core team would prefer an unary operator

natecook1000 commented 8 years ago

Interesting, thanks for those links! While it's true that lowerBound can use a unary predicate with a lesser requirement (isOrderedBefore has to only partition the collection rather than be a strict weak ordering), we lose the ability to compose functions properly if they take different predicates. A merge sort algorithm does require a predicate that imposes a strict weak order, and would benefit from using lowerBound and upperBound, so if compatible versions of those methods didn't exist we'd probably still want them. The version with a unary predicate seems like a related, but separate, API.

I think the case for a binary predicate is strong enough that we could convince people of its merits. What do you think is the best approach?

lorenzoracca commented 8 years ago

Thanks for the links Jeff! However, I think that given the impossibility to keep the functions working properly (i.e. that regardless the predicate comparison operator it still gives the right result) with a unary predicate, we should stick to binary.

I think the committee would prefer code that works in every case, even if a little harder to understand, than a not fully working one.

Or if we really were to use it (which honestly would make our lives even easier!), we'd have to thoroughly explain the usage (and its inherent limits) via docs.

Should we ask Dave again? I exchanged few emails with Apple swift developer Dmitri Gribenko for little advice before contacting the newsletter, I could ask him again what he thinks about our problem.

That's my opinion though :)

Haravikk commented 8 years ago

While I'd love to be able to just pass the same predicate in, I don't like having to also pass a value to satisfy it.

Is it possible that we could provide both forms? For example:

func lowerBound(isOrderedBefore:(T)->Bool) -> Index { … }
func lowerBound(value:T, predicate:(T,T)->Bool) -> Index {
    return self.lowerBound(){ predicate($0, value) }
}

This could be a good middle-ground, as lets developers pass a predicate directly if they want to, alongside an element to compare against, otherwise for calling with a trailing closure the unary option is simpler.

I've been working on an ordered collection type recently, and I made the choice early on to not require Comparable elements, and instead use an isOrderedBefore closure keep everything sorted. The problem I came across was efficiently implementing the contains() and indexOf() methods that take closures, as these cannot be satisfied in anything other than a linear fashion, so I had to settle for separate methods that were more efficient thanks to partitioning predicates, but I tried a bunch of different combinations including binary predicate with an element, but I just didn't like using it personally, even when passing in the same isOrderedBefore predicate I'd used to sort the elements in the first place, so I've decided to just provide both forms (unary + binary with target) instead so I can use what I like, and others can use what they like.

j-haj commented 8 years ago

@Haravikk I think the issue is in getting everything to play nicely together if we use the unary predicate.

@natecook1000 @lorenzoracca I think the unary predicate approach is best if it's possible. Of course the issue is we haven't been able to get the composition working properly if we take this approach. I propose we move forward using a binary predicate with search value. It shouldn't be too much work to get everything up and running with this approach. Maybe in parallel (or after we get the binary approach working) we message Dimitri and/or Dave and ask for their input on how to handle upperBound. Right now I am leaning towards it not being possible to implement upperBound using the unary predicate from lowerBound.

What do you all think?

Edit: I just re-read @natecook1000 's post and like the idea of making two sets of implementations, one that is a little more restricted in terms of requiring the collection to be Comparable and another more general implementation that requires a binary predicate and search value. I vote we got that route for now until we get some more input from the core team.

lorenzoracca commented 8 years ago

I agree with you @j-haj . I think the way we should move forward is as @natecook1000 put it, with two sets of implementations following different extension rules.

By the way, thank you all for the continued support and (to whom it may concern) have a good Spring Break.

Haravikk commented 8 years ago

Ah I see, I think I've been misunderstanding, yeah it seems they may need the binary predicate plus element, as a unary predicate lower bound would be synonymous with a unary predicate binary search anyway.

For reference though, here's the code I'm using right now (maybe not 100% exactly, I had to type out from memory):

extension CollectionType {
    func binarySearch(isOrderedBefore:(Generator.Element)->Bool) -> Index {
        var low = self.startIndex, high = self.endIndex
        while low != high {
            let mid = low.advancedBy(low.distanceTo(high) / 2)
            if isOrderedBefore(self[mid]) { low = mid.successor() }
            else { high = mid }
        }
        return low
    }
}

let myArray = [1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9]
let isOrderedBefore:(Int, Int)->Bool = { $0 < $1 }
let theTarget = 5

let lowerBound = myArray.binarySearch({ isOrderedBefore($0, theTarget) })
let upperBound = myArray.binarySearch({ !isOrderedBefore(theTarget, $0) })
let isMatch = (lowerBound != myArray.endIndex) && !isOrderedBefore(theTarget, myArray[lowerBound])
let range = lowerBound ..< upperBound

Throw that into a playground and range will be 4 ..< 7 as expected (and isMatch is true). This only needs one search function and seems very elegant to me.

To define these as methods you could do:

extension CollectionType {
    func lowerBound(theTarget:Generator.Element, isOrderedBefore:(Generator.Element, Generator.Element)->Bool) -> Index {
        return self.binarySearch({ isOrderedBefore($0, theTarget) })
    }

    func upperBound(theTarget:Generator.Element, isOrderedBefore:(Generator.Element, Generator.Element)->Bool) -> Index {
        return self.binarySearch({ !isOrderedBefore(theTarget, $0) })
    }
}

In this case the binary search itself is a unary predicate, but the more specific cases actually take binary ones that map nicely onto the underlying search.

Removed part about limiting by range.

natecook1000 commented 8 years ago

@Haravikk Indices are shared between collections and their slices, so instead of this:

let i = mySortedCollection.binarySearch(for: aValue, withinRange: aRange)
print(mySortedCollection[i!])

you do this:

let i = mySortedCollection[aRange].binarySearch(for: aValue)
print(mySortedCollection[i!])

That way you don't need a version of the method that takes a range, too.

j-haj commented 8 years ago

Just to keep this convo alive, here is a message Dave Abrahams sent the list this afternoon:

Having an overload that accepts a binary predicate is certainly a nice convenience, but the most general formulation takes a unary predicate that “partitions” the collection, i.e. returns false for the first N elements of the collection and returns true for the rest.

IMO it's important to expose the unary predicate version. Lots of times, the thing you want to compare against doesn't have the same type as the elements of the collection. For example, you might have a collection of key-value pairs where you just want to compare against the keys, and you may not even be able to create an instance of the whole element. For more on this, see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2001/n1313.html

Dave

Haravikk commented 8 years ago

Personally I'm leaning towards the solution being that the binary search method take a unary predicate, and that we don't bother with lower/upper bounds as they just add noise; the binary search method is the important part and would implement both other methods anyway, plus the developer is then free to transform whatever they like into a unary predicate.

In my example above I think the code is pretty simple without the addition of the lower/upper bound methods; a clear example of do an upper bound search in the binary search documentation should be sufficient IMO.

j-haj commented 8 years ago

@Haravikk your example above I think illustrates why this isn't as simple as it seems. Using your binarySearch implementation, I get the following results:

let unaryPred = { $0 < 3 } // This is the predicate we will test
let test = [1, 2, 3, 4, 5] // should return true
let test2 = [2, 1, 3, 5, 4] // should return true
let test3 = [1, 2, 4, 5] // should return false

print(test.binarySearch({$0 < 3})) // 2 - Correct
print(test2.binarySearch({$0 < 3})) // 2 - Correct
print(test3.binarySearch({$0 < 3})) // 2 -Incorrect

Because the problem we are facing is the same for both binarySearch and upperBound, once we get binarySearch working with a unary predicate, the upperBound implementation will be trivial. Personally, I'm not convinced this is possible unless we can return something other than True/False from the predicate. Then again, maybe I'm misunderstanding the true nature of binarySearch.

Additionally, it might be worth asking more people on their thoughts of returning a Bool vs Index. Personally, I'm in favor of a Bool implementation as it more closely aligns with the C++ implementation and Swift already as the indexOf(_:) method.

Haravikk commented 8 years ago

Actually the result of test3 is correct; the index returned is where the value 3 should be, i.e- in that case it is the insertion index that you would use to do test3.insert(3, atIndex: 2), thus adding it into the correct position, i.e- the index returned is always useful, you still have to determine whether it's correct if you wanted to test for the existence of an element. The method may also return .endIndex, indicating that all elements failed to match, i.e- the new element should be appended.

If you need to verify that a value was actually found then you would do the following:

let isOrderedBefore = { $0 < $1 }
let theIndex = test3.binarySearch({ isOrderedBefore($0, 3) })
let isMatch = (theIndex != test3.endIndex) && !isOrderedBefore(3, $0)

The last line here tests whether theIndex is actually the item you wanted by checking that it isn't the end index and that the target value isn't greater than the value at the given index. Basically, due to the way the search works, you know that if theIndex < .endIndex then the element it identifies is less than 3, so if it also isn't greater than 3 then you have a match, without needing a test for equality.

Also, test2 only succeeds by accident; .binarySearch() should only be called on sorted collections, otherwise you should use .indexOf(), a requirement of calling .binarySearch() should be that the collection was sorted first (or is sorted automatically).

j-haj commented 8 years ago

I disagree. binarySearch should tell you if an item is in the collection and I think it is confusing to change that meaning. Suppose I wanted to search a dictionary for the word "motel", it is counterintuitive to me that a binarySearch algorithm would return a positive result (either true or an index) if the dictionary, in fact, did not contain the word "motel". A more concrete example, if I asked a store attendant if they had shampoo and they gave me an aisle number, I would assume that means "yes, we have shampoo and you can find it here" and not "we may or may not have shampoo, if we do it's located here".

Your second point actually illustrates the issue at hand - to validate whether the element is found requires a binary predicate (since you are swapping the comparison order of the search value).

Lastly, test2 is an important test case. We don't want to require that the collection is sorted, rather we only need the collection to be partitioned (i.e., partially ordered) by the search value (e.g., [2, 1, 3, 5, 4] is partitioned by 3).

Haravikk commented 8 years ago

I disagree that it should return a useful value only if something is found. The binary search gets as close as it can to finding the value, but it may not actually find it, and that's a very useful property IMO, as it means that even if you don't get a match you do get an insertion index.

The issue comes down to one of naming; for example, you could do:

extension CollectionType {
    func sortedIndexOf(theNeedle:Generator.Element, isOrderedBefore:(Generator.Element, Generator.Element) -> Bool) -> Index? {
        let theIndex = self.binarySearch({ isOrderedBefore($0, theNeedle) })
        if (theIndex != self.endIndex) && !isOrderedBefore(theNeedle, self[theIndex]) { return theIndex }
        return nil
    }
}

extension CollectionType where Self.Generator.Element:Comparable {
    func sortedIndexOf(theNeedle:Generator.Element) -> Index? {
        return self.sortedIndexOf(theNeedle) { $0 < $1 }
    }
}

Point is, it's better to have the more flexible implementation, as everything else can be built around it; don't like the name binary search then name it something else, but the last thing I want is a binary search that I can only use for a single purpose.

j-haj commented 8 years ago

I like the idea of a nearestIndexToElement type function, but I think that is a separate proposal from binarySearch

EDIT: At a minimum, I think this needs to be discussed on swift-evolution as it is a pretty material departure from the standard interpretation of binarySearch.

natecook1000 commented 8 years ago

There are a few different algorithms here that are all tightly related. I think there are separate questions at work at once: what algorithms should be in the standard library? What should those algorithms be called? Just so we can have a common frame of reference, these are the relevant methods that the C++ STL defines:

partition_point (part of C++11): a binary search using a unary predicate that returns an index that divides the collection into two parts, one where every element satisfies the predicate and one where every element fails the predicate. The implementation is precisely what @Haravikk posted above as a unary binarySearch.
lower_bound: a binary search using a value and a binary predicate (which can be omitted for comparable types) that returns the lower insertion point for the value (Haravikk's implementation is correct, this can be written using partition_point).
upper_bound: a binary search using a value and a binary predicate (which can be omitted for comparable types) that returns the upper insertion point for the value (again, Haravikk's implementation is correct, this can be written using partition_point). If the value is not in the collection, upper_bound and lower_bound return the same index.
binary_search: a binary search using a value and a binary predicate (which can be omitted for comparable types) that returns true iff the value is found.
equal_range: a binary search using a value and a binary predicate (which can be omitted for comparable types) that returns the lower and upper bound of the value.

I think all of these serve their own purpose, except for the binary_search method—having it return a Boolean makes you search again if you want to do anything with the found element, which stinks. An optional index would be the perfect return value here, since it communicates both the presence and the location of the value.

I've definitely come around to the importance of providing a method that takes a unary predicate, but I think it should be a more general partition_point implementation rather than the lower or upper bound methods, which are derived versions of that. The name of that method is tricky, since the Swift partition doesn't act like the C++ version of partition—Swift's uses a binary predicate and picks its own partition point rather than accepting a unary predicate and partitioning the collection on that. Swift's partition feels like an exposed internal method of the sort algorithm rather than a useful API on its own. Do you see any point in having Swift's current partition method outside of an intermediate sorting step?

A slightly more ambitious proposal would be to replace partition with a unary predicate version and add partitionPoint to the methods I listed above. I'll list my proposed methods below.

The binary_search and equal_range methods can't be implemented with a unary predicate, because a unary predicate doesn't provide for an equivalence test.

/// Rearranges the elements of the collection and returns a pivot where every
/// element in `self[startIndex..<pivot]` satisfies the predicate and every element
/// in `self[pivot..<endIndex]` fails the predicate.
///
/// NOTE: This replaces the existing `partition()` method, which should be renamed
/// or moved to be an internal implementation detail of `sort()`.
mutating func partition(predicate: (Generator.Element) -> Bool) -> Index
/// Returns a pivot index where every element in `self[startIndex..<pivot]` satisfies the 
/// predicate and every element in `self[pivot..<endIndex]` fails the predicate.
/// - Precondition: The collection must already be partitioned according to `predicate`.
func partitionPoint(predicate: (Generator.Element) -> Bool) -> Index

// general search methods
func lowerBound(
    value: Generator.Element,
    isOrderedBefore isOrderedBefore: (Generator.Element, Generator.Element) -> Bool
    ) -> Index
func upperBound(
    value: Generator.Element,
    isOrderedBefore isOrderedBefore: (Generator.Element, Generator.Element) -> Bool
    ) -> Index
func equalRange(
    value: Generator.Element,
    isOrderedBefore isOrderedBefore: (Generator.Element, Generator.Element) -> Bool
    ) -> Range<Index>
func binarySearch(
    value: Generator.Element,
    isOrderedBefore isOrderedBefore: (Generator.Element, Generator.Element) -> Bool
    ) -> Index?

// where Generator.Element: Comparable
func lowerBound(value: Generator.Element) -> Index
func upperBound(value: Generator.Element) -> Index
func equalRange(value: Generator.Element) -> Range<Index>
func binarySearch(value: Generator.Element) -> Index?

Whatever we choose to propose at this point, we're going to have a healthy "Alternatives" section! :joy:

j-haj commented 8 years ago

@natecook1000 I really like this approach a lot! I emailed Dave and the evolution list this morning to see if we could get any more input on how he sees the implementation of a unary predicate based binarySearch. I agree that it doesn't seem possible because of the inability to handle equivalence, but since he suggested it I figured it'd be a good idea to check with him.

natecook1000 commented 8 years ago

Actually, if we add the partitionPoint method, do we need lowerBound and upperBound at all?

j-haj commented 8 years ago

I can't think of a scenario where paritionPoint and lowerBound give different results. It seems that upperBound is sufficiently different that it's worth keeping (especially for cases where the partition value is repeated). For symmetry it seems like it might be worth keeping lowerBound if we have upperBound, although having to methods that do the same thing seems like a bad idea in this case. I'm not sure on this one...

natecook1000 commented 8 years ago

If you have a binary predicate, upperBound can always be written like @Havarikk showed:

let upperBound = myArray.partitionPoint({ !isOrderedBefore(theTarget, $0) })

I think these methods are "advanced" enough that if a developer would be using upperBound they could figure out how to use partitionPoint for the same purpose...

j-haj commented 8 years ago

Ah ok - yea that makes sense. Actually, the more I think about it I think having partitionPoint is quite a bit more clear in its result than lowerBound and upperBound. Rather than the user dealing with whether or not certain points are included in the bound etc. they simply ask "at what point does this predicate partition this collection?"

natecook1000 commented 8 years ago

Alright, so new proposed APIs are below. @Haravikk & @lorenzoracca, what do you think?

/// Rearranges the elements of the collection and returns a pivot where every
/// element in `self[startIndex..<pivot]` satisfies the predicate and every element
/// in `self[pivot..<endIndex]` fails the predicate.
///
/// NOTE: This replaces the existing `partition()` method, which should be renamed
/// or moved to be an internal implementation detail of `sort()`.
mutating func partition(predicate: (Generator.Element) -> Bool) -> Index
/// Returns a pivot index where every element in `self[startIndex..<pivot]` satisfies the 
/// predicate and every element in `self[pivot..<endIndex]` fails the predicate.
/// - Precondition: The collection must already be partitioned according to `predicate`.
func partitionPoint(predicate: (Generator.Element) -> Bool) -> Index

// general search methods
func equalRange(
    value: Generator.Element,
    isOrderedBefore isOrderedBefore: (Generator.Element, Generator.Element) -> Bool
    ) -> Range<Index>
func binarySearch(
    value: Generator.Element,
    isOrderedBefore isOrderedBefore: (Generator.Element, Generator.Element) -> Bool
    ) -> Index?

// where Generator.Element: Comparable
func equalRange(value: Generator.Element) -> Range<Index>
func binarySearch(value: Generator.Element) -> Index?

If those APIs look okay, we can bikeshed the names. :laughing:

lorenzoracca commented 8 years ago

Hey I'm sorry guys, I'm on holiday and Wi-Fi is awful!

This looks like a big change from the typical binary search API, but that obviously isn't a problem. As long as we have the go from swift-evolution, I think these functions would actually be the best starting point for any programmer.

I'll start working on this asap.

Haravikk commented 8 years ago

This looks good, but I have few points:

Might want to clarify a bit more that partitioning has no further requirements on the order of elements before and after the partition point, just to reinforce that they're not necessarily being sorted by that operation.
Also .partitionPoint()'s precondition should perhaps mention that the collection can be either partitioned or sorted, since a collection sorted with .sort({ $0 < $1 }) will satisfy .partitionPoint({ $0 < 5 }) or whatever, which is where the method really becomes useful, as the actual partitioning step will give you the index anyway. Otherwise we might consider a .sortedIndexOf() method so that each operation has its corresponding lookup? Not sure what's best for this.

The only other question is whether .binarySearch() is the right choice of name? Obviously it's historically what other APIs use and how this is being implemented, but it may not really be necessary to convey how the search is performed, we just need developers to know that it's fast but requires their collections to be sorted first. Just wanted to ask about thoughts on calling it .sortedSearch() or something that's a bit less specific, as more specialised collections might have other ways to find elements (e.g- some kind of lookup table).

lorenzoracca / Swift-binary-search

Binary vs unary predicates #5