Support for Intersection of multiple Sets at the same time

oberien commented 7 years ago

Currently the method intersection of HashSet and BTreeSet only support an intersection between two sets. If you want to intersect multiple sets, for example 3, you'll need to calculate the intersection between the first two, collect them into another set and calculate the intersection between that immediate result and the third set. This is very intense on computation, memory usage and lines of code.

I'd like to suggest the following:

Change the current implementations of the Intersection structs to contain a list of other sets to intersect with instead of a reference to a single other set. This applies to std::collections::btree_set::Intersection as well as std::collections::hash_set::Intersection.
Implement intersect_many (or intersect_multiple) on HashSet and BTreeSet which take a list of other sets to all intersect with the current one.
Add intersect and intersect_many (or intersect_multiple) to the Intersection structs as well. Those can only be called if the Intersection hasn't been iterated over yet. Otherwise, they'll panic.

If (3.) is implemented, the "list of sets to intersect with" will need to be a Vec in order to be growable after creation. For performance reasons, the third suggestion could be dropped and the "list" can instead be a reference to the provided slice.

The current implementation of {Hash,BTree}Set::intersection would need to be changed to pass &[other] instead of other. While this request should be relatively easy to implement for HashSet (self.others.iter().all(|set| set.contains(elt)), implementing it for BTreeSet could result in a lot more code.

Unresolved Questions:

Naming: intersect_many vs intersect_multiple vs ???
Don't implement the third suggestion (functions on the Intersection struct) in favor of using an allocation free slice-reference instead of a Vec?

le-jzr commented 7 years ago

It might be cleaner to turn Intersection into an iterator filter that can be chained. Worth keeping in mind as a possible alternative.

oberien commented 7 years ago

While that does seem pretty nice from the API perspective, iterators are meant to be iterated over only once. Given an arbitrary iterator, we'd need to create a new iterator with a buffer which we collect all items into, just to be able to perform a nested loop join. While this can be very optimized with specialization for sets, it is very heavy and imperformant for most other cases. So in my opinion it does not fit the nearly no-cost iterator operations we have currently.

burdges commented 7 years ago

Intersections speed can be optimized by using the same BuildHasher for both HashMaps and tweaking the intersection algorithm to dig into the guts of the HashMap and exploit that both use the same hasher. It might be easier to express that dependence statically with dependent types, but doing so may go beyond even @ticki's proposals. You can exploit it dynamically though and the multiple intersection case can exploit it even more :

Add Ord, Eq, etc. to RandomState. Make your Intersect function ask for HashMap<K,V,B> where B: BuildHasher+Ord+Eq. First it sorts the HashMaps according to their keys Bs and runs the fast intersection the HashMaps with equal Bs. Next it runs the slower intersection on the remaining HashMaps with unequal Bs. In principle, you might prefer trait objects &BuildHasher+Ord+Eq but that'd require changing HashMap itself.

Just fyi, this speedy intersection is completely trivial for Cuckoo tables, and O(n), but you can do a speedy one for Robin-Hood hash tables like HashMap too that should be amortized O(n). I'd assume you get O(n log n) for intersections that do not exploit equality of B.

Stannislav commented 3 years ago

I guess currently the way to do it is with fold? But it looks rather ugly, note also the difference between union and intersection:

    let mut sets: Vec<HashSet<char>> = Vec::new();
    sets.push(['a', 'b', 'c', 'd'].iter().cloned().collect());
    sets.push(['c', 'd'].iter().cloned().collect());
    sets.push(['d', 'a'].iter().cloned().collect());

    let union = sets
        .iter()
        .fold(HashSet::new(), |acc, hs| {
            acc.union(hs).cloned().collect()
        });
    let intersection = sets
        .iter()
        .skip(1)
        .fold(sets[0].clone(), |acc, hs| {
            acc.intersection(hs).cloned().collect()
        });

    println!("Union: {:?}", union);
    println!("Intersection: {:?}", intersection);

Or is there a better way?

Diggsey commented 3 years ago

@Stannislav heh, doing Advent of Code? 😛

This is what I did:

            let mut lines = group
                .lines()
                .map(|line| line.chars().collect::<HashSet<_>>());
            let mut s = lines.next().unwrap();
            for line in lines {
                s = s.intersection(&line).copied().collect();
            }

I agree it's ugly though. I think the unstable fold_first method on iterators would make this nicer. That said, it would be nice not to have to continually collect back into a HashSet.

Diggsey commented 3 years ago

FWIW, I think what we're missing is iterator adaptors that work on already-sorted data. It should be possible to implement efficient N-way intersection on sorted iterators without introducing a HashSet at all.

Stannislav commented 3 years ago

@Stannislav heh, doing Advent of Code? 😛

Ha, got me 🎄 ! Here's my solution

Anyway, regarding the actual topic, I saw there's an external reduce crate, I haven't tried it though. I'm surprised there's no reduce in the standard library.

Edit: just found the fold_first dev issue here, seems like reduce is one of the proposed names as well.

rust-lang / rfcs

Support for Intersection of multiple Sets at the same time #2023

Unresolved Questions: