Open oberien opened 7 years ago
It might be cleaner to turn Intersection
into an iterator filter that can be chained. Worth keeping in mind as a possible alternative.
While that does seem pretty nice from the API perspective, iterators are meant to be iterated over only once. Given an arbitrary iterator, we'd need to create a new iterator with a buffer which we collect all items into, just to be able to perform a nested loop join. While this can be very optimized with specialization for sets, it is very heavy and imperformant for most other cases. So in my opinion it does not fit the nearly no-cost iterator operations we have currently.
Intersections speed can be optimized by using the same BuildHasher
for both HashMap
s and tweaking the intersection algorithm to dig into the guts of the HashMap
and exploit that both use the same hasher. It might be easier to express that dependence statically with dependent types, but doing so may go beyond even @ticki's proposals. You can exploit it dynamically though and the multiple intersection case can exploit it even more :
Add Ord
, Eq
, etc. to RandomState
. Make your Intersect function ask for HashMap<K,V,B>
where B: BuildHasher+Ord+Eq
. First it sorts the HashMap
s according to their keys B
s and runs the fast intersection the HashMap
s with equal B
s. Next it runs the slower intersection on the remaining HashMap
s with unequal B
s. In principle, you might prefer trait objects &BuildHasher+Ord+Eq
but that'd require changing HashMap
itself.
Just fyi, this speedy intersection is completely trivial for Cuckoo tables, and O(n)
, but you can do a speedy one for Robin-Hood hash tables like HashMap
too that should be amortized O(n)
. I'd assume you get O(n log n)
for intersections that do not exploit equality of B
.
I guess currently the way to do it is with fold
? But it looks rather ugly, note also the difference between union
and intersection
:
let mut sets: Vec<HashSet<char>> = Vec::new();
sets.push(['a', 'b', 'c', 'd'].iter().cloned().collect());
sets.push(['c', 'd'].iter().cloned().collect());
sets.push(['d', 'a'].iter().cloned().collect());
let union = sets
.iter()
.fold(HashSet::new(), |acc, hs| {
acc.union(hs).cloned().collect()
});
let intersection = sets
.iter()
.skip(1)
.fold(sets[0].clone(), |acc, hs| {
acc.intersection(hs).cloned().collect()
});
println!("Union: {:?}", union);
println!("Intersection: {:?}", intersection);
Or is there a better way?
@Stannislav heh, doing Advent of Code? 😛
This is what I did:
let mut lines = group
.lines()
.map(|line| line.chars().collect::<HashSet<_>>());
let mut s = lines.next().unwrap();
for line in lines {
s = s.intersection(&line).copied().collect();
}
I agree it's ugly though. I think the unstable fold_first
method on iterators would make this nicer. That said, it would be nice not to have to continually collect back into a HashSet
.
FWIW, I think what we're missing is iterator adaptors that work on already-sorted data. It should be possible to implement efficient N-way intersection on sorted iterators without introducing a HashSet at all.
@Stannislav heh, doing Advent of Code? 😛
Ha, got me 🎄 ! Here's my solution
Anyway, regarding the actual topic, I saw there's an external reduce
crate, I haven't tried it though. I'm surprised there's no reduce
in the standard library.
Edit: just found the fold_first
dev issue here, seems like reduce
is one of the proposed names as well.
Currently the method intersection of HashSet and BTreeSet only support an intersection between two sets. If you want to intersect multiple sets, for example 3, you'll need to calculate the intersection between the first two, collect them into another set and calculate the intersection between that immediate result and the third set. This is very intense on computation, memory usage and lines of code.
I'd like to suggest the following:
Intersection
structs to contain a list of other sets to intersect with instead of a reference to a single other set. This applies tostd::collections::btree_set::Intersection
as well asstd::collections::hash_set::Intersection
.intersect_many
(orintersect_multiple
) onHashSet
andBTreeSet
which take a list of other sets to all intersect with the current one.intersect
andintersect_many
(orintersect_multiple
) to theIntersection
structs as well. Those can only be called if theIntersection
hasn't been iterated over yet. Otherwise, they'll panic.If (3.) is implemented, the "list of sets to intersect with" will need to be a
Vec
in order to be growable after creation. For performance reasons, the third suggestion could be dropped and the "list" can instead be a reference to the provided slice.The current implementation of
{Hash,BTree}Set::intersection
would need to be changed to pass&[other]
instead ofother
. While this request should be relatively easy to implement forHashSet
(self.others.iter().all(|set| set.contains(elt))
, implementing it forBTreeSet
could result in a lot more code.Unresolved Questions:
intersect_many
vsintersect_multiple
vs ???Intersection
struct) in favor of using an allocation free slice-reference instead of aVec
?