rust-itertools / itertools

Extra iterator adaptors, iterator methods, free functions, and macros.
https://docs.rs/itertools/
Apache License 2.0
2.72k stars 309 forks source link

Iterators generating (many) vectors #722

Closed Philippe-Cholet closed 1 year ago

Philippe-Cholet commented 1 year ago

Problem: slow because of repetitive heap-allocation

The 5 iterators below generate Vec<...> items.

multi_cartesian_product(self)
combinations(self, k: usize)
combinations_with_replacement(self, k: usize)
permutations(self, k: usize)
powerset(self)

It can be slow, because of the repetitive allocations, even if the user does not need to permanently store each permutation.

I think that in most cases, the user uses each vector to produce something else and then each vector is discarded. In that case, because those iterators generate (up to) many many items (product of the lengths, factorial-related, powers of two), it creates many temporary vectors which can slow down code execution.

Proposed solution: map slices

An alternative iterator could own one single (internal/private) vector and only give a reference to it to a function implementing FnMut(&[...]) -> R. Those iterators would then generate R items instead of vectors.

// current methods unchanged, new methods:
multi_cartesian_product_map<R, F>(self, func: F)
combinations_map<R, F>(self, k: usize, func: F)
combinations_with_replacement_map<R, F>(self, k: usize, func: F)
permutations_map<R, F>(self, k: usize, func: F)
powerset_map<R, F>(self, func: F)
// all with  F: FnMut(&[...]) -> R

Usage examples

it.combinations(k) and it.combinations_map(k, ToOwned::to_owned) would produce the same elements, but the first should be used!

it.combinations(k).map(closure) and it.combinations_map(k, closure) could produce the same elements if it does not require a vector. For what I tested, it's then roughly 4 to 5 times faster to go through all combinations, unless the iterator has very few elements (then .map should be used).

In it.permutations(5).filter_map(|perm: Vec<_>| some_option), if some_option does not need perm to be an owned vector then it could become it.permutations_map(5, |perm: &[_]| some_option).flatten() where there is no heap-allocation for each permutation.

Implementation (commits here)

New module vec_items (behind use_alloc feature) with:

E.g. for combinations_map:

Very similar for the 4 others. powerset[_map] is a bit simpler because it just uses combinations[_map] internally ; and it was simpler for multi_cartesian_product than anticipated.

Closing thoughts

Avoid cloning in MapSlice case? I think(?) it might be possible to only have references: MapSlice { func: F, vec: Vec<&T> }. Then F: FnMut(&[&T]) -> R and no cloning. But I struggle with the lifetimes (or more probably some hard barrier). But is it worth it? Probably not if it's just integers, but otherwise?

The documentation I did not wrote yet would suggest to use the non _map variants if the user needs owned vectors and not slices (or if there is very little elements expected).

PS: I got the idea while replacing (0..nb).permutations(nb) (after using flamegraph and heaptrack) by a _map version of mine of permutohedron::Heap to avoid any (internal) heap-allocation while keeping a nice iterator structure. And for 8! == 40320 times, that was also 5 times faster.

phimuemue commented 1 year ago

Related: https://github.com/rust-itertools/itertools/pull/682

Philippe-Cholet commented 1 year ago

I saw that PR some weeks or months ago but forgot about it, I only remember the crate name and that I thought it seemed quite complex. My approach with trait VecItems is simple but quite efficient and flexible enough to make an alternative iterator for any iterator generating vectors like I did here for all 5 iterators.

Philippe-Cholet commented 1 year ago

@phimuemue After diving in a bit, I agree that the problem lies with "Iterator" itself and that a "lending iterator" would be a better choice but I share the opinion that it should not be a maintenance burden to have an (evolving) additional dependency and that it would be preferable to wait it reaches the "core" if it is even a possibility (I don't know).

I like my workaround but I guess I should close this issue?

Easyoakland commented 1 year ago

@Philippe-Cholet I don't think you should close it.

@phimuemue This sounds similar to https://github.com/rust-lang/rust/pull/94667 and https://github.com/rust-lang/rust/pull/82413. This PR makes sense since it appears to be how this kind of thing is being handled in std. I still think the lending solution is better but it doesn't look like that will happen any time soon.

phimuemue commented 1 year ago

@Easyoakland Thanks for the link, I did not know about map_windows.

Still, I am unsure that going this way is the best solution:

To sum up: The proposed solution improves performance, but does not offer maximum convenience (due to complicating the API), and it is unclear if the performance could be even better (my guess is "it could be better with reasonable amount of work, depending on your use case").

As such, it "falls in the middle" (neither most convenient nor most performant). If that's impossible (or hard) to offer both convenience and performance, then the API should imho not sacrifice at both ends. Thus, I am reluctant to merge this, and would resort to using other, more efficient, algorithms if needed.

@Philippe-Cholet Do you happen to have a comparison for your permutations_map vs. your patched permutohedron::Heap_mapped? If Heap_mapped can be significantly faster than permutations_map, then I'd interpret this as a sign that having and maintaining permutations_map is questionable.

Philippe-Cholet commented 1 year ago

@phimuemue I agree it's in the middle as you say it. I think it remains convenient enough (and the old methods remain there unchanged) for a nice speed up but it definitely is not perfect (clone + populate whole vector + clear...), mainly because I did not have to change the internals too much. This workaround was generic enough to modify all 5 five iterators quite easily. Without it, I guess I would have to discard the lazy buffer for something else, and do different things for each iterator.

I did not compare permutations_map with "permutohedron::Heap_mapped" but the permutations are generated in different orders. Heap's algo is designed to be the fastest while permutations[_map] create permutations in the natural order. Compare them did not seem fair. But it might tell us something. ... Well it does tell us there is real room for improvement for permutations_map:

Time Permutations of (0..10).collect_vec() Comment
235ms permutohedron::Heap::new(&mut vec).map(\|_\| 0).last()
324ms permutations(vec.into_iter(), 10).map(\|_\| 0).last()
8.08ms heap::permutations_map(&mut vec, \|_\| 0).last() permutohedron::Heap wrapped by me
87.6ms permutations_map(vec.into_iter(), 10, \|_\| 0).last() MapSlice way

The first two lines tell us how much Heap's algo is faster than getting the natural order. Both allocates each item to a vector, while the last two do not. I guess we could have expected permutations_map to be roughly 8 times faster. I did not thought it would be that bad.

Well I'm closing this. And I'll reopen only if I try and really fasten it.

EDIT: It was a nice exercise and I learned that it has to be really fast if it's not the most convenient, no compromise.

phimuemue commented 1 year ago

@Philippe-Cholet Wow, thanks a lot for you effort.

I agree with all your conclusions.

Philippe-Cholet commented 1 year ago

For permutations (and 4 others), we could discard my Itertools::permutations_map<R,F> and add a method map_slices<R,F> that would convert Permutations<I> to PermutationsMap<I,F> to do the same job as before. It's just a nicer API in my opinion, exact same performance (not great yet but that might eventually change).

Name it map would be a nasty breaking change because it would override Iterator::map and has a quite different signature.