BloodStainedCrow opened 4 weeks ago
Unfortunately, `flat_map` is kind of a performance footgun, especially to `collect`. It causes another full round of parallel splitting every time, which is quite wasteful for just your 4 elements. It also prevents this from being an `IndexedParallelIterator`, so `collect` can't write results directly into a final `Vec` -- instead it has to create a bunch of temporary `Vec`s and then merge them later. In theory that would only result in about double the overall memory usage, but the allocator might not be releasing the memory back to the OS for so many differently-sized allocations.
You could use `flat_map_iter` for the first problem, so that part will run sequentially in each parallel job, but that doesn't totally solve the indexing / temporary-`Vec` problem. Since we can see ourselves that the result will be exactly `4 * len`, you could set that up a bit manually instead, something like:
```rust
let mut data = vec![0u8; num_iterations.len() * 4];
data.par_chunks_mut(4).zip(num_iterations).for_each(|(data, num_iterations)| {
    // fill in this chunk
});
```
Hmm, that makes sense. Thanks for the explanation. This seems quite unfortunate, especially since being a drop-in replacement for `iter` is the main selling point of `par_iter`, and `flat_map` is (in my experience) often used in this manner, especially given how large this performance footgun is.
But without specialization, I doubt there is anything that can be done to address this without also impacting uses of `flat_map` with large (and efficiently parallelizable) iterators. Unfortunate :(
The naive `flat_map` inner splitting comes from this line:
https://github.com/rayon-rs/rayon/blob/7b3ef64f3527f30446c9f91ac2e33df2f639cf3a/src/iter/flat_map.rs#L127
That could possibly be improved by keeping the same `plumbing::Splitter` instance from the outer to inner part, but I don't see any way to do that with any generic `ParallelIterator`. Even if that worked, it would at most achieve the same thing as `flat_map_iter`, still not as good as you can do with direct `data` writes.
It would be really cool if `FlatMap` could tell when the inner iterators have statically-known length, and then it could be a full `IndexedParallelIterator`. The standard library has an internal trait `ConstSizeIntoIterator` for this, used with real specialization, but maybe we can do similar even with our `opt_len` pseudo-specialization. I'm not sure though, because that would also need to be able to split at arbitrary points, including the middle of flattened iterators.
These are hard problems to make the user side easy. :) Did the workaround that I suggested fare better for you?
Yes. I switched from `flat_map` to `flat_map_iter`. Determining the amount of memory used was suddenly hard, since the program finished this step of the algorithm in fractions of a second (instead of minutes), but I think it is somewhere in the neighborhood of 700MB at most, which seems in line with your predictions. After that worked so well, I did not try preallocating the `Vec` and passing mutable chunks.
If I understand it correctly, `opt_len` does not help, for the same reason `ExactSizeIterator` does not help. It guarantees that we know the size of any one iterator, but unlike `ConstSizeIntoIterator` it does not guarantee ALL of them to be of this size.
It might be possible by adding a way to (if possible) get a len from the type directly (not an instance), for example adding an associated function `IndexedParallelIterator::const_len() -> Option<usize>`, since this function would return the same length for all iterators (if possible).
So a parallel iterator for a `[u32; 10]` would return `Some(10)`, while a parallel iterator for a slice would return `None`, since the size of a slice is not known at compile time.
With that it would be possible to do a sort of runtime specialization.
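The type-level idea can be sketched with a toy trait (hypothetical names, not the real rayon API): an array iterator carries `N` in its type, so an associated function can report it without any instance, while a slice iterator cannot.

```rust
use std::marker::PhantomData;

// Hypothetical sketch: a length recoverable from the type alone.
trait ConstLen {
    fn const_len() -> Option<usize>;
}

// Stand-in for a parallel iterator over an owned [T; N]: N is in the type.
struct ArrayIter<T, const N: usize>(PhantomData<[T; N]>);

// Stand-in for a parallel iterator over a slice: length only known at runtime.
struct SliceIter<'a, T>(PhantomData<&'a [T]>);

impl<T, const N: usize> ConstLen for ArrayIter<T, N> {
    fn const_len() -> Option<usize> {
        Some(N) // same for every instance of this type
    }
}

impl<'a, T> ConstLen for SliceIter<'a, T> {
    fn const_len() -> Option<usize> {
        None // a slice's length is not part of its type
    }
}

fn main() {
    assert_eq!(ArrayIter::<u32, 10>::const_len(), Some(10));
    assert_eq!(SliceIter::<u32>::const_len(), None);
}
```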
Crap, I fell into the same trap as the original proposal of `ConstSizeIterator`. The size of an iterator changes...
This would need to be a function on `IntoParallelIterator`.
I have experimented a bit with my idea for a possible `const_len` function. I have managed to convert `flat_map` such that it now uses the direct-write version of collect (`special_extend`).
This has improved the memory footprint (and therefore also the runtime, since most of the runtime seems to be memory allocation).
Where before, running `flat_map` on iterators with 10_000 elements (and 4-element arrays as sub-iterators) would consistently get the program SIGKILLed on Linux, it now uses approximately the same amount of memory as `flat_map_iter`, which seems like a step in the right direction.
This does not fix the slow runtime of `flat_map`; it is 15x slower than `flat_map_iter` in my test (which is not representative at all), but that is because the issue of splitting these tiny tasks is of course still a problem.
But I guess 15x slower vs. DNF is a start.
> If I understand it correctly, `opt_len` does not help, for the same reason `ExactSizeIterator` does not help. It guarantees that we know the size of any one iterator, but unlike `ConstSizeIntoIterator` it does not guarantee ALL of them to be of this size.
It doesn't help the inner iterator, but I meant that the outer `FlatMap::opt_len` can return a proper length if the inner iterator has a constant length, i.e. `runtime_len * const_inner_len`.
That is exactly what I did: https://github.com/BloodStainedCrow/rayon/blob/main/src/iter/flat_map.rs#L47.
If I understand `opt_len`, it does not guarantee that the inner len is constant, right?
That's why I added `IntoParallelIterator::const_length`.
> If I understand `opt_len`, it does not guarantee that the inner len is constant, right?
Definitely not, since it can examine `&self` for separate instances. Even a new constant method can't really guarantee that, especially as a safe trait to implement, but we can still make that a condition that will likely panic if violated.
> That's why I added `IntoParallelIterator::const_length`
This might be better on `ParallelIterator`, since we don't need to worry about `Iterator::next` consumption like the standard library. This way it still works after `into_par_iter()` and with many adaptors, and `FlatMap` can still call the associated `<PI as IntoParallelIterator>::Iter::const_len()` or whatever.
> Crap, I fell into the same trap as the original proposal of `ConstSizeIterator`. The size of an iterator changes...
For `ParallelIterator`, I don't think it does -- did this bite you in some way?
You are right, I thought I had to implement it on `IntoIterator` for that reason; implementing it on `Iterator` does make more sense here.
see below
The same assumptions about the length being correct should apply to `const_length` as to `opt_len`. If they are wrong, the program may crash or exhibit other nonsensical behavior, but no memory unsafety.
So as long as all "base" consumers which use `unsafe` do safety checks, everything should be fine.
The issue with moving it to `ParallelIterator` is that we do not have separate types for `ParallelIterator`s of e.g. arrays of different lengths. So an associated function like `const_length` cannot distinguish `[0; 100].into_par_iter()` from `[0; 10].into_par_iter()`.
Both are a `slice::IterMut` or whatever, so the type-level information about the length is lost. That was another reason to put it in the `IntoParallelIterator` trait.
We can distinguish different "instances", but before we start consuming the iterators we do not have an instance yet. And at that point it is too late.
Ah - `rayon::array::IntoIter` does know its constant `N`, but you're right that this is lost on the array -> `slice::Iter[Mut]`.
Maybe we can have both? The blanket `IntoParallelIterator for I: ParallelIterator` can forward the call, so stuff like `Map` can also propagate length, while array references can report their size directly. (But `array.par_iter().map(...)` won't be seen as constant regardless.)
I am not sure I understand you correctly.
Unless we change arrays' iter type to something like `ArrayParallelIter<N>` and have it be generic over the length of the original array, I don't think it is possible to obtain the length, or "prove" it is the same for all iterators passed into `flat_map`.
The current way to obtain a length for `flat_map` that I have implemented is this:
```rust
fn opt_len(&self) -> Option<usize> {
    let base_len = self.base.opt_len()?;
    let sub_len = PI::const_length()?;
    base_len.checked_mul(sub_len)
}
```
So `ParallelIterator` would need to have a function like `fn const_length() -> Option<usize>`, which I do not see how to implement without changing the types of array iterators (or others we want to have a size) to include the size in their generics...?
There are 3 implementations of `IntoParallelIterator` for arrays:
https://github.com/rayon-rs/rayon/blob/7b3ef64f3527f30446c9f91ac2e33df2f639cf3a/src/array.rs#L14-L39
`&[T; N]` and `&mut [T; N]` convert to slice iterators, where `N` is lost, but the value `[T; N]` converts to `array::IntoIter<T, N>`. All 3 could report their constant length in a new `IntoParallelIterator` method. If `ParallelIterator` also gets a new method, then `array::IntoIter` can report that, but the slice iterators can't.
- `array`, `array.into_par_iter()`, and `array.into_par_iter().map(...)` can report a constant length.
- `&array` can, but `(&array).into_par_iter()` / `array.par_iter()` can't, and so further `map` etc. can't either.
- `&mut array` can, but `(&mut array).into_par_iter()` / `array.par_iter_mut()` etc. can't.

`FlatMap` would still call the `IntoParallelIterator` method, but the blanket impl can expose the length from `ParallelIterator` too.
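The "have both" idea can be sketched with toy traits (hypothetical names, not the real rayon API): the `IntoParallelIterator`-style trait gets a default method forwarding to the iterator type, and array references override it because `N` lives on the reference even though their iterator is a slice iterator.

```rust
// Hypothetical sketch: a constant-length hook on both traits.
trait ParIter {
    fn const_len() -> Option<usize>
    where
        Self: Sized,
    {
        None
    }
}

trait IntoParIter {
    type Iter: ParIter;
    // Default: forward to the iterator type, so adaptors like Map propagate it.
    fn const_len() -> Option<usize> {
        Self::Iter::const_len()
    }
}

// Owned array: its iterator type keeps N, so forwarding just works.
struct ArrayIntoIter<const N: usize>;
impl<const N: usize> ParIter for ArrayIntoIter<N> {
    fn const_len() -> Option<usize> {
        Some(N)
    }
}
struct Array<const N: usize>;
impl<const N: usize> IntoParIter for Array<N> {
    type Iter = ArrayIntoIter<N>;
}

// Array reference: its iterator is a slice iterator where N is lost,
// but the reference type itself can still override the hook.
struct SliceIter;
impl ParIter for SliceIter {}
struct ArrayRef<const N: usize>;
impl<const N: usize> IntoParIter for ArrayRef<N> {
    type Iter = SliceIter;
    fn const_len() -> Option<usize> {
        Some(N) // N is known on the reference itself
    }
}

fn main() {
    assert_eq!(<Array<4> as IntoParIter>::const_len(), Some(4));
    assert_eq!(<ArrayRef<4> as IntoParIter>::const_len(), Some(4));
    assert_eq!(<SliceIter as ParIter>::const_len(), None);
}
```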
https://github.com/rayon-rs/rayon/blob/7b3ef64f3527f30446c9f91ac2e33df2f639cf3a/src/iter/mod.rs#L2435
`Cloned`, `Copied`, `Enumerate`, `Empty`, `Once`, `Map`, `Chain`... a lot of parallel iterators could benefit from this.
I think I see now.
You are right. This could be implemented for most things that implement `opt_len` now, only excluding slice-like things.
A bit unfortunate to miss out on `&array`/`&mut array` though.
With this it is also possible to conditionally have an `opt_len` for e.g. `flatten`, though it would of course need the same changes as `flat_map` in terms of consumption.
Okay, quick list of everything that now has `const_length`:
Any idea how to get `&array` in there too? Probably not without changing arrays to stop using slice iterators :(
Unfortunately this is a possibly-breaking change.
Never mind. In its current form it stops `ParallelIterator` from being object-safe, so it is a major change currently. I would have to restrict `const_length` to `where Self: Sized`; then it goes back to possibly-breaking.
Never mind again. Both of them are already not object-safe, so this change is a possibly-breaking change.
As an aside: why is `opt_len` not implemented for `UniformBlocks`? As far as I can tell it could be?
> Any idea how to get `&array` in there too?
`IntoParallelIterator for &[T; N]` and `&mut [T; N]` can have it, just not their associated type `Iter`. That's a gap, but we're still managing more than the standard library is able in this regard, so I think it's ok. (On that note, it's a little sad that `flat_map_iter` and `flatten_iter` can't be similarly improved here, even though they have the advantage on splitting.)
> As an aside: why is `opt_len` not implemented for `UniformBlocks`? As far as I can tell it could be?
I think it could, and `ExponentialBlocks` too, if they are weaned from `UnindexedConsumer` methods -- this should do it:
```diff
diff --git a/src/iter/blocks.rs b/src/iter/blocks.rs
index 1691e8b15c3e..29658de3e6cf 100644
--- a/src/iter/blocks.rs
+++ b/src/iter/blocks.rs
@@ -9,7 +9,7 @@ struct BlocksCallback<S, C> {
 impl<T, S, C> ProducerCallback<T> for BlocksCallback<S, C>
 where
-    C: UnindexedConsumer<T>,
+    C: Consumer<T>,
     S: Iterator<Item = usize>,
 {
     type Output = C::Result;
@@ -20,7 +20,7 @@ where
         // we need a local variable for the accumulated results
         // we call the reducer's identity by splitting at 0
-        let (left_consumer, right_consumer, _) = consumer.split_at(0);
+        let (left_consumer, right_consumer, mut reducer) = consumer.split_at(0);
         let mut leftmost_res = left_consumer.into_folder().complete();
         consumer = right_consumer;
@@ -36,13 +36,14 @@ where
             producer = right_producer;
             // split the consumer
-            let (left_consumer, right_consumer, _) = consumer.split_at(capped_size);
+            let (left_consumer, right_consumer, next_reducer) = consumer.split_at(capped_size);
             consumer = right_consumer;
-            leftmost_res = consumer.to_reducer().reduce(
+            leftmost_res = reducer.reduce(
                 leftmost_res,
                 bridge_producer_consumer(capped_size, left_producer, left_consumer),
             );
+            reducer = next_reducer;
         }
         leftmost_res
     }
```
(Our "specialized" `CollectConsumer` doesn't have a problem with `to_reducer`, but the `opt_len` docs say that no `UnindexedConsumer` methods should be used if you return `Some`, and a third-party consumer might care more.)
I have done some memory profiling. It seems that `flat_map` needs so much memory when collecting because it creates a `Vec` and appends it into the `LinkedList` for every element of the iterator.
So for
```rust
(0..1_000_000).into_par_iter().flat_map(|_| [0, 1, 2, 3]).collect()
```
it will create 4_000_000 linked-list elements. Each of these contains 2 pointers (next, prev) and the `Vec`, which is another 3 pointers, so 40 bytes total.
So for a final `Vec<u8>`, a `Node` of size 40 will be allocated for each byte in the final `Vec`, which is exactly the memory increase I observed at the start.
While this exact example would benefit from the `const_length` improvement,
```rust
(0..1_000_000).into_par_iter().flat_map(|_| rayon::iter::repeat(0).take(4)).collect()
```
would still not have this property.
> I have done some memory profiling. It seems that `flat_map` needs so much memory when collecting because it creates a `Vec` and appends it into the `LinkedList` for every element of the iterator.
Yes, that's the effect of the non-indexed (non-`opt_len`) collection, combined with the fact that inner `flat_map` splitting will certainly split a 4-item iterator entirely. With `flat_map_iter`, you'll still get a temporary `LinkedList<Vec<_>>`, but it should be better at filling each of those nodes with more items.
So if I understand correctly, `flat_map_iter` should still result in 1_000_000 `LinkedList` `Node` allocations?
No, with `flat_map_iter` you'll only get a new node for each of the dynamic splits of the outer iterator. That should not be per-item of `0..1_000_000`, but we can't really predict how many it will be due to work-stealing effects on splitting.
If `flat_map` always breaks the iterator down into single elements, maybe it is possible to create a `LinkedList<T>` instead of a `LinkedList<Vec<T>>`? That would bring down the wasted memory from 40 bytes to 16?
> No, with `flat_map_iter` you'll only get a new node for each of the dynamic splits of the outer iterator. That should not be per-item of `0..1_000_000`, but we can't really predict how many it will be due to work-stealing effects on splitting.
Ah, I see, and you mentioned `flat_map` will split unconditionally...
> If `flat_map` always breaks the iterator down into single elements,
It doesn't in general, but 4 items is so few that this case does end up on single elements.
Ah, unfortunate. So back to square one: the problem is that `flat_map` uses a naive split...
The only way I can currently see to stop `flat_map` from splitting each sub-iterator unconditionally is by introducing a way for a `Consumer` to request that no splitting happen. An example could be a function `Consumer::max_unconditional_split() -> Option<usize>`, which would be unrestricted for all basic `Consumer`s like `for_each` and `collect`, and would defer to the base for any `Consumer` with a base, like `map`. Then `flat_map` could wrap the real `Consumer` in a wrapper which could set this to 0 (or any other value that seems reasonable).
It could also be `Consumer::max_unconditional_split(&self)`, a method instead of an associated function?
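The proposal can be sketched with a toy trait (hypothetical names and a simplified `Consumer`, not the real rayon plumbing): basic consumers keep the unrestricted default, adaptors defer to their base, and a `flat_map`-style wrapper clamps the limit.

```rust
// Hypothetical sketch: a consumer-driven limit on unconditional splitting.
trait Consumer {
    // None = no restriction; Some(n) = don't split unconditionally below n items.
    fn max_unconditional_split(&self) -> Option<usize> {
        None
    }
}

// Basic consumers like for_each/collect would leave the default in place.
struct CollectConsumer;
impl Consumer for CollectConsumer {}

// Adaptors with a base consumer would defer to it.
struct MapConsumer<C>(C);
impl<C: Consumer> Consumer for MapConsumer<C> {
    fn max_unconditional_split(&self) -> Option<usize> {
        self.0.max_unconditional_split()
    }
}

// FlatMap would wrap the real consumer and clamp the limit for inner iterators.
struct NoSplitWrapper<C>(C);
impl<C: Consumer> Consumer for NoSplitWrapper<C> {
    fn max_unconditional_split(&self) -> Option<usize> {
        Some(0)
    }
}

fn main() {
    assert_eq!(CollectConsumer.max_unconditional_split(), None);
    assert_eq!(MapConsumer(CollectConsumer).max_unconditional_split(), None);
    assert_eq!(
        NoSplitWrapper(MapConsumer(CollectConsumer)).max_unconditional_split(),
        Some(0)
    );
}
```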
I'll start writing a PR for the `const_length` changes, since I already have it implemented. I am going to have to do some more profiling for that, though.
rayon 1.10.0
I wrote a small Mandelbrot visualizer including this snippet:
num_iterations is a Vec of length 10_000 * 10_000, counts a Vec of length 10_000. The largest memory usage I observed was around 380MB, which seems to fit the math quite well: 10_000 * 10_000 * 4 bytes ~= 400MB.
I am on a 6/12 core processor, so the increase is not linear with the number of cores as I first suspected.
When I switch to using `par_iter`, the usage spikes to over 16GB (an increase of 40x) after a couple of seconds. The runtime is also much slower than single-threaded. (I was unable to time it, since it took too long on Windows and kept getting OOM-killed on Linux.) In case it is useful, the repo with the full code is here.