rayon-rs / rayon

Rayon: A data parallelism library for Rust
Apache License 2.0

Switching to par_iter makes memory usage explode (40x) #1189

Open BloodStainedCrow opened 4 weeks ago

BloodStainedCrow commented 4 weeks ago

rayon 1.10.0

I wrote a small Mandelbrot visualizer including this snippet:

let data: Vec<u8> = num_iterations
        .par_iter()
        .map(|num_iterations| (counts[*num_iterations as usize], num_iterations))
        .map(|(counts_cum, num_iterations)| {
            (
                1.0 - f64::from(counts_cum) / (WIDTH * HEIGHT) as f64,
                num_iterations,
            )
        })
        .map(|(brightness, num_iterations)| (brightness.sqrt(), num_iterations))
        .flat_map(|(mut brightness, num_iterations)| {
            if *num_iterations == MAX_ITER {
                brightness = 0.0;
            }

            [
                (brightness * 255.0) as u8,
                (brightness * 255.0) as u8,
                (brightness * 255.0) as u8,
                255u8,
            ]
        })
        .collect();

num_iterations is a Vec of length 10_000 * 10_000, counts a Vec of length 10_000. The largest memory usage I observed was around 380MB, which seems to fit the math quite well: 10_000 * 10_000 * 4 bytes ~= 400MB.

I am on a 6-core/12-thread processor, so the increase is not linear with the number of cores, which is what I first suspected.

When I switch to using par_iter, the usage spikes to over 16GB (an increase of 40x) after a couple of seconds. The runtime is also much slower than single-threaded. (I was unable to time it, since it took too long on Windows and kept getting OOM-killed on Linux.)

In case it is useful, the repo with the full code is here.

cuviper commented 4 weeks ago

Unfortunately, flat_map is kind of a performance footgun, especially to collect. It causes another full round of parallel splitting every time, which is quite wasteful for just your 4 elements. It also prevents this from being an IndexedParallelIterator, so collect can't write results directly into a final Vec -- instead it has to create a bunch of temporary Vec and then merge them later. In theory that would only result in about double the overall memory usage, but the allocator might not be releasing the memory back to the OS for so many differently-sized allocations.

You could use flat_map_iter for the first problem, so that part will run sequentially in each parallel job, but that doesn't totally solve the indexing / temporary-Vec problem. Since we can see ourselves that the result will be exactly 4 * len, you could set that up a bit manually instead, something like:

let mut data = vec![0u8; num_iterations.len() * 4];
data.par_chunks_mut(4).zip(num_iterations).for_each(|(data, num_iterations)| {
    // fill in this chunk
});
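
For example, the chunk body could reuse the arithmetic from your snippet above. This is an untested sketch of that idea, using counts, WIDTH, HEIGHT and MAX_ITER as defined in your code (and borrowing num_iterations so it stays usable afterwards):

let mut data = vec![0u8; num_iterations.len() * 4];
data.par_chunks_mut(4)
    .zip(&num_iterations)
    .for_each(|(pixel, &num_iterations)| {
        let counts_cum = counts[num_iterations as usize];
        let mut brightness = (1.0 - f64::from(counts_cum) / (WIDTH * HEIGHT) as f64).sqrt();
        if num_iterations == MAX_ITER {
            brightness = 0.0;
        }
        let value = (brightness * 255.0) as u8;
        // each chunk is exactly 4 bytes: R, G, B, A
        pixel.copy_from_slice(&[value, value, value, 255]);
    });
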
BloodStainedCrow commented 4 weeks ago

Hmm, that makes sense. Thanks for the explanation. This seems quite unfortunate, especially since being a drop-in replacement for iter is the main selling point of par_iter, and flat_map is (in my experience) often used in this manner. All the more so since this performance footgun is this large.

But without specialization, I doubt there is anything that can be done to address this without also impacting flat_map with large (and efficiently parallelizable) iterators. Unfortunate :(

cuviper commented 4 weeks ago

The naive flat_map inner splitting comes from this line: https://github.com/rayon-rs/rayon/blob/7b3ef64f3527f30446c9f91ac2e33df2f639cf3a/src/iter/flat_map.rs#L127

That could possibly be improved by keeping the same plumbing::Splitter instance from the outer to inner part, but I don't see any way to do that with any generic ParallelIterator. Even if that worked, it would at most achieve the same thing as flat_map_iter, still not as good as you can do with direct data writes.

It would be really cool if FlatMap could tell when the inner iterators have statically-known length, and then it could be a full IndexedParallelIterator. The standard library has an internal trait ConstSizeIntoIterator for this, used with real specialization, but maybe we can do similar even with our opt_len pseudo-specialization. I'm not sure though, because that would also need to be able to split at arbitrary points, including the middle of flattened iterators.

These are hard problems to make the user side easy. :) Did the workaround that I suggested fare better for you?

BloodStainedCrow commented 3 weeks ago

Yes. I switched from flat_map to flat_map_iter. Determining the amount of memory used suddenly became hard, since the program finished this step of the algorithm in a fraction of a second (instead of minutes), but I think it is somewhere in the neighborhood of 700MB at most, which seems in line with your predictions. After that worked so well, I did not try preallocating the Vec and passing mutable chunks.
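
For reference, the change is essentially this shape (a toy sketch with dummy data, not my actual Mandelbrot code):

use rayon::prelude::*;

fn main() {
    let input = vec![0u32; 1_000];
    // flat_map_iter takes the same closure as flat_map, but the returned [u8; 4]
    // is iterated sequentially inside each parallel job instead of being split
    // again by rayon.
    let rgba: Vec<u8> = input
        .par_iter()
        .flat_map_iter(|&v| {
            let b = (v % 256) as u8;
            [b, b, b, 255]
        })
        .collect();
    assert_eq!(rgba.len(), input.len() * 4);
}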

BloodStainedCrow commented 3 weeks ago

If I understand it correctly, opt_len does not help, for the same reason ExactSizeIterator does not help. It guarantees that we know the size of any one iterator, but unlike ConstSizeIntoIterator does not guarantee ALL of them to be of this size.

BloodStainedCrow commented 3 weeks ago

It might be possible by adding a way to get a length from the type directly (not from an instance), if one is available. For example, an associated function IndexedParallelIterator::const_len() -> Option<usize>, which would return the same length for every iterator of that type (if possible). So a parallel iterator for a [u32; 10] would return Some(10), while a parallel iterator for a slice would return None, since the size of a slice is not known at compile time.

With that it would be possible to do a sort of runtime specialization.
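
As a toy model of what I mean (made-up trait and wrapper types, not rayon's API):

// The length is a property of the type, not of a value, so arrays can answer
// and slices cannot.
trait ConstLength {
    /// Length known from the type alone, independent of any instance.
    fn const_length() -> Option<usize>;
}

struct ArrayIter<T, const N: usize>(std::array::IntoIter<T, N>);

impl<T, const N: usize> ConstLength for ArrayIter<T, N> {
    fn const_length() -> Option<usize> {
        Some(N) // every value of this type has exactly N elements
    }
}

struct SliceIter<'a, T>(std::slice::Iter<'a, T>);

impl<'a, T> ConstLength for SliceIter<'a, T> {
    fn const_length() -> Option<usize> {
        None // a slice's length is not part of its type
    }
}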

BloodStainedCrow commented 3 weeks ago

Crap, I fell into the same trap as the original proposal of ConstSizeIterator. The size of an Iterator changes...

This would need to be a function on IntoParallelIterator.

BloodStainedCrow commented 3 weeks ago

I have experimented a bit with my idea for a possible const_len function. I have managed to convert flat_map such that it now uses the direct-write version of collect (special_extend).

This has improved the memory footprint (and therefore also the runtime, since most of the runtime seems to be memory allocation). Where before, running flat_map on iterators with 10_000 elements (and 4-element arrays as sub-iterators) would consistently get the program SIGKILLed on Linux, it now uses approximately the same amount of memory as flat_map_iter, which seems like a step in the right direction.

This does not fix the slow runtime of flat_map: it is still 15x slower than flat_map_iter in my test (which is not representative at all), but that is because the issue of splitting these tiny tasks is of course still there.

But I guess 15x slower vs DNF is a start

cuviper commented 3 weeks ago

If I understand it correctly, opt_len does not help, for the same reason ExactSizeIterator does not help. It guarantees that we know the size of any one iterator, but unlike ConstSizeIntoIterator does not guarantee ALL of them to be of this size.

It doesn't help the inner iterator, but I meant that the outer FlatMap::opt_len can return a proper length if the inner iterator has a constant length, i.e. runtime_len * const_inner_len.

BloodStainedCrow commented 3 weeks ago

That is exactly what I did: https://github.com/BloodStainedCrow/rayon/blob/main/src/iter/flat_map.rs#L47.

If I understand opt_len, it does not guarantee that the inner len is constant, right? That's why I added IntoParallelIterator::const_length

cuviper commented 3 weeks ago

If I understand opt_len, it does not guarantee that the inner len is constant, right?

Definitely not, since it can examine &self for separate instances. Even a new constant method can't really guarantee that, especially as a safe trait to implement, but we can still make that a condition that will likely panic if violated.

That's why I added IntoParallelIterator::const_length

This might be better on ParallelIterator, since we don't need to worry about Iterator::next consumption like the standard library. This way it still works after into_par_iter() and with many adaptors, and FlatMap can still call the associated <PI as IntoParallelIterator>::Iter::const_len() or whatever.

cuviper commented 3 weeks ago

Crap, I fell into the same trap as the original proposal of ConstSizeIterator. The size of an Iterator changes...

For ParallelIterator, I don't think it does -- did this bite you in some way?

BloodStainedCrow commented 3 weeks ago

You are right. I thought I had to implement it on IntoParallelIterator for that reason, but implementing it on ParallelIterator does make more sense here.

see below

BloodStainedCrow commented 3 weeks ago

The same assumptions about the length being correct should apply to const_length as to opt_len. If they are wrong, the program may crash or exhibit other nonsensical behavior, but there is no memory unsafety.

So as long as all "base" consumers that use unsafe do their safety checks, everything should be fine.

BloodStainedCrow commented 3 weeks ago

The issue with moving it to ParallelIterator is that we do not have separate types for ParallelIterators of, e.g., arrays of different lengths. So an associated function like const_length cannot distinguish [0; 100].into_par_iter() from [0; 10].into_par_iter().

Both are a slice::IterMut or whatever, so the type-level information about the length is lost. That was another reason to put it in the IntoParallelIterator trait.

We can distinguish different instances, but before we start consuming the iterators we do not have an instance yet, and at that point it is too late.
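
For the by-reference case the erasure is easy to see (illustrative only):

use rayon::prelude::*;

fn main() {
    let small = [0u8; 10];
    let large = [0u8; 100];
    // Both go through the &[T; N] -> slice conversion, so both parallel
    // iterators have the same type and N is gone from the type level.
    let a: rayon::slice::Iter<'_, u8> = small.par_iter();
    let b: rayon::slice::Iter<'_, u8> = large.par_iter();
    // Only the runtime lengths differ.
    assert_eq!(a.len(), 10);
    assert_eq!(b.len(), 100);
}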

cuviper commented 3 weeks ago

Ah - rayon::array::IntoIter does know its constant N, but you're right that this is lost on the array -> slice::Iter[Mut].

Maybe we can have both? The blanket IntoParallelIterator for I: ParallelIterator can forward the call, so stuff like Map can also propagate length, while array references can report their size directly. (but array.par_iter().map(...) won't be seen as constant regardless.)

BloodStainedCrow commented 3 weeks ago

I am not sure I understand you correctly.

Unless we change the arrays' Iter type to something like ArrayParallelIter<N> and have it be generic over the length of the original array, I don't think it is possible to obtain the length, or to "prove" it is the same for all iterators passed into flat_map.

BloodStainedCrow commented 3 weeks ago

The current way to obtain a length for flat_map that I have implemented is this:

fn opt_len(&self) -> Option<usize> {
    // Indexed only if both the outer length and the type-level inner length are known.
    let base_len = self.base.opt_len()?;
    let sub_len = PI::const_length()?;

    base_len.checked_mul(sub_len)
}

So ParallelIterator would need to have a function like

fn const_length() -> Option<usize>

which I do not see how to implement without changing the types of ArrayIterators (or others we want to have a size for) to include the size in their generics...?

cuviper commented 3 weeks ago

There are 3 implementations of IntoParallelIterator for arrays: https://github.com/rayon-rs/rayon/blob/7b3ef64f3527f30446c9f91ac2e33df2f639cf3a/src/array.rs#L14-L39

&[T; N] and &mut [T; N] convert to slice iterators, where N is lost, but the value [T; N] converts to array::IntoIter<T; N>. All 3 could report their constant length in a new IntoParallelIterator method. If ParallelIterator also gets a new method, then array::IntoIter can report that, but the slice iterators can't.
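
Roughly, the gap looks like this (a hypothetical extension trait just for illustration; rayon has nothing like it today):

trait ConstLen {
    fn const_length() -> Option<usize>;
}

// An array source knows N from its type alone (the same impl could be written
// for &[T; N] and &mut [T; N])...
impl<T, const N: usize> ConstLen for [T; N] {
    fn const_length() -> Option<usize> {
        Some(N)
    }
}

// ...and so does the by-value parallel iterator it produces,
impl<T: Send, const N: usize> ConstLen for rayon::array::IntoIter<T, N> {
    fn const_length() -> Option<usize> {
        Some(N)
    }
}

// but the slice iterator that &[T; N] / &mut [T; N] convert into cannot answer.
impl<'a, T: Sync> ConstLen for rayon::slice::Iter<'a, T> {
    fn const_length() -> Option<usize> {
        None
    }
}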

FlatMap would still call the IntoParallelIterator method, but the blanket impl can expose the length from ParallelIterator too. https://github.com/rayon-rs/rayon/blob/7b3ef64f3527f30446c9f91ac2e33df2f639cf3a/src/iter/mod.rs#L2435

Cloned, Copied, Enumerate, Empty, Once, Map, Chain ... a lot of parallel iterators could benefit from this.

BloodStainedCrow commented 3 weeks ago

I think I see now. You are right. This could be implemented for most things that implement opt_len now, only excluding slice-like things. A bit unfortunate to miss out on &array/&mut array, though.

BloodStainedCrow commented 3 weeks ago

With this it is also possible to conditionally have an opt_len for e.g. flatten, though it would of course need the same changes as flat_map in terms of consumption.

BloodStainedCrow commented 3 weeks ago

Okay, quick list of everything that now has const_length:

  1. ArrayIter
  2. Either
  3. Chain
  4. Cloned
  5. Copied
  6. Empty
  7. Enumerate
  8. Flat_map
  9. Flatten
  10. MinLen
  11. MapWith
  12. Map
  13. Once
  14. PanicFuse
  15. Rev
  16. Zip

Any idea how to get &array in there too? Probably not without changing arrays to stop using slice iterators :(

BloodStainedCrow commented 3 weeks ago

Unfortunately this is a possibly-breaking change

Never mind. In its current form it stops ParallelIterator from being object-safe, so it is a major change currently. If I restrict const_length with where Self: Sized, it goes back to possibly-breaking.

Nevermind again. Both of them are already not object-safe, so this change is a possibly-breaking change.

BloodStainedCrow commented 3 weeks ago

As an aside: Why is opt_len not implemented for UniformBlocks? As far as I can tell it could be?
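
For context, that's the adaptor returned by by_uniform_blocks on an indexed iterator; an illustrative use, assuming a rayon version that has the blocks adaptors:

use rayon::prelude::*;

fn main() {
    // by_uniform_blocks drives the indexed range in fixed-size blocks; the chain
    // is currently collected through the unindexed path because UniformBlocks
    // has no opt_len.
    let doubled: Vec<u32> = (0..10_000u32)
        .into_par_iter()
        .by_uniform_blocks(1_000)
        .map(|x| x * 2)
        .collect();
    assert_eq!(doubled.len(), 10_000);
}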

cuviper commented 3 weeks ago

Any idea how to get &array in there too?

IntoParallelIterator for &[T; N] and &mut [T; N] can have it, just not their associated type Iter. That's a gap, but we're still managing more than the standard library is able in this regard, so I think it's ok. (On that note, it's a little sad that flat_map_iter and flatten_iter can't be similarly improved here, even though they have the advantage on splitting.)

As an aside: Why is opt_len not implemented for UniformBlocks? As far as I can tell it could be?

I think it could, and ExponentialBlocks too, if they are weaned from UnindexedConsumer methods -- this should do it:

diff --git a/src/iter/blocks.rs b/src/iter/blocks.rs
index 1691e8b15c3e..29658de3e6cf 100644
--- a/src/iter/blocks.rs
+++ b/src/iter/blocks.rs
@@ -9,7 +9,7 @@ struct BlocksCallback<S, C> {

 impl<T, S, C> ProducerCallback<T> for BlocksCallback<S, C>
 where
-    C: UnindexedConsumer<T>,
+    C: Consumer<T>,
     S: Iterator<Item = usize>,
 {
     type Output = C::Result;
@@ -20,7 +20,7 @@ where

         // we need a local variable for the accumulated results
         // we call the reducer's identity by splitting at 0
-        let (left_consumer, right_consumer, _) = consumer.split_at(0);
+        let (left_consumer, right_consumer, mut reducer) = consumer.split_at(0);
         let mut leftmost_res = left_consumer.into_folder().complete();
         consumer = right_consumer;

@@ -36,13 +36,14 @@ where
             producer = right_producer;

             // split the consumer
-            let (left_consumer, right_consumer, _) = consumer.split_at(capped_size);
+            let (left_consumer, right_consumer, next_reducer) = consumer.split_at(capped_size);
             consumer = right_consumer;

-            leftmost_res = consumer.to_reducer().reduce(
+            leftmost_res = reducer.reduce(
                 leftmost_res,
                 bridge_producer_consumer(capped_size, left_producer, left_consumer),
             );
+            reducer = next_reducer;
         }
         leftmost_res
     }

(Our "specialized" CollectConsumer doesn't have a problem with to_reducer, but the opt_len docs say that no UnindexedConsumer methods should be used if you return Some, and a third party consumer might care more.)

BloodStainedCrow commented 3 weeks ago

I have done some memory profiling. It seems that flat_map needs so much memory when collecting, because it creates a Vec and appends it into the LinkedList for every element of the iterator. So for

(0..1_000_000).into_par_iter().flat_map(|_| [0, 1, 2, 3]).collect()

It will create 4_000_000 linked-list nodes. Each of these contains 2 pointers (next, prev) plus the Vec, which is another 3 pointers, so 40 bytes total. That means for a final Vec<u8>, a node of size 40 is allocated for each byte in the final Vec, which is exactly the memory increase I observed at the start.
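
Back-of-the-envelope for that figure, assuming a 64-bit target:

use std::mem::size_of;

fn main() {
    // Each LinkedList node carries two pointers (next/prev) plus its element,
    // here a Vec<u8> whose header is ptr + len + capacity.
    let vec_header = size_of::<Vec<u8>>(); // 24 on 64-bit
    let node_links = 2 * size_of::<usize>(); // 16
    let per_output_byte = vec_header + node_links; // 40 bytes of bookkeeping
    println!("~{per_output_byte} bytes of overhead per final byte");

    // For the 4_000_000 single-byte outputs in the example above:
    let total = 4_000_000 * per_output_byte;
    println!("~{} MB just for list nodes", total / 1_000_000); // ~160 MB vs 4 MB of payload
}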

BloodStainedCrow commented 3 weeks ago

While this exact example would benefit from the const_length improvement, something like

(0..1_000_000).into_par_iter().flat_map(|_| rayon::iter::repeatn(0, 4)).collect()

would still be inhibited, since the inner iterator's length is only known at runtime, not from its type.

cuviper commented 3 weeks ago

I have done some memory profiling. It seems that flat_map needs so much memory when collecting, because it creates a Vec and appends it into the LinkedList for every element of the iterator.

Yes, that's the effect of the non-indexed (non-opt_len) collection, combined with the fact that inner flat_map splitting will certainly split a 4-item iterator entirely. With flat_map_iter, you'll still get a temporary LinkedList<Vec<_>>, but it should be better at filling each of those nodes with more items.

BloodStainedCrow commented 3 weeks ago

So if I understand correctly, flat_map_iter should still result in 1_000_000 LinkedList Node allocations?

cuviper commented 3 weeks ago

No, with flat_map_iter you'll only get a new node for each of the dynamic splits of the outer iterator. That should not be per-item of 0..1_000_000, but we can't really predict how many it will be due to work stealing effects on splitting.

BloodStainedCrow commented 3 weeks ago

If flat_map always breaks the iterator down into single elements, maybe it is possible to create a LinkedList<T> instead of a LinkedList<Vec<T>>? That would bring down the wasted memory from 40 bytes to 16?

BloodStainedCrow commented 3 weeks ago

No, with flat_map_iter you'll only get a new node for each of the dynamic splits of the outer iterator. That should not be per-item of 0..1_000_000, but we can't really predict how many it will be due to work stealing effects on splitting.

Ah, I see, and you mentioned flat_map will split unconditionally...

cuviper commented 3 weeks ago

If flat_map always breaks the iterator down into single elements,

It doesn't in general, but 4 items is so few that this case does end up on single elements.

BloodStainedCrow commented 3 weeks ago

Ah unfortunate. So back to square one: The problem is that flat_map uses a naive split...

BloodStainedCrow commented 3 weeks ago

The only way I can currently see to stop flat_map from splitting each sub-iterator unconditionally is to introduce a way for a Consumer to request that no splitting happen. An example could be a function Consumer::max_unconditional_split() -> Option<usize>, which would be unrestricted for all basic Consumers like for_each and collect, and would defer to the base for any Consumer with a base, like map. Then flat_map could wrap the real Consumer in a wrapper which could set this to 0 (or any other value that seems reasonable).

It could also be a method, Consumer::max_unconditional_split(&self)?
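
Roughly what I have in mind, as a sketch with made-up names (this is not rayon's actual Consumer trait):

// Stripped-down stand-in for the Consumer trait, just to show where the new
// method would sit and how adaptors would forward it.
trait ToyConsumer<T> {
    /// How many further unconditional splits this consumer is willing to pay
    /// for; `None` means "no limit".
    fn max_unconditional_split(&self) -> Option<usize> {
        None // leaf consumers like for_each / collect keep the default
    }
}

// An adaptor with a base consumer (like map's) just forwards the answer.
struct ToyMapConsumer<C> {
    base: C,
}

impl<T, C: ToyConsumer<T>> ToyConsumer<T> for ToyMapConsumer<C> {
    fn max_unconditional_split(&self) -> Option<usize> {
        self.base.max_unconditional_split()
    }
}

// flat_map's wrapper around the real consumer would clamp it, so tiny inner
// iterators are not split eagerly.
struct ToyFlatMapConsumer<C> {
    base: C, // kept for the other Consumer methods, unused in this sketch
}

impl<T, C: ToyConsumer<T>> ToyConsumer<T> for ToyFlatMapConsumer<C> {
    fn max_unconditional_split(&self) -> Option<usize> {
        Some(0)
    }
}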

BloodStainedCrow commented 3 weeks ago

I'll start writing a PR for the const_length changes, since I already have them implemented. I am going to have to do some more profiling for that, though.