vigna / webgraph-rs

A Rust port of the WebGraph framework
Apache License 2.0

Rewrite parallel_compress_sequential_iter using rayon #30

progval closed this 1 year ago

progval commented 1 year ago

Motivations:

The changeset of this pull request may be hard to read because a large part of the code didn't change except for indentation or variable names. I recommend reading it commit by commit.

This alters how work is distributed: rayon is now in charge of picking the number of threads and of distributing chunks across them. On my machine (i7-8700, 12 threads), this has no impact on the timings reported by test_par_comp.

Regarding the use of a macro, I wish I could do this purely within the type system, but `Itertools::chunks` and `IndexedParallelIterator::chunks` are surprisingly hard to unify. I went as far as this, and it still didn't compile:

```rust
pub trait MaybeParallelChunks {
    type Item;
}

impl<'a, I: 'a, Iter: Iterator<Item = I>> MaybeParallelChunks for itertools::Chunks<'a, Iter> {
    type Item = I;
}

pub trait IntoMaybeParallelChunks {
    type Chunk: IntoIterator;
    type Chunks<'a>: MaybeParallelChunks + 'a
    where
        Self: 'a;

    fn my_into_iter<'a>(self) -> Self::Chunks<'a>
    where
        Self: 'a;
}

impl<'b, Item: 'b, Iter: Iterator<Item = Item> + 'b> IntoMaybeParallelChunks
    for &'b itertools::IntoChunks<Iter>
{
    type Chunk = Vec<Item>;
    type Chunks<'a> = itertools::Chunks<'a, Iter> where 'b: 'a;

    fn my_into_iter<'a>(self) -> Self::Chunks<'a>
    where
        'b: 'a,
    {
        IntoIterator::into_iter(self)
    }
}

use std::borrow::Borrow;
use std::ops::Deref;

pub trait Chunkable {
    type Item;
    type IntoChunks
    where
        for<'a> Self::IntoChunks: Borrow<Self::IntoChunksRef<'a>>;
    type IntoChunksRef<'a>: IntoMaybeParallelChunks + 'a
    where
        Self: 'a,
        Self::Item: 'a;

    fn chunks(self, size: usize) -> Self::IntoChunks
    where
        for<'a> Self::IntoChunks: Borrow<Self::IntoChunksRef<'a>>;
}

impl<Item, Iter: Iterator<Item = Item>> Chunkable for Iter {
    type Item = Item;
    type IntoChunks = itertools::IntoChunks<Iter>;
    type IntoChunksRef<'a> = &'a itertools::IntoChunks<Iter> where Iter: 'a;

    fn chunks(self, size: usize) -> itertools::IntoChunks<Iter> {
        self.chunks(size)
    }
}
```
vigna commented 1 year ago

I'll let @zommiommy have a look at this when he's back from DefCon, as he wrote the code. One thing, though: test_par_comp uses a toy example, so I think we need a bit more evidence that rayon does the right thing.

One reason for having a fixed number of threads is that this way we can parallelize exactly. If we produce 40 chunks but then rayon picks 39 threads, this is going to be bad. To be tolerant to this kind of variance we would need significantly smaller chunks, which however means potentially more open files. But I think the files are opened sequentially during concatenation, so that's no big issue.
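The imbalance described above can be quantified with a toy makespan count (a sketch with a hypothetical `rounds` helper, not project code): with equal-cost chunks, wall time is roughly proportional to ceil(chunks / threads) scheduling rounds, so a one-thread mismatch at coarse granularity doubles the runtime, while smaller chunks amortize it.

```rust
// Number of rounds needed to run `chunks` equal-cost chunks on `threads`
// threads, assuming perfect packing within each round.
fn rounds(chunks: u64, threads: u64) -> u64 {
    (chunks + threads - 1) / threads // ceiling division
}

fn main() {
    assert_eq!(rounds(40, 40), 1); // perfect split: one round
    assert_eq!(rounds(40, 39), 2); // one straggler thread: ~2x wall time
    assert_eq!(rounds(400, 39), 11); // finer chunks: ~7% over the ideal 10.25
}
```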

It is also true that in most cases after 8-10 threads you exhaust the cache/memory/disk bandwidth and then additional threads are useless.

zommiommy commented 1 year ago

Yeah, I'm starting a benchmark on the SWH graph so we can verify this. Originally it used chunks, but it was much slower.

progval commented 1 year ago

Also, it turns out I won't use ParallelIterator as I expected: it's too hard to get an IndexedParallelIterator from reading a bunch of ORC files. So it's no big deal if you decide against this change.

progval commented 1 year ago

It takes 21h instead of 7h on the SWH graph, so this is probably a bad idea.