rust-itertools / itertools

Extra iterator adaptors, iterator methods, free functions, and macros.
https://docs.rs/itertools/
Apache License 2.0
2.76k stars 309 forks source link

Add `batch_by`, to allow grouping elements together by their size #992

Open JosephLenton opened 2 months ago

JosephLenton commented 2 months ago

Hey, I would like to propose adding a method batch_by (I don't know if this is the best name, however I'll use it here).

The function signature would be something like:

fn batch_by<N: Num>(max_batch_size: N, batch_fun: F) -> BatchBy<Self, V, F, N>
where
        Self: Sized,
        F: Fn(&Self::Item) -> N,
        N: Num + PartialOrd + Copy,

(I'm not 100% sure on the need for Num. I used it in an implementation I wrote, however I believe it could probably be dropped.)

Pseudo Example Code

const MAX_BATCH_SEND_SIZE: usize = 100;

// Data we are sending
let data_to_send = vec![
    "short-data",
    "very-very-very-...-very-long-data",
    "medium-length-data",
    "short-again",
    "medium-length-data-again",

    // ... imagine lots more data ...
];

// This is the function in use
let batches = data_to_send
    .into_iter()
    .batch_by(MAX_BATCH_SEND_SIZE, |data| data.len());

// Send batches of data
for batch in batches {
    let batch_to_send = batch.collect::<Vec<&'static str>>();
    send(batch_to_send).await?;
}

Motivations

I have personally needed this on several projects when grouping things to be sent on to an external service. For example on a real world project, I needed batch data into 10mb groups to send to ElasticSearch.

Comments

AFAIK there is nothing like this in Itertools. There are ways to group by key, or to find the largest or smallest. A means to say 'put these into groups of 10mb or less' .

I have written this before, so if there is interest I would be more than happy to write a PR for Itertools.

phimuemue commented 2 months ago

Hi, thanks for the idea.

I wonder if chunk_by, batching or (a generalized) coalesce might do the job, too. Did you try these and see how far you get?

Related: https://github.com/rust-itertools/itertools/pull/436

JosephLenton commented 2 months ago

One can achieve this using chunk_by by counting generations:

let mut current_batch_size = 0;
let mut generation = 0;
let batches = data_to_send
    .into_iter()
    .chunk_by(move |data| {
        let size = data.len();

        if current_batch_size + size > MAX_BATCH_SEND_SIZE {
            current_batch_size = 0;
            generation += 1;
        }

        current_batch_size += size;

        generation
    });

I actually ended up writing this twice due to getting the logic wrong the first time. In practice I would end up putting this into a helper function, so I can wrap that with tests.