huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Batched filter inputs? #237

Open stas00 opened 4 days ago

stas00 commented 4 days ago

This is a very cool library! Kudos to the authors!

The Filter API seems to only work on a single item at a time.

Is there a way to filter in batches? Say you're using a filter that runs ML model inference: it'd be much more efficient to run inference on large batches than on one item at a time.

I looked through the examples and code in case I had missed it, but I couldn't find any indication that batched input is supported.

I think the API could be similar to the HF Tokenizer, which takes batches and returns batches: here, instead of returning a single bool, the filter would return a list of bools. If the input is a single sample, return a single bool; if it's a list, return a list.
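Something roughly like this, just to illustrate the shape I mean (hypothetical names, none of this is existing datatrove API):

```python
# Hypothetical sketch of the proposed shape; none of these names exist in datatrove.
from typing import List, Union


class MyMLFilter:
    def __init__(self, model, threshold: float = 0.5):
        self.model = model
        self.threshold = threshold

    def filter(self, docs: Union["Document", List["Document"]]) -> Union[bool, List[bool]]:
        # List in -> list of bools out; single document in -> single bool out,
        # mirroring how the HF tokenizer accepts either one string or a batch.
        single = not isinstance(docs, list)
        batch = [docs] if single else docs
        # One inference call over the whole batch instead of one per document.
        scores = self.model.score([d.text for d in batch])  # hypothetical model call
        keep = [s >= self.threshold for s in scores]
        return keep[0] if single else keep
```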

Thanks a lot!

guipenedo commented 3 days ago

Great suggestion, thanks! Added support for this in 7ba873fc87086098657e488e7365f8c14aeb4d06. You can now override BaseFilter's filter_batch(self, batch: List[Document]) -> List[bool | Tuple[bool, str]] method and pass batch_size to BaseFilter's __init__.
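For anyone finding this later, here is a minimal sketch of a filter built on the new hook. The scoring model, its score_batch call, the threshold, and the exclusion reason string are assumptions for illustration; only filter_batch, batch_size, and the return type come from the commit above, and the import paths reflect the usual datatrove layout.

```python
# Minimal sketch of a batched ML filter built on the new filter_batch hook.
# The model object, its score_batch() call, and the threshold are assumed for
# illustration; only filter_batch, batch_size, and the return type come from
# the commit referenced above.
from __future__ import annotations

from typing import List, Tuple

from datatrove.data import Document
from datatrove.pipeline.filters.base_filter import BaseFilter


class MLBatchedFilter(BaseFilter):
    def __init__(self, model, threshold: float = 0.5, batch_size: int = 64):
        # batch_size is forwarded to BaseFilter so documents are grouped into
        # batches before filter_batch is called.
        super().__init__(batch_size=batch_size)
        self.model = model
        self.threshold = threshold

    def filter(self, doc: Document) -> bool | Tuple[bool, str]:
        # Single-document path: reuse the batched path for one document.
        return self.filter_batch([doc])[0]

    def filter_batch(self, batch: List[Document]) -> List[bool | Tuple[bool, str]]:
        # One model call for the whole batch instead of one call per document.
        scores = self.model.score_batch([doc.text for doc in batch])  # assumed model API
        return [
            True if score >= self.threshold else (False, "below_ml_threshold")
            for score in scores
        ]
```

Implementing filter as a thin wrapper around filter_batch keeps a single-document path available alongside the batched one.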