huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Fastwarc reader #182

Open jordane95 opened 6 months ago

jordane95 commented 6 months ago

Can we add a new WARC reader based on FastWARC?

It is reported to be much more efficient than warcio.
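For illustration, here is a minimal sketch of what iterating over a WARC file with FastWARC could look like. It is not datatrove's reader interface: the helper name `iter_warc_responses` and the dict layout it yields are hypothetical, and the file path is whatever the caller supplies. Only `ArchiveIterator`, `WarcRecordType`, `record.record_id`, `record.headers` and `record.reader` come from FastWARC itself.

```python
from typing import Iterator


def iter_warc_responses(path: str) -> Iterator[dict]:
    """Yield one dict per HTTP response record in a WARC file (hypothetical helper)."""
    # Imported lazily so this module still loads where FastWARC is not installed.
    from fastwarc.warc import ArchiveIterator, WarcRecordType

    with open(path, "rb") as stream:
        # record_types restricts iteration to response records,
        # so request and metadata records are never yielded.
        for record in ArchiveIterator(stream, record_types=WarcRecordType.response):
            yield {
                "id": record.record_id,
                "url": record.headers.get("WARC-Target-URI"),
                # Payload bytes, still undecoded; charset handling is left to the caller.
                "content": record.reader.read(),
            }
```

A caller would loop over `iter_warc_responses("some-file.warc.gz")` and pass each `content` on to text extraction. Mapping these dicts onto datatrove's own document type is left out on purpose, since that part depends on the library's reader base class.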

maxidl commented 5 months ago

Including FastWARC would be nice. However, in the current text extraction pipeline for FineWeb, the WARC reader is not a bottleneck: it accounts for <5% of runtime on my machine, while trafilatura accounts for 95%. Of course, this might differ for other datasets.
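To check where the time goes on another dataset, a per-stage timer is enough. The sketch below is a generic, standard-library-only helper and not part of datatrove: `StageTimer` and the stage names are hypothetical, and the two lambdas only stand in for the real reading and extraction calls.

```python
import time
from collections import defaultdict


class StageTimer:
    """Accumulate wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)

    def timed(self, name, func, *args):
        # Run func(*args), add its elapsed time to the named stage, return its result.
        start = time.perf_counter()
        result = func(*args)
        self.totals[name] += time.perf_counter() - start
        return result

    def shares(self):
        # Fraction of the total measured time spent in each stage.
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: spent / total for name, spent in self.totals.items()}


timer = StageTimer()
# Stand-ins for the real stages: reading a record, then extracting its text.
raw = timer.timed("read", lambda: b"<html><body>hello</body></html>" * 1000)
text = timer.timed("extract", lambda data: data.decode("utf-8").upper(), raw)

for stage, share in timer.shares().items():
    print(f"{stage}: {share:.1%}")
```

If the share of the read stage stays in the low single digits, as reported above, swapping the reader will barely change total runtime. If it is large, a faster reader becomes worth the effort.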