Open jordane95 opened 6 months ago
Including fastwarc would be nice. However, in the current text extraction pipeline for FineWeb, the WARC reader is not the bottleneck: it takes less than 5% of the runtime on my machine, whereas trafilatura accounts for 95%. Of course, this might differ for other datasets.
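For anyone who wants to check this split on their own data, here is a minimal timing sketch. It is not the pipeline's actual profiling code: `profile_stages`, `fake_read`, and `fake_extract` are hypothetical names, and the two stand-in functions would need to be replaced with the real reader and extractor calls.

```python
import time
from collections import defaultdict


def profile_stages(items, stages):
    """Pass each item through the stages in order and return each stage's
    share of the total measured time (fractions that sum to 1)."""
    totals = defaultdict(float)
    for item in items:
        for name, fn in stages:
            start = time.perf_counter()
            item = fn(item)  # output of one stage feeds the next
            totals[name] += time.perf_counter() - start
    grand_total = sum(totals.values()) or 1.0  # guard against division by zero
    return {name: totals[name] / grand_total for name, _ in stages}


# Hypothetical stand-ins for the real stages (assumptions, not library calls).
def fake_read(raw: bytes) -> str:
    return raw.decode("utf-8")


def fake_extract(html: str) -> str:
    return " ".join(html.split())


shares = profile_stages(
    [b"<p>hello   world</p>"] * 1000,
    [("read", fake_read), ("extract", fake_extract)],
)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

If the reader's share stays in the low single digits on a given dataset, swapping the reader alone is unlikely to change total runtime much.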
Can we add a new WARC reader based on fastwarc?
It is reported to be much more efficient than warcio.
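A rough sketch of what such a reader could look like, assuming fastwarc's `ArchiveIterator` and `WarcRecordType` from `fastwarc.warc`. The function names `read_warc_fast` and `record_to_doc` and the output dict layout are hypothetical, not an existing reader interface in this project.

```python
from typing import Iterator, Optional


def record_to_doc(url: Optional[str], record_id: str, payload: bytes) -> dict:
    """Turn one response payload into a plain dict (hypothetical layout)."""
    return {
        "id": record_id,
        "url": url,
        # Assumption: decode as UTF-8 and replace undecodable bytes.
        "text": payload.decode("utf-8", errors="replace"),
    }


def read_warc_fast(path: str) -> Iterator[dict]:
    """Yield one dict per HTTP response record in a WARC file."""
    # Imported lazily so the module still loads when fastwarc is not installed.
    from fastwarc.warc import ArchiveIterator, WarcRecordType

    with open(path, "rb") as stream:
        # Only iterate over response records; parse_http=True makes the
        # record reader start at the HTTP body rather than the HTTP headers.
        for record in ArchiveIterator(
            stream, record_types=WarcRecordType.response, parse_http=True
        ):
            yield record_to_doc(
                record.headers.get("WARC-Target-URI"),
                record.record_id,
                record.reader.read(),
            )


# The conversion step can be tried without a WARC file:
doc = record_to_doc("http://example.com", "<urn:uuid:1>", b"<p>hi</p>")
```

Usage would be along the lines of `for doc in read_warc_fast("some_file.warc.gz"): ...`, where the path is a placeholder.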