huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Error in Stats #287

Closed: crisgarrillo closed this issue 1 week ago

crisgarrillo commented 2 weeks ago

Hi, I'm trying to run a stats pipeline. Following your example summary_stats.py, I get this error:

AttributeError: 'TLDExtract' object has no attribute 'extract_str'. Did you mean: '_extractor'?
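
For an error like this, one quick check is whether the installed tldextract package exposes the method at all, and which version is installed. The sketch below is only a diagnostic under assumptions: `describe_method` is a hypothetical helper (not part of datatrove or tldextract), and it assumes the distribution name matches the module name, which is the case for tldextract. It does not confirm the cause of the error.

```python
import importlib
from importlib import metadata


def describe_method(module_name, class_name, method_name):
    """Report whether a module is importable, its installed version,
    and whether the named class exposes the named method.

    Hypothetical helper for diagnosis only.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package is not installed in this environment.
        return {"installed": False, "version": None, "has_method": False}

    try:
        # Assumes the distribution name equals the module name.
        version = metadata.version(module_name)
    except metadata.PackageNotFoundError:
        version = None  # no distribution metadata found under that name

    cls = getattr(module, class_name, None)
    return {
        "installed": True,
        "version": version,
        "has_method": cls is not None and hasattr(cls, method_name),
    }


if __name__ == "__main__":
    # Names taken from the error message above.
    print(describe_method("tldextract", "TLDExtract", "extract_str"))
```

If `has_method` comes back `False` while `installed` is `True`, the installed tldextract does not provide `extract_str`, and the reported version is worth including in the issue.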

I tried with both Parquet and JSONL files, and with commonly used datasets such as CulturaX and RedPajama.

Any ideas or suggestions would be much appreciated.

Thanks, Chris