huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0
1.97k stars, 139 forks

Add token and char count to histogram stats #251

Closed (guipenedo closed this 2 months ago)

guipenedo commented 2 months ago

Adds 2 new stats per original stat when histogram stats are collected: