huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0
1.97k stars 139 forks

Update MinhashConfig with detailed settings and add default language … #252

Closed justHungryMan closed 1 month ago

justHungryMan commented 2 months ago

While using fuzzy-dedup on Korean documents, I recalled the updates related to multi-language tokenization.

Key Changes:

These updates led to the explicit inclusion of a language option in our example code, so that others can easily see how to apply this setting.
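
To illustrate why an explicit language setting matters for fuzzy dedup, here is a minimal standalone sketch. It does not use datatrove and is not the code changed in this PR: `SketchMinhashConfig`, its field names, the `TOKENIZERS` table, and the toy per-character rule for Korean are all assumptions made up for this example. It only shows the general idea that the tokenizer chosen by language decides which shingles get hashed into the MinHash signature.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchMinhashConfig:
    """Hypothetical stand-in for illustration; not datatrove's MinhashConfig."""
    n_grams: int = 3
    num_buckets: int = 4
    hashes_per_bucket: int = 2
    language: str = "en"  # spelled out so readers can see where to change it


# Toy tokenizers keyed by language code (assumption). A real pipeline would
# plug in a proper word tokenizer for each language.
TOKENIZERS = {
    "en": lambda text: text.lower().split(),
    # Toy rule for Korean: one token per non-space character.
    "ko": lambda text: [ch for ch in text if not ch.isspace()],
}


def shingles(text, config):
    """Return the set of token n-grams for text under the configured language."""
    tokens = TOKENIZERS[config.language](text)
    n = config.n_grams
    if not tokens:
        return set()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def signature(text, config):
    """Return a MinHash signature: the minimum hash per seed over all shingles."""
    grams = shingles(text, config)
    if not grams:
        return []
    sig = []
    for seed in range(config.num_buckets * config.hashes_per_bucket):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return sig


ko_config = SketchMinhashConfig(language="ko")
sig_a = signature("빠른 갈색 여우가 게으른 개를 뛰어넘는다", ko_config)
sig_b = signature("빠른 갈색 여우가 게으른 개를 뛰어넘는다", ko_config)
```

Identical documents produce identical signatures, while switching `language` changes the shingles and therefore the signature, which is why leaving the setting implicit can silently affect dedup results on non-English data.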

hynky1999 commented 1 month ago

Good idea. I decided to also migrate the rest of the examples in this PR.

Thanks!