huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

[FEATURE] CC Filter #283

Open BramVanroy opened 2 months ago

BramVanroy commented 2 months ago

I was looking at dolma, and they have a nice filter that keeps only Creative Commons-licensed data. It might be worthwhile to add something similar to datatrove, too.

https://github.com/allenai/dolma/blob/64886d9db15bd99acea9e28740ae20a510875dfb/python/dolma/taggers/licenses.py#L19
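
To make the request a bit more concrete, here is a minimal sketch of the kind of check I have in mind. It is not dolma's implementation and not an existing datatrove API: the names `CC_URL_RE`, `find_cc_license` and `keep_document` are made up for illustration, and I am assuming that a license can be detected by looking for a `creativecommons.org/licenses/...` or `creativecommons.org/publicdomain/...` URL in the raw page content.

```python
import re
from typing import Optional

# Hypothetical pattern: matches Creative Commons license URLs such as
# https://creativecommons.org/licenses/by-sa/4.0/ or
# https://creativecommons.org/publicdomain/zero/1.0/
CC_URL_RE = re.compile(
    r"creativecommons\.org/(?:licenses|publicdomain)/"
    r"(?P<kind>[a-z]+(?:-[a-z]+)*)/(?P<version>\d\.\d)",
    re.IGNORECASE,
)


def find_cc_license(text: str) -> Optional[str]:
    """Return a label like 'cc-by-sa-4.0' for the first CC URL found, else None."""
    match = CC_URL_RE.search(text)
    if match is None:
        return None
    return f"cc-{match.group('kind').lower()}-{match.group('version')}"


def keep_document(text: str) -> bool:
    """Keep a document only if it contains a Creative Commons license URL."""
    return find_cc_license(text) is not None


if __name__ == "__main__":
    page = '<a rel="license" href="https://creativecommons.org/licenses/by-sa/4.0/">CC</a>'
    print(find_cc_license(page))
    print(keep_document("No license information here."))
```

In datatrove this logic would presumably live in the `filter` method of a filter block, with the detected license stored in the document metadata. Note that a URL match only shows that the page mentions a CC license, not that the whole page is released under it, so a real implementation would likely need stricter rules.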