Open BramVanroy opened 2 months ago
I was looking at dolma, and they have a nice license tagger that makes it possible to keep only Creative Commons data. It might be worthwhile to add something similar to datatrove, too.
https://github.com/allenai/dolma/blob/64886d9db15bd99acea9e28740ae20a510875dfb/python/dolma/taggers/licenses.py#L19
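To make the request a bit more concrete, here is a rough sketch of the kind of check I have in mind. It is not dolma's implementation and it does not use any datatrove API: the regex and the names `CC_LICENSE_RE`, `find_cc_license` and `keep_document` are hypothetical, and it assumes the raw HTML of a page is available so that license URLs can be found.

```python
import re

# Hypothetical pattern (an assumption, not taken from dolma): matches
# Creative Commons license URLs such as
# https://creativecommons.org/licenses/by-sa/4.0/
CC_LICENSE_RE = re.compile(
    r"creativecommons\.org/licenses/(by(?:-nc)?(?:-nd|-sa)?)/(\d\.\d)",
    re.IGNORECASE,
)


def find_cc_license(html):
    """Return a label like 'cc-by-sa-4.0' for the first CC license URL, else None."""
    match = CC_LICENSE_RE.search(html)
    if match is None:
        return None
    return f"cc-{match.group(1).lower()}-{match.group(2)}"


def keep_document(html):
    """Keep a document only if a CC license URL was found in its HTML."""
    return find_cc_license(html) is not None
```

In datatrove this logic would presumably live in a filter step, so that documents without a detected license are dropped. Note that the sketch only covers the `licenses/` URLs and would miss things like CC0 public domain dedications.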