Open Uinelj opened 2 years ago
`zstd` instead of gzip. Check how we can use dictionaries. Check if we can do multipart?
We have the sampling already finished? no
Sampling is related to `oscar-tools` and has already been merged: https://github.com/oscar-corpus/oscar-tools/pull/23
This issue tracks the changes and new features of the output corpus
This issue serves as a discussion/checklist elaboration for the next OSCAR version to come.
We should aim to fix existing bugs/problems as well as add potential features.
Issues
Features
`adult` is too strong for something we know has a lot of false positives. Also, with the inclusion of model-based filtering, we'll have to find a way to specify the annotation source.