oscar-project / ungoliant

:spider: The pipeline for the OSCAR corpus
https://oscar-corpus.com
Apache License 2.0
161 stars 14 forks

Option to keep documents that can't be identified #88

Open Uinelj opened 1 year ago

Uinelj commented 1 year ago

We could add an option that enables keeping documents that are not identifiable (where the classifier can't infer a document language), for further inspection.

chris-ha458 commented 1 year ago

In the case of mC4 (also called c4/multilingual), the undetermined portion ('und') of mC4 3.1 consists of documents for which, according to their langID tool cld3, the highest confidence for any language is <0.95. Since Ungoliant works differently and with different langID tools and models (fastText with lid176.bin, though I hope to petition to change this to lid218), the specific processes and cutoffs might have to be different. Since Ungoliant records a per-sentence confidence score, many options could be explored. The current average confidence weighted by byte count seems a very good compromise, especially compared to a simple mean.
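To make the comparison concrete, here is a minimal sketch of a byte-weighted average confidence next to a simple mean, with a cutoff that routes low-confidence documents to an 'und' bucket. This is not Ungoliant's actual code: the function names, the `(text, confidence)` input shape, and the reuse of mC4's 0.95 cutoff are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical cutoff: borrowed from the mC4/cld3 figure quoted above,
# not a value Ungoliant is known to use.
UND_THRESHOLD = 0.95

# Each sentence is (text, confidence), where confidence is the langID
# score for the document's candidate language (assumed input shape).
Sentences = List[Tuple[str, float]]


def simple_mean(sentences: Sentences) -> float:
    """Unweighted mean: every sentence counts equally, whatever its length."""
    if not sentences:
        return 0.0
    return sum(conf for _, conf in sentences) / len(sentences)


def byte_weighted_mean(sentences: Sentences) -> float:
    """Mean confidence where each sentence is weighted by its UTF-8 byte size."""
    sizes = [len(text.encode("utf-8")) for text, _ in sentences]
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum(size * conf for size, (_, conf) in zip(sizes, sentences)) / total


def label_or_und(lang: str, sentences: Sentences,
                 threshold: float = UND_THRESHOLD) -> str:
    """Keep the language label if confident enough, else mark the document 'und'."""
    return lang if byte_weighted_mean(sentences) >= threshold else "und"


# A short, poorly identified line next to a long, well identified one.
doc = [("ok", 0.2), ("a" * 98, 1.0)]
print(simple_mean(doc))         # dominated equally by the 2-byte line
print(byte_weighted_mean(doc))  # dominated by the 98-byte line
print(label_or_und("en", doc))
```

With a simple mean, the two-byte line drags the document's score down as much as the long line pulls it up; with byte weighting, the short line barely matters. Under this sketch, an option to keep unidentifiable documents would amount to writing the 'und' bucket out instead of discarding it.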

In any case, this would be very useful. The 'und' portion of mC4 is second only to English in byte size, and it is ripe with opportunities for humans to get involved, either to salvage data or to understand langID behaviors.

Some others and I are actively doing this kind of salvaging, and here is an example of such efforts.