Option to keep documents that can't be identified

In the case of mC4 (also called c4/multilingual), the undetermined portion ('und') of mC4 3.1 consists of the documents for which, according to its langID tool cld3, the highest confidence for any language is below 0.95. Since Ungoliant works differently and uses different langID tools and models (fastText with lid.176.bin, though I hope to petition to change this to lid218), the specific processes and cutoffs might have to be different. Because Ungoliant records a per-sentence confidence score, many options could be explored. The current average confidence, weighted by byte count, seems a very good compromise, especially compared to a simple mean.
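To illustrate why the byte-weighted average behaves differently from a simple mean, here is a minimal Python sketch. It is not Ungoliant's actual code (Ungoliant is written in Rust). The `Sentence` type, the function names, the `und` fallback rule, and the reuse of mC4's 0.95 cutoff as a default are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """Hypothetical record: one sentence with its langID prediction."""
    text: str
    lang: str
    confidence: float


def simple_mean(sentences):
    """Unweighted mean of the per-sentence confidences."""
    if not sentences:
        return 0.0
    return sum(s.confidence for s in sentences) / len(sentences)


def byte_weighted_mean(sentences):
    """Mean confidence where each sentence counts in proportion to its UTF-8 size."""
    weights = [len(s.text.encode("utf-8")) for s in sentences]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * s.confidence for w, s in zip(weights, sentences)) / total


def label_document(sentences, threshold=0.95, und_label="und"):
    """Assumed rule: keep the best language if its byte-weighted score
    reaches the threshold, otherwise fall back to the 'und' label.
    The 0.95 default only mirrors mC4's cld3 cutoff; a fastText-based
    pipeline would likely need a different value."""
    total = sum(len(s.text.encode("utf-8")) for s in sentences)
    if total == 0:
        return und_label
    scores = {}
    for s in sentences:
        size = len(s.text.encode("utf-8"))
        scores[s.lang] = scores.get(s.lang, 0.0) + size * s.confidence / total
    best_lang = max(scores, key=scores.get)
    return best_lang if scores[best_lang] >= threshold else und_label


doc = [
    Sentence("Hello world, this is a long English sentence.", "en", 0.99),
    Sentence("ok", "fr", 0.30),
]
print(simple_mean(doc), byte_weighted_mean(doc), label_document(doc, threshold=0.9))
```

In this example the short, low-confidence line pulls the simple mean down sharply, while the byte-weighted mean stays close to the confidence of the long sentence. A document falling under the threshold would be kept under `und` instead of being dropped.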

In any case, this would be very useful. The 'und' portion of mC4 is second only to English in quantity (byte size), and it is ripe with opportunities for humans to get involved, either to salvage data or to understand langID behaviors.

Some others and I are actively doing this kind of salvaging, and here is an example of such efforts.

oscar-project / ungoliant

Option to keep documents that can't be identified #88