Gee (1998), "The Cornell TIPSTER Phase III Project", describes a research project using near-duplicate detection --> find publications that describe the methods used
Grave et al. (2018), "Learning Word Vectors for 157 Languages", remove lines with identical Java hashes from the training data for their fastText word embeddings in 157 languages
Lee et al. (2021), "Deduplicating Training Data Makes Language Models Better", find that deduplicating the C4 training data of a transformer language model lowers perplexity on Wiki-40B and the One Billion Word benchmark, and reduces the model's tendency to emit memorized sequences of 50 or more training tokens by an order of magnitude.
What would happen if we added OpusFilter, or another (near-)duplicate removal tool, to the pipeline?
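As a starting point, the simplest variant of such a step is exact line-level deduplication by hash, in the spirit of Grave et al. (2018). A minimal sketch (the `normalize` step and the choice of MD5 are my own illustrative assumptions, not the exact method of any cited paper or of OpusFilter):

```python
import hashlib

def normalize(line: str) -> str:
    # Illustrative normalization: lowercase and collapse whitespace,
    # so trivially differing lines count as near-duplicates.
    return " ".join(line.lower().split())

def dedup_lines(lines):
    # Keep only the first occurrence of each normalized-line hash.
    seen = set()
    out = []
    for line in lines:
        h = hashlib.md5(normalize(line).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(line)
    return out

corpus = ["Hello World", "hello   world", "Goodbye"]
print(dedup_lines(corpus))  # -> ['Hello World', 'Goodbye']
```

Catching fuzzier near-duplicates (paraphrases, partial overlaps) would need approximate matching such as MinHash/LSH, as used by Lee et al. (2021) for substring-level deduplication.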
Literature: