Effect of filtering (near) duplicates

What would happen if we add OpusFilter, or some other (near) duplicate removal tool, to the pipeline?

Literature:

Broder (1997) On the resemblance and containment of documents
Gee (1998) "The Cornell TIPSTER Phase III Project" describes a research project using near-duplicate detection --> find publications that describe the methods used
Cooper, James W.; Coden, Anni R. und Brown, Eric W. (2002): A Novel Method for Detecting Similar Documents
Theobald et al. (2008) SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections
Tateishi and Kusui (2008) Fast Duplicated Documents Detection using Multi-level Prefix-filter
Pomikálek and Rychlý (2008) Detecting Co-Derivative Documents in Large Text Collections
Wu et al. (2011) Efficient Near-Duplicate Detection for Q&A Forum
Pecina et al. (2011) "Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation" apply the SpotSigs algorithm (Theobald et al., 2008) to monolingual training data of SMT models
Papavassiliou et al. (2013) "A modular open-source focused crawler for mining monolingual and bilingual corpora from the web" detect duplicate documents representing each document as a lists of the MD5 hashes of its paragraph
Schofield et al. (2018) "Quantifying the Effects of Text Duplication on Semantic Models" investigate the effect of duplicates by artificially creating duplicate text
Grave et al. (2018) "Learning Word Vectors for 157 Languages" remove lines with identical Java hash in training data for fastText word embeddings in 157 languages
Poddar et al. (2019) Train One Get One Free: Partially Supervised Neural Network for Bug Report Duplicate Detection and Clustering
Ortiz Suárez et al. (2019) "Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures" use runiq to deduplicate OSCAR
Rodier and Carter (2020) Online Near-Duplicate Detection of News Articles
Gyawali et al. (2020) Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings
Miletic et al. (2020) "Collecting Tweets to Investigate Regional Variation in Canadian English" detect near-duplicate tweets using hierarchical clustering
Lee et al. (2021) "Deduplicating Training Data Makes Language Models Better" find that de-duplicating the C4 training data of a transformer model lowers perplexity on Wiki-40B and one-billion word benchmark and reduces tendencies to emit sequences of 50 or more training tokens by an order of magnitude.
Armengol-Estapé et al (2021) "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan" use de-duplication for a Catalan BERT model. Effect not reported.
Kim et al. (2021) "Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts" identify duplicate books with differences in OCR errors and title changes
Anonymous (2021, ARR November submission) "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" deduplicate a Bangla training corpus for BERT. Method and effect are not reported.

jbrry / Irish-BERT

Effect of filtering (near) duplicates #73