Now that the boilerplate filtering should detect exact duplicates across all
documents and near-duplicates within each document, this pre-processing step
no longer makes sense. Concatenating documents in this way mostly causes
problems for sentence segmentation and, given the boilerplate filtering, has
no concrete benefit.