Stop combining documents with the same identifier during pre-processing

karlhigley / lexrank-summarizer

A Spark-based LexRank extractive summarizer for text documents

MIT License

Stop combining documents with the same identifier during pre-processing #34

Closed: karlhigley closed this issue 8 years ago

karlhigley commented 8 years ago

Now that the boilerplate filtering should detect exact duplicates across all documents and near-duplicates within each document, combining documents that share an identifier during pre-processing no longer makes sense. Concatenating documents this way mostly causes problems in sentence segmentation and, given the boilerplate filtering, has no concrete benefit.
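
To illustrate the segmentation problem, here is a minimal sketch in plain Python. It is not the project's Scala/Spark code: the sample records, the regex-based `split_sentences`, and the helpers `combine_by_id` and `keep_separate` are all hypothetical stand-ins. It only shows how joining texts that share an identifier can fuse the end of one document with the start of the next, while segmenting each document on its own avoids that.

```python
import re
from collections import defaultdict

# Hypothetical input: (identifier, text) pairs; two records share the id "a".
docs = [
    ("a", "Markets rallied on Monday"),
    ("a", "Subscribe to our newsletter. Markets rallied on Monday"),
    ("b", "Rain is expected this weekend."),
]

def split_sentences(text):
    # Naive stand-in segmenter (an assumption, not the project's segmenter):
    # split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def combine_by_id(records):
    # Roughly the behaviour being removed: join all text with the same id.
    grouped = defaultdict(list)
    for doc_id, text in records:
        grouped[doc_id].append(text)
    return [(doc_id, " ".join(texts)) for doc_id, texts in grouped.items()]

def keep_separate(records):
    # Proposed behaviour: segment every document independently.
    return [(doc_id, split_sentences(text)) for doc_id, text in records]

combined = [(doc_id, split_sentences(text)) for doc_id, text in combine_by_id(docs)]
separate = keep_separate(docs)

print(combined[0])  # first "sentence" mixes text from two documents
print(separate[:2])  # each document keeps its own sentence boundaries
```

In the combined case the first document has no terminal punctuation, so its text runs straight into the second document's opening sentence. Kept separate, the repeated sentence stays intact in both documents, which leaves it for the boilerplate filtering to handle.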