karlhigley / lexrank-summarizer

A Spark-based LexRank extractive summarizer for text documents
MIT License

Avoid creating graph edges between different documents #4

Closed karlhigley closed 9 years ago

karlhigley commented 9 years ago

This series of changes uses the document id to avoid creating edges between sentences in different documents. To do that, the document id has to be maintained throughout the featurization code, so it's carried along in the SentenceTokens and SentenceFeatures case classes. The document id is then used to pair up sentences from the same document, and the similarity of each pair is computed and compared against the threshold for creating graph edges (as before).
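For illustration, here is a minimal sketch of the same-document pairing idea. It is not the code from this PR: it uses plain Scala collections instead of Spark RDDs, and the field names (`docId`, `sentenceId`, `features`), the `SameDocumentEdges` object, the cosine similarity, and the strict `>` threshold comparison are all assumptions made for the example. It also assumes sentence ids are unique across the corpus.

```scala
// Hypothetical shape: the real SentenceFeatures case class may carry different fields.
case class SentenceFeatures(docId: String, sentenceId: Long, features: Map[Int, Double])

object SameDocumentEdges {
  // Cosine similarity between two sparse vectors (assumed similarity measure).
  def cosine(a: Map[Int, Double], b: Map[Int, Double]): Double = {
    val dot = a.iterator.map { case (i, v) => v * b.getOrElse(i, 0.0) }.sum
    val normA = math.sqrt(a.values.map(v => v * v).sum)
    val normB = math.sqrt(b.values.map(v => v * v).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Returns (sentenceId, sentenceId, similarity) edges, pairing only
  // sentences that share a document id.
  def edges(sentences: Seq[SentenceFeatures], threshold: Double): Seq[(Long, Long, Double)] =
    sentences.groupBy(_.docId).values.toSeq.flatMap { group =>
      for {
        s1 <- group
        s2 <- group
        if s1.sentenceId < s2.sentenceId // each unordered pair once, no self-pairs
        sim = cosine(s1.features, s2.features)
        if sim > threshold
      } yield (s1.sentenceId, s2.sentenceId, sim)
    }
}
```

With three identical sentences, two in document "a" (ids 1 and 2) and one in document "b" (id 3), `SameDocumentEdges.edges(sentences, 0.1)` yields a single edge between sentences 1 and 2. Sentence 3 gets no edge even though its features match, because it belongs to a different document. In Spark, the grouping step would presumably be expressed as a join or group on the document id rather than `groupBy` on a local `Seq`.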