karlhigley / lexrank-summarizer

A Spark-based LexRank extractive summarizer for text documents
MIT License
19 stars, 4 forks

Switch from a stopword list to dynamically identified stopwords #37

Closed by karlhigley 8 years ago

karlhigley commented 8 years ago

This PR switches from a static list of stopwords provided in a file to a dynamic list identified from the corpus being processed. The stopword list is now generated by selecting a configurable number of terms with the lowest IDF (inverse document frequency), i.e. the terms that appear in the largest share of the corpus. Additionally, stopwords are now filtered out when featurizing sentences, rather than in the document segmentation and tokenization step.
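For readers who want a concrete picture of the idea, here is a minimal plain-Python sketch of lowest-IDF stopword selection. It is not the code in this PR (which runs on Spark); the function names `identify_stopwords` and `featurize`, the `num_stopwords` parameter, the choice of sentences as the unit for document frequency, and the smoothed IDF formula `log((n + 1) / (df + 1))` are all assumptions made for illustration.

```python
import math
from collections import Counter


def identify_stopwords(sentences, num_stopwords):
    """Return the `num_stopwords` terms with the lowest IDF.

    `sentences` is a list of token lists. Hypothetical helper, not the PR's API.
    """
    n = len(sentences)
    # Document frequency: number of sentences each term appears in.
    doc_freq = Counter()
    for tokens in sentences:
        doc_freq.update(set(tokens))
    # Smoothed IDF (assumed formula); terms found everywhere score near zero.
    idf = {term: math.log((n + 1) / (df + 1)) for term, df in doc_freq.items()}
    # Lowest IDF first; ties are broken alphabetically to keep results stable.
    ranked = sorted(idf, key=lambda term: (idf[term], term))
    return set(ranked[:num_stopwords])


def featurize(tokens, stopwords):
    """Count terms in one sentence, dropping stopwords at featurization time."""
    return Counter(t for t in tokens if t not in stopwords)


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
stopwords = identify_stopwords(corpus, num_stopwords=1)
features = [featurize(tokens, stopwords) for tokens in corpus]
```

Because filtering happens in `featurize`, the tokenized sentences stay intact, which mirrors the PR's move of stopword removal out of the segmentation and tokenization step.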

Resolves #31.