karlhigley / lexrank-summarizer

A Spark-based LexRank extractive summarizer for text documents
MIT License

Improve tokenization to reduce dimensionality #19

Closed: karlhigley closed this issue 9 years ago

karlhigley commented 9 years ago

This removes tokens that occur only once in the corpus (which contribute nothing to cosine similarity between documents) and strips out non-alphabetic characters (which can otherwise cause essentially the same token to be counted twice, e.g. "word" and "word,").
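
A minimal sketch of what such a tokenization pass might look like, assuming the Spark RDD API; the object name `Tokenizer` and both method names are hypothetical and are not taken from this repository:

```scala
import org.apache.spark.rdd.RDD

object Tokenizer {
  // Keep only alphabetic characters so variants like "word," and "word"
  // collapse to the same token.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase
      .split("""[^a-z]+""")
      .filter(_.nonEmpty)
      .toSeq

  // Drop tokens that appear exactly once in the whole corpus. A singleton
  // token occurs in a single document, so it adds a dimension to the
  // term vectors without creating any overlap between documents.
  def removeSingletons(docs: RDD[Seq[String]]): RDD[Seq[String]] = {
    val counts = docs
      .flatMap(identity)
      .map(token => (token, 1L))
      .reduceByKey(_ + _)

    val singletons = counts
      .filter { case (_, n) => n == 1L }
      .keys
      .collect()
      .toSet

    // Broadcast the singleton set so each executor filters locally
    // instead of shipping the set with every task.
    val bc = docs.sparkContext.broadcast(singletons)
    docs.map(_.filterNot(bc.value.contains))
  }
}
```

Collecting the singleton set to the driver and broadcasting it is only reasonable if that set fits in memory; a join-based filter would avoid that assumption at the cost of an extra shuffle.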