karlhigley / lexrank-summarizer

A Spark-based LexRank extractive summarizer for text documents
MIT License

Precompute size of sparsified matrices (instead of auto-computation) #20

Closed karlhigley closed 9 years ago

karlhigley commented 9 years ago

LSH-based bucketing is best viewed as producing a sparser matrix, not a smaller one. On that understanding, this change sets the dimensions of every matrix built from bucketed sentences up front: the number of columns is the total number of sentences, and the number of rows is the largest feature index output by the TF-IDF transformers. Fixing the dimensions this way avoids having to compute the size of each matrix, and avoids errors when a column index is larger than the number of columns.

(The only apparent downside is using some extra memory when computing the column magnitudes.)
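
To illustrate the idea outside of Spark, here is a minimal pure-Python sketch, not the code from this change. It assumes sentences are columns and features are rows, as described above. The names `bucket_matrix`, `inferred_shape`, and `column_magnitudes` are hypothetical, as are the sample data and the choice to pass the row count as a plain `num_features` argument.

```python
import math
from typing import Dict, List, Tuple

# A sentence is (global_sentence_index, {feature_index: tfidf_weight}).
# This representation is an assumption made for the sketch.
Sentence = Tuple[int, Dict[int, float]]


def bucket_matrix(bucket: List[Sentence], num_features: int, num_sentences: int):
    """Build a coordinate-format matrix with precomputed dimensions.

    Rows are features and columns are sentences, so every bucket's matrix
    has the same shape no matter which sentences landed in the bucket.
    """
    entries = {}
    for col, features in bucket:
        if not 0 <= col < num_sentences:
            raise ValueError(f"sentence index {col} out of range")
        for row, weight in features.items():
            if not 0 <= row < num_features:
                raise ValueError(f"feature index {row} out of range")
            entries[(row, col)] = weight
    return {"shape": (num_features, num_sentences), "entries": entries}


def inferred_shape(bucket: List[Sentence]) -> Tuple[int, int]:
    """Shape inferred from the largest indices present in a non-empty bucket."""
    rows = max(row for _, features in bucket for row in features) + 1
    cols = max(col for col, _ in bucket) + 1
    return rows, cols


def column_magnitudes(matrix) -> List[float]:
    """L2 norm of each column; the result spans all sentences."""
    _, num_cols = matrix["shape"]
    sums = [0.0] * num_cols
    for (_, col), weight in matrix["entries"].items():
        sums[col] += weight * weight
    return [math.sqrt(s) for s in sums]


# Hypothetical corpus: 6 sentences, feature indices 0..4.
NUM_SENTENCES = 6
NUM_FEATURES = 5

bucket_a = [(0, {1: 3.0, 4: 4.0}), (2, {0: 1.0})]
bucket_b = [(5, {2: 2.0})]

matrix_a = bucket_matrix(bucket_a, NUM_FEATURES, NUM_SENTENCES)
matrix_b = bucket_matrix(bucket_b, NUM_FEATURES, NUM_SENTENCES)
magnitudes_a = column_magnitudes(matrix_a)
```

In this sketch, `inferred_shape` gives a different answer for each bucket, while `bucket_matrix` always returns the same shape. The extra memory mentioned above shows up in `magnitudes_a`: it has one slot per sentence in the corpus, and the slots for sentences outside the bucket are simply zero.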