LSH-based bucketing is best viewed as creating a sparser (not smaller) matrix.
Based on that understanding, this change uses the total number of sentences as
the column size, and the largest feature index output by the TF-IDF
transformers as the row size, of each matrix created from bucketed sentences.
Using these fixed dimensions avoids having to compute the size of each matrix,
and avoids errors when column indices exceed the number of columns.
(The only apparent downside is some extra memory used when computing the
column magnitudes.)
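A minimal sketch of the dimension choice described above (the function name,
tuple layout, and helper are illustrative assumptions, not the actual code in
this change): every bucket's matrix is built with the same global shape
(features x sentences), so a bucket only changes which entries are non-zero.

    # Hypothetical sketch: build one bucket's matrix with global dimensions.
    from scipy.sparse import csc_matrix

    def bucket_matrix(bucket_entries, n_features, n_sentences):
        """bucket_entries: list of (feature_index, sentence_index, weight)
        tuples for the sentences hashed into one LSH bucket.
        n_features: one more than the largest feature index produced by the
        TF-IDF transformers. n_sentences: total sentences across all buckets."""
        if bucket_entries:
            rows, cols, vals = zip(*bucket_entries)
        else:
            rows, cols, vals = (), (), ()
        # The fixed global shape keeps every sentence (column) index valid,
        # at the cost of carrying empty columns for sentences outside the
        # bucket, which is the extra memory noted above.
        return csc_matrix((vals, (rows, cols)), shape=(n_features, n_sentences))

Column magnitudes can then be computed on the shared shape, e.g.
m.multiply(m).sum(axis=0), without any per-bucket size bookkeeping.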