zenogantner closed this pull request 5 years ago
PR accepted, thank you Zeno. BTW, computing the hash on only a small portion of the data is anything but a clean solution (I'm criticizing myself... :) )
Yup, and even when looking just at a subset, a checksum like MD5 may make more sense than a simple sum: right now, the part of the function operating on the labels gives the same result for all label vectors with the same label frequencies ...
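For illustration, a minimal sketch (not the project's actual hashing code, just an assumption of how the comparison plays out) of why a plain sum collides for reordered labels while an MD5 digest does not:

```python
# Two label vectors with the same label frequencies: a simple sum cannot
# distinguish them, but an MD5 checksum over their raw bytes can.
import hashlib
import numpy as np

y1 = np.array([0, 0, 1, 1, 2])
y2 = np.array([1, 0, 2, 0, 1])  # same frequencies, different order

print(y1.sum() == y2.sum())  # True -> the sums collide

md5_1 = hashlib.md5(y1.tobytes()).hexdigest()
md5_2 = hashlib.md5(y2.tobytes()).hexdigest()
print(md5_1 == md5_2)  # False -> the checksum also reflects the ordering
```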
When using a sparse matrix, e.g. scipy.sparse.csr_matrix, we otherwise get the error message: "NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported". With the .sum() method, it works for both sparse and dense (numpy array) matrices.
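A minimal reproduction sketch (not the patched library code itself, just the behavior described above), assuming scipy and numpy are installed:

```python
# Adding a nonzero scalar to a scipy sparse matrix raises NotImplementedError,
# while the .sum() method works for both sparse and dense inputs.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[1.0, 0.0], [0.0, 2.0]])
sparse = csr_matrix(dense)

print(dense.sum(), sparse.sum())  # 3.0 3.0 -> .sum() works in both cases

try:
    sparse + 1.0  # this is what triggers the error for sparse inputs
except NotImplementedError as err:
    print(err)
```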
This little change allows us to handle much bigger sparse datasets. The memory saving depends on the dataset; I observed a factor of 7 for a dataset with about 5% density.