I have input samples as a sparse matrix of shape (531990 samples, 85765 features).
This matrix takes about 56KB in memory in sparse form; densified to a numpy array it would be approximately 340GB.
When I use the MemoryStorage option I run out of memory. This is due to the vec = vec.tocsr() call in the unitvec function. The input vectors added by _storevector are scipy.sparse.csr.csr_matrix of shape (85765, 1), because trying to store vectors as scipy.sparse.csr.csr_matrix of shape (1, 85765) gives:
  File "nearpy/engine.py", line 96, in store_vector
    for bucket_key in lshash.hash_vector(v):
  File "nearpy/hashes/randombinaryprojections.py", line 74, in hash_vector
    projection = self.normals_csr.dot(v)
  File "scipy/sparse/base.py", line 359, in dot
    return self * other
  File "scipy/sparse/base.py", line 479, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch
Removing the vec = vec.tocsr() line solves the problem for matrices of shape (85765, 1), and no extra memory is allocated. This is strange behavior and might be a scipy bug, but what is the point of the .tocsr() conversion anyway?
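For reference, a minimal sketch that reproduces the dimension mismatch, with the dimensions shrunk down; here normals_csr is a hypothetical stand-in for NearPy's random-projection matrix of shape (n_projections, dim), not the library's actual initialization code:

```python
import scipy.sparse as sp

dim = 85765  # feature count from the issue

# stand-in for NearPy's normals_csr: one random projection per row
normals_csr = sp.random(10, dim, density=0.001, format='csr', random_state=0)

# a single sample, stored as a sparse column vector (dim, 1)
col = sp.random(dim, 1, density=0.0001, format='csr', random_state=1)
row = col.T.tocsr()  # same data as a row vector (1, dim)

# (10, dim) . (dim, 1) -> (10, 1): works
projection = normals_csr.dot(col)
print(projection.shape)

# (10, dim) . (1, dim): inner dimensions disagree, scipy raises ValueError
try:
    normals_csr.dot(row)
except ValueError as e:
    print(e)
```

This is the expected matrix-multiplication rule rather than a bug in the dot itself: the projection matrix multiplies a column vector from the left, so only the (dim, 1) orientation is conformable.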