pixelogik / NearPy

Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes.
MIT License

Calling store_vector with MemoryStorage on scipy.sparse.csr_matrix allocates memory when it shouldn't. #93

Open Apkar029 opened 3 years ago

Apkar029 commented 3 years ago

I have input samples as a sparse matrix of shape (531990 samples, 85765 features).

As a sparse structure this matrix takes about 56 KB in memory; as a dense numpy array it would be approximately 340 GB.
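The dense estimate is easy to sanity-check with a bit of arithmetic (assuming float64 entries):

```python
n_samples, n_features = 531990, 85765
dense_bytes = n_samples * n_features * 8   # 8 bytes per float64 entry
print(dense_bytes / 2**30)                 # ~340 GiB, matching the estimate above
```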

When I use the MemoryStorage option I run out of memory. This is due to the vec = vec.tocsr() call in the unitvec function. The vectors passed to store_vector are scipy.sparse.csr.csr_matrix objects of shape (85765, 1), since trying to store them with shape (1, 85765) gives:

File "nearpy/engine.py", line 96, in store_vector
  for bucket_key in lshash.hash_vector(v):
File "nearpy/hashes/randombinaryprojections.py", line 74, in hash_vector
  projection = self.normals_csr.dot(v)
File "scipy/sparse/base.py", line 359, in dot
  return self * other
File "scipy/sparse/base.py", line 479, in __mul__ raise ValueError('dimension mismatch')
ValueError: dimension mismatch
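A minimal sketch of the failing setup looks like this. The dimension comes from my data; the hash name, projection count, and nonzero positions are arbitrary illustrative choices:

```python
import numpy as np
import scipy.sparse

from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections
from nearpy.storage import MemoryStorage

dim = 85765  # feature count from the data set above

# Hash name 'rbp' and 10 projections are illustrative, not significant.
engine = Engine(dim,
                lshashes=[RandomBinaryProjections('rbp', 10)],
                storage=MemoryStorage())

# One sample as a sparse *column* vector of shape (dim, 1) with a few
# arbitrary nonzeros. A *row* vector of shape (1, dim) instead raises the
# "dimension mismatch" error shown in the traceback above.
rows = [5, 42, 1000]
v = scipy.sparse.csr_matrix((np.ones(len(rows)), (rows, [0, 0, 0])),
                            shape=(dim, 1))

# Storing many such vectors exhausts memory, which traces back to the
# vec.tocsr() call inside unitvec.
engine.store_vector(v, 'sample-0')
```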

Removing the vec = vec.tocsr() line solves the problem for matrices of shape (85765, 1): no extra memory is allocated. This is strange behavior, since tocsr() on a matrix that is already CSR should return it unchanged, so it might be a scipy bug. But what is the point of the .tocsr() conversion in the first place?
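For context, the sparse branch of unitvec looks roughly like the following. This is a paraphrase, not the library's exact code, and the dense branch here is a simplified stand-in:

```python
import numpy
import scipy.sparse

def unitvec(vec):
    """Scale a vector to unit length; the zero vector is returned unchanged."""
    if scipy.sparse.issparse(vec):
        # The conversion in question. For input that is already CSR,
        # tocsr() should return the same object without copying, which
        # makes the allocation observed above surprising.
        vec = vec.tocsr()
        veclen = numpy.sqrt(numpy.sum(vec.data ** 2))
        return vec / veclen if veclen > 0.0 else vec
    # Simplified dense branch (an assumption, not the library's exact code).
    vec = numpy.asarray(vec, dtype=float)
    veclen = numpy.linalg.norm(vec)
    return vec / veclen if veclen > 0.0 else vec
```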