pixelogik / NearPy

Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes.
MIT License

Memory Usage #12

Closed kharazi closed 10 years ago

kharazi commented 10 years ago

Hi, I have a problem with memory usage. My vector dimension is 2**14; I know that's very large, but in #3 you mention converting vectors to sparse matrices. This is my Redis memory usage when my vector count is 239:

   # Memory
   used_memory:178110184
   used_memory_human:169.86M
   used_memory_rss:181440512
   used_memory_peak:178248520
   used_memory_peak_human:169.99M
   used_memory_lua:33792
   mem_fragmentation_ratio:1.02
   mem_allocator:jemalloc-3.0.0

Why does it use about 169 MB of RAM? Is that normal for my data size?

pixelogik commented 10 years ago

In dense representation, one vector in that space has 16384 coordinates, so 239 vectors take 3,915,776 coordinates. Stored as doubles (usually 8 bytes each), that comes to about 31 MB. But this is just the raw data in memory.
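That back-of-the-envelope estimate is quick to verify (assuming 8-byte doubles):

```python
# Estimate raw memory for 239 dense vectors of dimension 2**14,
# assuming each coordinate is stored as an 8-byte double.
dim = 2 ** 14          # 16384 coordinates per vector
count = 239            # number of stored vectors
bytes_per_coord = 8    # typical size of a C double

coords = dim * count                 # total coordinates
raw_bytes = coords * bytes_per_coord # raw in-memory footprint
raw_mb = raw_bytes / 10**6           # in (decimal) megabytes

print(coords, raw_bytes, round(raw_mb, 1))  # 3915776 31326208 31.3
```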

The Redis layer, however, stores vectors and their associated data (most of the time just a string) as JSON strings! That adds a lot of overhead and is very likely the reason for your observation.
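To see how much JSON blows up a dense vector, here is a rough comparison (a sketch of naive list-of-floats serialization, not NearPy's exact format):

```python
import json
import numpy as np

# One dense vector of dimension 2**14, serialized the naive way
# (as a JSON list of floats) versus its raw binary size.
v = np.random.rand(2 ** 14)
raw_bytes = v.nbytes                       # 8 bytes per double
json_bytes = len(json.dumps(v.tolist()))   # decimal text plus commas/brackets

# The textual representation is typically 2x or more the raw size.
print(raw_bytes, json_bytes, round(json_bytes / raw_bytes, 2))
```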

Are your vectors sparse? If so, and most values are zero anyway, I should soon finish a branch that adds support for sparse vectors. The Redis footprint is of course much smaller with the sparse representation.

kharazi commented 10 years ago

Thanks for the explanation. Yes, most values in my vectors are zero. Can you say when you will publish it?

pixelogik commented 10 years ago

Will take a look at the branch and let you know...

pixelogik commented 10 years ago

I added support for sparse vectors to processing, hashes, and Redis storage. Check the last three commits.

For calculations (dot products / projections) NearPy now uses CSR format. For storing into Redis it uses COO format. See http://docs.scipy.org/doc/scipy/reference/sparse.html for further details.
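As a sketch of the two formats (using plain scipy, not NearPy's internals): COO is just three parallel arrays, which makes it easy to serialize, while CSR supports fast arithmetic such as the dot products used in projections.

```python
import numpy as np
import scipy.sparse

# Build a sparse column vector in COO format: three parallel arrays
# of row indices, column indices, and values -- easy to serialize.
rows = np.array([2, 7, 11])
cols = np.array([0, 0, 0])
data = np.array([1.0, 2.0, 3.0])
v_coo = scipy.sparse.coo_matrix((data, (rows, cols)), shape=(16, 1))

# Convert to CSR for fast arithmetic (dot products / projections).
v_csr = v_coo.tocsr()

# A random projection row (1 x 16) applied to the vector (16 x 1):
proj = scipy.sparse.rand(1, 16, density=0.5, format='csr')
result = proj.dot(v_csr)   # 1x1 sparse result

print(v_csr.nnz, result.shape)
```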

So if your vectors are sparse, you should now use actual sparse vectors/matrices, as supported by scipy.

pixelogik commented 10 years ago

Your sparse vectors must have a shape of (n, 1), where n is your feature-space dimension. If that is the case, you can use NearPy as usual.

Random sparse vectors can, for example, be generated like this: scipy.sparse.rand(30, 1, density=0.3); the shape in this case is (30, 1).
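A quick sanity check of that shape requirement (the dimension 30 and density 0.3 are just the values from the example above):

```python
import scipy.sparse

# Generate a random sparse column vector: shape (n, 1) with n = 30,
# with roughly 30% of the entries non-zero.
v = scipy.sparse.rand(30, 1, density=0.3)

# NearPy expects column vectors, i.e. a (n, 1) shape.
print(v.shape, v.nnz)
```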

kharazi commented 10 years ago

Thanks, that's very helpful. I was having trouble with memory :)