pydata/sparse

Sparse multi-dimensional arrays for the PyData ecosystem
https://sparse.pydata.org
BSD 3-Clause "New" or "Revised" License

Memory usage - coords waste #335

Open mbarbry opened 4 years ago

mbarbry commented 4 years ago

Dear developers,

Description

In my code I'm using sparse to handle large data (> 10 GB), and I noticed higher memory usage from the sparse library than I expected. Comparing a 2D matrix with scipy.sparse, I realized that sparse uses significantly more memory than scipy.sparse. Below is the memory consumption of the small example code included at the bottom (obtained with the memory_profiler library):

Line #    Mem usage    Increment   Line Contents
================================================
 7     99.5 MiB     99.5 MiB   @profile
 8                             def check_conv(N1, N2, N3):
 9    226.4 MiB    126.9 MiB       A = sp.random(N1, N2, density=0.12, format="coo")
10                             
11    447.0 MiB    220.6 MiB       B = sparse.COO.from_scipy_sparse(A)
12    636.3 MiB    189.3 MiB       return B.reshape((N3, N2, N2))

We see sparse.COO using about 220 MiB while scipy.sparse uses only about 127 MiB. Investigating the memory usage inside sparse.COO, I found a large amount of memory consumed by the lines

246    415.6 MiB    126.1 MiB           self.coords = self.coords.astype(np.intp, copy=False)

and

276    510.1 MiB     94.3 MiB               self._sort_indices()

If I comment line 246 in the file sparse/_coo/core.py then the memory usage is significantly smaller.

Line #    Mem usage    Increment   Line Contents
================================================
 7     99.2 MiB     99.2 MiB   @profile
 8                             def check_conv(N1, N2, N3):
 9    226.3 MiB    127.1 MiB       A = sp.random(N1, N2, density=0.12, format="coo")
10                             
11    383.8 MiB    157.5 MiB       B = sparse.COO.from_scipy_sparse(A)
12    573.1 MiB    189.2 MiB       return B.reshape((N3, N2, N2))

A gain of around 60 MiB. My question is: why does line 246 in sparse/_coo/core.py appear to copy the memory even though copy=False is passed, and how can I avoid it? Also, is there a way to avoid sorting the indices at line 276 when converting the matrix from scipy.sparse?
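
For reference, np.ndarray.astype(..., copy=False) only skips the copy when no conversion is needed; when the dtype actually changes, a copy is unavoidable. A minimal demonstration:

import numpy as np

a = np.zeros(10, dtype=np.int32)
b = a.astype(np.intp, copy=False)  # copy=False only avoids the copy if no conversion is needed
print(b is a)                                # False on a 64-bit platform, where np.intp is int64
print(a.astype(np.int32, copy=False) is a)   # True: same dtype, so no copy is made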

Example Code

from __future__ import division
import numpy as np
import scipy.sparse as sp
import sparse
from memory_profiler import profile

@profile
def check_conv(N1, N2, N3):
    A = sp.random(N1, N2, density=0.12, format="coo")

    B = sparse.COO.from_scipy_sparse(A)
    return B.reshape((N3, N2, N2))

check_conv(453264, 152, 2982)
hameerabbasi commented 4 years ago

Hello, you can pass the sorted=True flag to avoid sorting the contents, and has_duplicates=False to avoid deduplication. Beware that there will be issues if you do this with coordinates that aren't actually sorted or that contain duplicates.
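
For example, a minimal sketch of constructing the COO directly with these flags, assuming the scipy matrix has first been made canonical (deduplicated and sorted row-major), e.g. by round-tripping through CSR:

import numpy as np
import scipy.sparse as sp
import sparse

A = sp.random(453264, 152, density=0.12, format="coo")
A = A.tocsr().tocoo()  # deduplicate and sort the entries on the scipy side

# has_duplicates=False and sorted=True skip the deduplication and sorting
# passes in the constructor; this is only safe because the coordinates
# above are already canonical.
coords = np.stack([A.row, A.col])
B = sparse.COO(coords, A.data, shape=A.shape,
               has_duplicates=False, sorted=True)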

Also, the higher memory usage is due to the format: we use COO, which usually has lower compression efficiency than CSR, except for hypersparse arrays.
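
A rough way to see the difference (a back-of-the-envelope sketch, not library API): COO stores a row index and a column index per nonzero, while CSR stores a column index per nonzero plus one row pointer per row:

import scipy.sparse as sp

A = sp.random(453264, 152, density=0.12, format="coo")
C = A.tocsr()

coo_index_bytes = A.row.nbytes + A.col.nbytes         # two indices per nonzero
csr_index_bytes = C.indices.nbytes + C.indptr.nbytes  # one index per nonzero + indptr
print(coo_index_bytes, csr_index_bytes)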

daletovar commented 4 years ago

I think the main factor here is that np.intp typically upcasts 32-bit ints to 64-bit ints on 64-bit platforms. @mbarbry, if you run your code example and check the dtypes of the coordinate arrays, I think you'll see that A.row.dtype is dtype('int32') whereas B.coords.dtype is dtype('int64'). This is related to #249.
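
A quick check (assuming a 64-bit platform, where np.intp is int64):

import scipy.sparse as sp
import sparse

A = sp.random(453264, 152, density=0.12, format="coo")
B = sparse.COO.from_scipy_sparse(A)

print(A.row.dtype)     # int32: scipy picks 32-bit indices when they fit
print(B.coords.dtype)  # int64: coords were cast to np.intp, doubling index memory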

mbarbry commented 4 years ago

Thank you for your answers. What @daletovar describes seems to be the issue. So from what I read in #249, there is no actual fix for such situations yet?

daletovar commented 4 years ago

I don't know exactly what kinds of bugs occurred when using other dtypes to store the coordinates (@hameerabbasi might be able to answer this), but you could perhaps try commenting out the conversion. Depending on what you're trying to do, the GCXS format could also be useful; you would have to clone from GitHub to use it.
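
A minimal sketch of trying GCXS (assuming a build of sparse that includes it, which at the time meant installing from the GitHub master branch):

import scipy.sparse as sp
import sparse

A = sp.random(453264, 152, density=0.12, format="csr")

# GCXS generalizes CSR/CSC-style compressed indices to n dimensions,
# avoiding COO's two-indices-per-nonzero overhead.
C = sparse.GCXS.from_scipy_sparse(A)
print(C.nbytes)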

hameerabbasi commented 4 years ago

Yes, it was complex. We had overflows, and lots of them in different places. 🤷‍♂️ I gave up at some point and moved to np.int64.