Closed: acycliq closed this issue 5 months ago
Hello, this can be explained by the fact that the indices in COO are stored as np.intp instead of np.int32, which allows larger sparse arrays to be stored. The following code snippet illustrates that:
>>> import numpy as np
>>> import scipy.sparse
>>> import sparse
>>> a = np.random.default_rng().random((100, 100, 100))
>>> a[a < 0.9] = 0
>>> s_list = [scipy.sparse.coo_matrix(d) for d in a]
>>> s = sparse.COO.from_numpy(a)
>>> s.nbytes / (1024**2)
3.03948974609375
>>> sum(d.data.nbytes + d.row.nbytes + d.col.nbytes for d in s_list) / (1024**2)
1.519744873046875
>>> s.coords.dtype
dtype('int64')
>>> (s.coords.dtype, s.data.dtype)
(dtype('int64'), dtype('float64'))
>>> (s_list[0].row.dtype, s_list[0].col.dtype, s_list[0].data.dtype)
(dtype('int32'), dtype('int32'), dtype('float64'))
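The factor of roughly two follows directly from the per-element storage. A back-of-the-envelope check (my own arithmetic, not output from either library):
>>> # 3D COO: three 8-byte intp coordinates + one float64 value per stored element;
>>> # each 2D scipy matrix: one int32 row, one int32 col + one float64 value
>>> (3 * 8 + 8) / (2 * 4 + 8)
2.0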
We did have routines to compress the dtype of s.coords to the smallest type possible for the array, but in many cases this led to overflows and bugs, so we reverted it to np.intp only. As part of the work under #618, we are planning to let the user have more control over this in the compiler backend.
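To make the overflow risk concrete, here is a toy illustration (my own example, not the actual bug that triggered the revert): every coordinate fits comfortably in int32, but linearising coordinates for a sufficiently large shape wraps around.
>>> import numpy as np
>>> shape = (100_000, 100_000)                        # each dimension fits in int32
>>> row32 = np.array([99_999], dtype=np.int32)
>>> print(row32 * shape[1] + 99_999)                  # int32 arithmetic wraps around silently
[1410065407]
>>> print(row32.astype(np.intp) * shape[1] + 99_999)  # intp gives the intended linear index
[9999999999]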
Description
I was wondering if there is any insight as to why a sparse.COO seems to consume almost twice as much memory as a list of scipy.sparse.coo_matrices. Maybe I am missing something. For example, the code below is copied from the getting started page of the documentation; I have just extended it to include a case where the dense 3D numpy array is converted into a list of coo_matrices, and calculated the memory footprint.
Example Code
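Roughly, the extended example looks like this (a reconstruction based on the getting started snippet and the figures below; x, s and nbytes_list are the names used in the text, and the exact numbers vary with the random data):
import numpy as np
import scipy.sparse
import sparse

# Dense 3D array from the getting started example: ~10% of entries are non-zero
x = np.random.default_rng().random((100, 100, 100))
x[x < 0.9] = 0

# Single 3D pydata/sparse COO array
s = sparse.COO(x)

# The same data as a list of 2D scipy.sparse.coo_matrix objects, one per plane
coo_list = [scipy.sparse.coo_matrix(plane) for plane in x]
nbytes_list = [m.data.nbytes + m.row.nbytes + m.col.nbytes for m in coo_list]

print(x.nbytes / (1024**2))          # ~7.629 MB (dense)
print(s.nbytes / (1024**2))          # ~3.04 MB (COO)
print(sum(nbytes_list) / (1024**2))  # ~1.52 MB (list of coo_matrix)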
The size of the dense array is 7.6 MB:
x.nbytes / (1024**2) = 7.629 MB
The size of the COO is 3 MB:
s.nbytes / (1024**2) = 3.038 MB
Now make a list of scipy.sparse.coo_matrices. The size of the list is 1.5 MB:
sum(nbytes_list) / (1024**2) = 1.519 MB
In fact this can be simplified even further: compare a 2D COO with a scipy.sparse.coo_matrix (instead of a 3D COO vs a list). For example:
sparse.COO(x[0]).nbytes = 23448
but coo_matrix(x[0]).data.nbytes + coo_matrix(x[0]).row.nbytes + coo_matrix(x[0]).col.nbytes = 15632
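Those two figures line up with the intp-vs-int32 coordinate storage: per stored non-zero, the 2D COO holds two 8-byte coordinates plus one 8-byte value, while the coo_matrix holds two 4-byte coordinates plus one 8-byte value. A quick check (my own arithmetic; both numbers imply 977 stored non-zeros in that slice):
23448 / (2 * 8 + 8) = 977.0  # 2D COO: two intp coords + one float64 value per non-zero
15632 / (2 * 4 + 8) = 977.0  # coo_matrix: two int32 coords + one float64 value per non-zero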