Closed: acycliq closed this issue 5 months ago
Hello, this can be explained by the fact that the indices in COO are stored as np.intp instead of np.int32, which allows larger sparse arrays to be stored. The following code snippet illustrates that:
>>> import numpy as np
>>> import scipy.sparse
>>> import sparse
>>> a = np.random.default_rng().random((100, 100, 100))
>>> a[a < 0.9] = 0
>>> s_list = [scipy.sparse.coo_matrix(d) for d in a]
>>> s = sparse.COO.from_numpy(a)
>>> s.nbytes / (1024**2)
3.03948974609375
>>> sum(d.data.nbytes + d.row.nbytes + d.col.nbytes for d in s_list) / (1024**2)
1.519744873046875
>>> s.coords.dtype
dtype('int64')
>>> (s.coords.dtype, s.data.dtype)
(dtype('int64'), dtype('float64'))
>>> (s_list[0].row.dtype, s_list[0].col.dtype, s_list[0].data.dtype)
(dtype('int32'), dtype('int32'), dtype('float64'))
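The factor of roughly two follows directly from the per-element storage. A back-of-the-envelope check (my own arithmetic, not output from either library):
>>> # 3D COO: three 8-byte intp coordinates + one float64 value per stored element;
>>> # each 2D scipy matrix: one int32 row, one int32 col + one float64 value
>>> (3 * 8 + 8) / (2 * 4 + 8)
2.0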
We did have routines to compress the dtype of s.coords to the smallest type possible for the array, but in many cases this led to overflows and bugs, so we reverted it to np.intp only. As part of the work under #618, we are planning to let the user have more control over this in the compiler backend.
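To make the overflow risk concrete, here is a toy illustration (my own example, not the actual bug that triggered the revert): every coordinate fits comfortably in int32, but linearising coordinates for a sufficiently large shape wraps around.
>>> import numpy as np
>>> shape = (100_000, 100_000)                        # each dimension fits in int32
>>> row32 = np.array([99_999], dtype=np.int32)
>>> print(row32 * shape[1] + 99_999)                  # int32 arithmetic wraps around silently
[1410065407]
>>> print(row32.astype(np.intp) * shape[1] + 99_999)  # intp gives the intended linear index
[9999999999]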
Description
I was wondering if there is any insight as to why a sparse.COO seems to consume almost twice as much memory as a list of scipy.sparse.coo_matrices. Maybe I am missing something. For example, the code below is copied from the getting started page of the documentation; I have just extended it to include a case where the dense 3D numpy array is converted into a list of coo_matrices, and calculated the memory footprint.
Example Code
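Roughly, the extended example looks like this (a reconstruction based on the getting started snippet and the figures below; x, s and nbytes_list are the names used in the text, and the exact numbers vary with the random data):
import numpy as np
import scipy.sparse
import sparse

# Dense 3D array from the getting started example: ~10% of entries are non-zero
x = np.random.default_rng().random((100, 100, 100))
x[x < 0.9] = 0

# Single 3D pydata/sparse COO array
s = sparse.COO(x)

# The same data as a list of 2D scipy.sparse.coo_matrix objects, one per plane
coo_list = [scipy.sparse.coo_matrix(plane) for plane in x]
nbytes_list = [m.data.nbytes + m.row.nbytes + m.col.nbytes for m in coo_list]

print(x.nbytes / (1024**2))          # ~7.629 MB (dense)
print(s.nbytes / (1024**2))          # ~3.04 MB (COO)
print(sum(nbytes_list) / (1024**2))  # ~1.52 MB (list of coo_matrix)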
The size of the dense array is 7.6 MB:
x.nbytes / (1024**2) = 7.629 MB
The size of the COO is 3 MB:
s.nbytes / (1024**2) = 3.038 MB
Now make a list of scipy.sparse.coo_matrices. The size of the list is 1.5 MB:
sum(nbytes_list) / (1024**2) = 1.519 MB
In fact this can be simplified even further: compare a 2D COO with a scipy.sparse.coo_matrix (instead of a 3D COO vs a list). For example:
sparse.COO(x[0]).nbytes = 23448
but coo_matrix(x[0]).data.nbytes + coo_matrix(x[0]).row.nbytes + coo_matrix(x[0]).col.nbytes = 15632
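Those two figures line up with the intp-vs-int32 coordinate storage: per stored non-zero, the 2D COO holds two 8-byte coordinates plus one 8-byte value, while the coo_matrix holds two 4-byte coordinates plus one 8-byte value. A quick check (my own arithmetic; both numbers imply 977 stored non-zeros in that slice):
23448 / (2 * 8 + 8) = 977.0  # 2D COO: two intp coords + one float64 value per non-zero
15632 / (2 * 4 + 8) = 977.0  # coo_matrix: two int32 coords + one float64 value per non-zero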