pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.31k stars 17.8k forks source link

BUG: SparseArray is not implemented as index first #45126

Open bdrum opened 2 years ago

bdrum commented 2 years ago

Pandas version checks

Reproducible Example

import pandas as pd
import numpy as np

s1 = pd.arrays.SparseArray([1,0,-1,np.nan], fill_value=np.nan)
s1
# [1.0, 0.0, -1.0, nan]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2])
s2 = s1 + 1

# wrong SA after comparison with other SA
s1 > s2
# [False, False, False, False]
# Fill: False
# IntIndex
# Indices: array([0, 1, 2])
# should be 
# Indices: array([])

# wrong SA after comparison with other ndarray
a = np.array([0,1,2,3])
s1 > a

# [True, False, False, False]
# Fill: False
# IntIndex
# Indices: array([0, 1, 2, 3])
# should be
# Indices: array([0])

# wrong SA after filling na values
s1.fillna(-1)

# [1.0, 0.0, -1.0, -1]
# Fill: -1
# IntIndex
# Indices: array([0, 1, 2])
# should be
# Indices: array([0, 1])

# wrong SA in case of changing the fill_value
s1.fill_value = -1
s1
# [1.0, 0.0, -1.0, -1]
# Fill: -1
# IntIndex
# Indices: array([0, 1, 2])
# should be 
# Indices: array([0, 1])

# wrong SA in case of arithmetic operations:
sa = pd.arrays.SparseArray([0,0,np.nan,-2,-1,4,2,3,0,0],fill_value=0)
sa
# [0, 0, nan, -2.0, -1.0, 4.0, 2.0, 3.0, 0, 0]
# Fill: 0
# IntIndex
# Indices: array([2, 3, 4, 5, 6, 7])

sa + 1
# [1.0, 1.0, nan, -1.0, 0.0, 5.0, 3.0, 4.0, 1.0, 1.0]
# Fill: 1.0
# IntIndex
# Indices: array([2, 3, 4, 5, 6, 7])

# Indices of result are wrong. It should be:
pd.arrays.SparseArray([1.0, 1.0, np.nan, -1.0, 0.0, 5.0, 3.0, 4.0, 1.0, 1.0], fill_value=0)
# [1.0, 1.0, nan, -1.0, 0, 5.0, 3.0, 4.0, 1.0, 1.0]
# Fill: 0
# IntIndex
# Indices: array([0, 1, 2, 3, 5, 6, 7, 8, 9])

Issue Description

In current implementation of SparseArray class usually doesn't work with own indices.

When I has starting to work with SparseArray in pandas, I've noticed, that many operations (unary, binary and some methods) work with SparseArray the same as basic ndarray. In some case it uses explicit (to_dense) cast, in the other case it apply operations only to internal sp_values array and doesn't check the indices. As a result we have wrong SparseArray instances (see examples above). Pay attention that printed arrays are correct, but indices are not.

Anyway in both case the logic doesn't fit the nature of Sparse array.

I suggest to discuss new private api that will be based on sparse logic, because only this way will lead us to effective structure that Sparse Array represents.

There are few open questions about the API:

Links for study:

phofl commented 2 years ago

Please post actual expected outputs, they are just the same as above

bdrum commented 2 years ago

Not the same, pls see indices.

phofl commented 2 years ago

Ah got you, could you highlight them somehow?

bdrum commented 2 years ago

Yeah, looks like not so very demonstrative example. I will change to some with well seen that contains just one-two values .

bdrum commented 2 years ago

What about SparseDType and fill_value. I dup my comment from mentioned issue

Now we have situation when SparseDType keeps both the type of non zero elements and the VALUE of fill_value. Why do we need keep fill_value in sparsedtype? I think this is absolutely no reason to do this.

The crucial thing of sparse array (as I understand, but I'm not a specialist in this topic) is only non zero values and indices. There is no matter what exactly fill_value we have, zero, nan, or another because we keep in memory only non zero elements array and their positions i.e. indices.

In Julia, sparse vectors have the type SparseVector{Tv,Ti} where Tv is the type of the stored values and Ti the integer type for the indices. The internal representation is as follows:

struct SparseVector{Tv,Ti<:Integer} <: AbstractSparseVector{Tv,Ti}
    n::Int              # Length of the sparse vector
    nzind::Vector{Ti}   # Indices of stored values
    nzval::Vector{Tv}   # Stored values, typically nonzeros
end

Looks like more rational to go on this way.

Sure this means dramatically changes in API, but anyway now SparseArray in pandas doesn't looks like working mechanism. It doesn't support the most usable formats (CRS or CCS).

@jreback I suggest to change BUG label to API for this issue.

Also as I see, one of the main idea of series and this is the crucial part in comparison with just an array is indices. But in SparseArray indices is also crucial thing. So if we will marry Series and SparseArray we will get SparseSeries that looks more natural for me:)