pysal / libpysal

Core components of Python Spatial Analysis Library
http://pysal.org/libpysal

Ensure `Graph.sparse` is robust enough #687

Closed martinfleis closed 5 months ago

martinfleis commented 7 months ago

I've recently been reading some adjacencies with fastparquet instead of pyarrow and noticed that the resulting sparse array is different even though the Series are considered equal. The reason is that the MultiIndex is created differently, leading to different underlying MultiIndex.codes, which, I suppose, somehow changes the conversion to sparse. We need to revise this and ensure that the way pandas creates sparse is robust enough and always yields a properly sorted array, or deploy our own way of generating sparse (which we had at some point; since I wrote it, it will be in the commit history).

This is mostly a note to myself to check this before Graph leaves the experimental zone.
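The mechanism can be reproduced with pandas alone: two MultiIndex objects can compare equal element-wise while carrying different internal levels/codes. This is a minimal hypothetical sketch, not the parquet round trip itself:

```python
import pandas as pd

# Two MultiIndexes with identical labels but different internal encodings:
# one factorized the default way, one built with a shuffled level order.
mi_default = pd.MultiIndex.from_arrays([[0, 0, 1, 1], ["a", "b", "a", "b"]])
mi_shuffled = pd.MultiIndex(
    levels=[[0, 1], ["b", "a"]],
    codes=[[0, 0, 1, 1], [1, 0, 1, 0]],
)

# The indexes are equal as far as pandas comparisons are concerned...
assert mi_default.equals(mi_shuffled)

# ...but the raw codes (which positional conversions may rely on) differ.
assert list(mi_default.codes[1]) != list(mi_shuffled.codes[1])
```

Any conversion that reads `codes` as positional coordinates will therefore see two different arrays for what pandas reports as equal indexes.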

ljwolf commented 7 months ago

Is there a reproducible example we can use to test against? Round-tripping was a core design goal...

martinfleis commented 7 months ago

Yes, here:

In [1]: import pandas as pd
   ...: from libpysal.graph import Graph

In [2]: path = "https://github.com/Urban-Analytics-Technology-Platform/demoland-engine/raw/3fba672d08d7d09bdb692c35594064dbcdd69662/data/tyne_and_wear/matrix.parquet"

In [3]: adjacency_pyarrow = pd.read_parquet(path, engine="pyarrow")["weight"]
   ...: adjacency_fastparquet = pd.read_parquet(path, engine="fastparquet")["weight"]

In [4]: pd.testing.assert_series_equal(adjacency_pyarrow, adjacency_fastparquet)

In [5]: graph_pyarrow = Graph(adjacency_pyarrow, "r", is_sorted=True)
   ...: graph_fastparquet = Graph(adjacency_fastparquet, "r", is_sorted=True)

In [6]: graph_pyarrow.equals(graph_fastparquet)
Out[6]: True

In [7]: sparse_pyarrow = graph_pyarrow.sparse
   ...: sparse_fastparquet = graph_fastparquet.sparse

In [8]: (sparse_pyarrow == sparse_fastparquet).todense().all()
Out[8]: False

In [9]: sparse_pyarrow.todense()[0][0] # <- this is correct
Out[9]: 0.0

In [10]: sparse_fastparquet.todense()[0][0] # <- this is wrong
Out[10]: 0.007518796992481203

I haven't looked into pandas source code, but my guess is that it is because:

In [11]: (adjacency_pyarrow.index.codes[1] == adjacency_fastparquet.index.codes[1]).all()
Out[11]: False

In [12]: adjacency_fastparquet.index.codes
Out[12]: 
[array([   0,    0,    0, ..., 3794, 3794, 3794]),
 array([   0,    1,    2, ..., 3792, 2300, 3406])]

In [13]: adjacency_pyarrow.index.codes
Out[13]: FrozenList([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...], [3, 4, 14, 26, 28, 55, 59, 61, 78, 100, 104, 108, 115, 129, 131, 132, 138, 139, 150, 153, 155, 157, 158, 159, 160, 166, 198, 199, 204, 243, 245, 288, 290, 292, 295, 297, 298, 300, 302, 304, 430, 457, 461, 464, 465, 556, 559, 560, 561, 562, 563, 564, 565, 566, 568, 569, 601, 604, 607, 623, 625, 628, 629, 658, 688, 689, 690, 691, 692, 693, 696, 743, 744, 747, 752, 757, 761, 765, 767, 787, 841, 849, 853, 857, 864, 906, 916, 923, 924, 1120, 1124, 1164, 1282, 1291, 1353, 1357, 1565, 1566, 1572, 1625, ...]])

Therefore, my trust in the pandas conversion to sparse is a bit lower than it was before, because everything else looks exactly the same. E.g.:

In [14]: (graph_pyarrow.unique_ids == graph_fastparquet.unique_ids).all()
Out[14]: True
martinfleis commented 7 months ago

The sparse representation seems to be inherently dependent on the way the MultiIndex was originally built, which is something we can't always control, hence we should figure out a more robust way of conversion, imo.

knaaptime commented 7 months ago

Well, it seems like the conversion to sparse works OK, but the creation of the MultiIndex is different between the two readers? Once you're moving from DataFrame to sparse, the parquet backend is irrelevant.

So it seems there's something different about the way the codes are generated for the MultiIndex (probably something to do with the way fastparquet tries to parallelize reads?).

If we don't depend on the index, then we need to depend on the read order of the rows (since we default to a range index when no id column is available), so can we guarantee the rows always come back in the same order between the two?

knaaptime commented 7 months ago

The sparse representation seems to be inherently dependent on the way the MultiIndex was originally built, which is something we can't always control, hence we should figure out a more robust way of conversion, imo.

I'm game to move to whatever works best, but I'm not sure we have the real source of the problem yet. Even with the current W architecture, we're dependent on pandas indices of some kind, so if those don't come back in the same order after IO, then we're in trouble. Even if you had your own graph ordering that doesn't come from a range index, we expect the order to be preserved after IO.

...this is also why GAL and GWT exist in the first place. In the early days you couldn't guarantee ordering or index placement, so those files ended up defining a strict spec. I raise that only to say that graphs are special, and we might be forced to make some stricter decisions about how they are stored.

If this does end up being an IO issue, I wonder if there's something we could do in the parquet metadata that's specific to a graph? We could also reset the index on write so that you just have regular columns, and make those a MultiIndex on read?
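The "reset the index on write" idea can be sketched as an in-memory round trip; the column names (`focal`, `neighbor`, `weight`) are illustrative assumptions here, not an actual storage spec:

```python
import pandas as pd

# A small adjacency Series with a (focal, neighbor) MultiIndex.
adjacency = pd.Series(
    [0.5, 0.5, 1.0],
    index=pd.MultiIndex.from_arrays(
        [[0, 0, 1], [1, 2, 0]], names=["focal", "neighbor"]
    ),
    name="weight",
)

# On write: store plain columns, so no MultiIndex encoding hits the file.
flat = adjacency.reset_index()

# On read: rebuild the MultiIndex with one deterministic pandas call,
# so the codes are always generated the same way regardless of engine.
roundtripped = flat.set_index(["focal", "neighbor"])["weight"]
assert roundtripped.equals(adjacency)
```

Because the index is reconstructed by the same pandas call on every read, the internal codes no longer depend on which parquet engine produced the columns.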

knaaptime commented 7 months ago

Even with the current W architecture, we're dependent on pandas indices of some kind, so if those don't come back in the same order after IO, then we're in trouble. Even if you had your own graph ordering that doesn't come from a range index, we expect the order to be preserved after IO.

But also... even if your Graph is static, if you redo your analysis and your GeoDataFrame comes back reordered from fastparquet, it won't be aligned with the data any longer, so I think we need to flag the complexities of using a stored Graph.

martinfleis commented 7 months ago

No, no, this is a misunderstanding. The order of the index is equal in both cases; fastparquet does not reorder it. What is affected is some underlying pandas representation that influences the conversion to sparse. It could also be a bug in pandas that we just found; I need to look further. But I'd be more comfortable if I knew how the conversion to sparse works and had full control over it, which is not the case with the current method.
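For reference, a conversion that resolves index labels instead of reusing whatever codes the reader produced could look like the sketch below. This is not libpysal's actual implementation; it assumes every neighbor id also appears among the focal ids:

```python
import numpy as np
import pandas as pd
from scipy import sparse


def adjacency_to_sparse(adjacency: pd.Series) -> sparse.coo_array:
    """Build COO coordinates from the index labels, not MultiIndex.codes,
    so the result is independent of how the MultiIndex was constructed.

    Assumes every neighbor id also appears among the focal ids.
    """
    ids = adjacency.index.get_level_values(0).unique()
    # get_indexer maps each label to its position among the unique ids,
    # giving the same coordinates no matter how the index was encoded.
    focal = ids.get_indexer(adjacency.index.get_level_values(0))
    neighbor = ids.get_indexer(adjacency.index.get_level_values(1))
    return sparse.coo_array(
        (adjacency.to_numpy(), (focal, neighbor)),
        shape=(len(ids), len(ids)),
    )
```

Because `get_indexer` works on labels, two equal Series whose MultiIndexes were built differently yield identical matrices, at the cost of an extra lookup per level.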

martinfleis commented 7 months ago

Using this custom code returns exactly the same behaviour as above:

# method body of Graph.sparse; `sparse` here is assumed to be scipy.sparse
return sparse.coo_array(
    (
        self._adjacency,
        (
            self._adjacency.index.codes[0],
            self._adjacency.index.codes[1],
        ),
    ),
    shape=(self.n, self.n),
)

What fixes it, though, is passing is_sorted=False, ensuring we don't skip our internal treatment of the index and the canonical order. I just thought that since I know where this adjacency list is coming from, I could bypass sorting here, but that doesn't seem to be the case.

My suggestion is to expand the docstring of Graph.__init__ to say that is_sorted shall not be used unless you are certain that the index treatment has already been applied to the pandas object you are passing, and to be generally cautious about using is_sorted=True in the internal codebase. With pyarrow, IO with is_sorted=True seems to be fine, and the rest of the places where is_sorted=True is used are also okay, imo.

Note that the issue shown above was not caused by libpysal's functionality (e.g. our IO) but by my attempt to go around it, to be able to deserialise a graph in WebAssembly, which does not have pyarrow but does have fastparquet. It seems that the way to do that is to trigger sorting.