pysal / libpysal

Core components of Python Spatial Analysis Library
http://pysal.org/libpysal
Other
249 stars 79 forks source link

Performant weights IO #350

Open martinfleis opened 3 years ago

martinfleis commented 3 years ago

I am trying to figure out what is the best way to store W in a file and I am not happy with either of those I tried.

  1. using gal and W.to_file()
    • This is nicely performant in both reading and writing, but it does not preserve dtype. Which means that my integer values are read as strings.
  2. using parquet to store adjlist
    • This is way slower than gal in both reading and writing (see below).

Do you have an idea what are the other options, which are both performant and preserve dtype? If they can preserve islands, that is a plus (but not necessary atm).

See the performance comparison below. df has about 110k polygons.

Screenshot 2020-10-12 at 14 01 47

Would it make sense to try to serialise W into a custom parquet file, instead of going through adjlist? Not sure how, just an idea.

ljwolf commented 3 years ago

Hey, yeah, very good question. We should have native support for .mtx format, which may not convert the dtype? And, you're referring to the fact that the identifier dtype changes, right?

The to/from_adjlist constructors/destructors could definitely be optimized directly; I'd intended them being single-use for an analysis and the performance characteristics of pandas definitely has changed since they were written... I also think w.sparse.nonzero() and w.sparse.data are sufficient to construct the adjacency list, and that may be faster than the set operations used currently.

Alternatively, what about using the scipy.io.mmwrite or scipy.sparse.save_npz to save the sparse matrix directly?

martinfleis commented 3 years ago

And, you're referring to the fact that the identifier dtype changes, right?

Yes. w.neighbors[0] does not work after gal read, you have to pass it as a string w.neighbors['0'].

The winner seems to be npz! It is the fastest one to save, by far the smallest one and reading is a bit slower than gal but better than other options.

I think that we should have built-in IO for these, but maybe npz could be preferred over mtx, considering the results below.

Screenshot 2020-10-12 at 16 40 49

ljwolf commented 3 years ago

Great! Yes, would be very easy to bolt that onto the from_file() based on the file path or format argument, but adding it into the actual libpysal.io.FileIO.FileIO metaclass may be challenging.

In my notes, this has come up before in developer calls & a SciPy sprint back in 2015, so a divert-read/write-to-scipy contribution would be straight forward and welcomed, I think!

martinfleis commented 3 years ago

An update on this. I've found an issue with IO based on scipy sparse array. When we create sparse array here it does not necessarily preserve dtype.

https://github.com/pysal/libpysal/blob/533229c2f5ae20fa1bda57dc723195719752a047/libpysal/weights/weights.py#L406

So in order to make it work as intended, we have to infer the dtype of ids and pass it to sparse constructor. But that breaks other things, so this may be a bit more complicated that a simple scipy.sparse.save_npz(path, self.sparse) and w = WSP(scipy.sparse.load_npz(path)).to_W() as I have just done here: https://github.com/pysal/libpysal/compare/master...martinfleis:perf_io