Open martinfleis opened 3 years ago
Hey, yeah, very good question. We should have native support for .mtx
format, which may not convert the dtype? And, you're referring to the fact that the identifier dtype changes, right?
The to/from_adjlist
constructors/destructors could definitely be optimized directly; I'd intended them being single-use for an analysis and the performance characteristics of pandas definitely has changed since they were written... I also think w.sparse.nonzero()
and w.sparse.data
are sufficient to construct the adjacency list, and that may be faster than the set operations used currently.
Alternatively, what about using the scipy.io.mmwrite
or scipy.sparse.save_npz
to save the sparse matrix directly?
And, you're referring to the fact that the identifier dtype changes, right?
Yes. w.neighbors[0]
does not work after gal
read, you have to pass it as a string w.neighbors['0']
.
The winner seems to be npz
! It is the fastest one to save, by far the smallest one and reading is a bit slower than gal
but better than other options.
I think that we should have built-in IO for these, but maybe npz
could be preferred over mtx
, considering the results below.
Great! Yes, would be very easy to bolt that onto the from_file()
based on the file path or format argument, but adding it into the actual libpysal.io.FileIO.FileIO
metaclass may be challenging.
In my notes, this has come up before in developer calls & a SciPy sprint back in 2015, so a divert-read/write-to-scipy contribution would be straight forward and welcomed, I think!
An update on this. I've found an issue with IO based on scipy sparse array. When we create sparse array here it does not necessarily preserve dtype
.
So in order to make it work as intended, we have to infer the dtype of ids and pass it to sparse constructor. But that breaks other things, so this may be a bit more complicated that a simple scipy.sparse.save_npz(path, self.sparse)
and w = WSP(scipy.sparse.load_npz(path)).to_W()
as I have just done here: https://github.com/pysal/libpysal/compare/master...martinfleis:perf_io
I am trying to figure out what is the best way to store
W
in a file and I am not happy with either of those I tried.gal
andW.to_file()
parquet
to store adjlistgal
in both reading and writing (see below).Do you have an idea what are the other options, which are both performant and preserve dtype? If they can preserve islands, that is a plus (but not necessary atm).
See the performance comparison below.
df
has about 110k polygons.Would it make sense to try to serialise W into a custom parquet file, instead of going through adjlist? Not sure how, just an idea.