pysal / pointpats

Planar Point Pattern Analysis in PySAL
https://pysal.org/pointpats/
BSD 3-Clause "New" or "Revised" License
83 stars 26 forks source link

Kenv is very memory inefficient #52

Closed ljwolf closed 3 years ago

ljwolf commented 4 years ago

Running Kenv on the data as in #51, I exhaust 16GB of ram & peg my CPU at n=419.

I believe this is because of the memory pressure exerted in kdtree.query_pairs

https://github.com/pysal/pointpats/blob/bceba7f70deb2a27ac1cf08fe5c7013926714431/pointpats/distance_statistics.py#L486

All told, this is a very inefficient way to compute the K function. Each iteration has to re-discover the neighbors from every previous iteration. At the maximum distance, you have to repeatedly re-discover nearly the entire dataset. Doing this for simulations means this is done for thousands of simulations.

The documentation for scipy.spatial.KDTree explicitly warns against using it for this purpose:

The tree also supports all-neighbors queries, both with arrays of points and with other kd-trees. These do use a reasonably efficient algorithm, but the kd-tree is not necessarily the best data structure for this sort of calculation.

A better algorithm would:

  1. compute the distance matrix for the input. (too big for memory? bootstrap it. also, doesn't even have to be the squareform distance matrix)
  2. find the number of distances less than the cutoff ((distances < d).sum(axis=1) using numpy vectorization & broadcasting) and multiply by 2 (?) to capture pairs
jGaboardi commented 4 years ago

FWIW, Here is the implementation on the K Function in spaghetti. The original formulation predates the pointpats submodule and I am currently working on refactoring for correctness and speedups.

ljwolf commented 4 years ago

I think that implementation is exactly what I'm suggesting above.

https://github.com/pysal/spaghetti/blob/3f14ddaa2fe9fe061adbacf17dea6a197dd73f96/spaghetti/analysis.py#L221

Indeed, we might want to put the functional forms in esda and then, if useful for these custom point pattern classes in R^d or a network domain, we use them directly from `esda.

jGaboardi commented 4 years ago

Also, got a pretty speed in works by performing step checking in a list comprehension.