DOK sparse matrix format is a bottleneck

martinfleis commented 2 years ago

I was struggling to do a large interpolation with categorical variables and realised, that the DOK sparse matrix format we use in _area_tables_binning is the culprit. When I convert it to CSR (or almost anything else), I get speedup of several orders of magnitude in large joins.

Check the gist with the benchmark code: https://gist.github.com/martinfleis/7f68536a0fe6f3f30f1b834484d9cbab

Results:

DOK: 679 ms ± 55.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CSR: 14.1 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CSC: 15.9 ms ± 74.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
LIL: 35.3 ms ± 556 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The issue is more significant with categorical variables due to row-wise masking but even without it, there's still huge difference:

DOK (no categorical): 118 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
CSR (no categorical): 12.3 ms ± 468 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With large geodataframes, I wasn't able to use the original DOK-based join even when I left it overnight. With CSR, the same task is done in under a second.

So, my question - did I miss anything and there is a good reason to use DOK? Can I switch it to CSR instead?

I also realised that we should move the weights creation behind if intensive_variables etc as we don't always need them.

knaaptime commented 2 years ago

interesting. I'm almost positive we started with CSR in the original implementation, then moved over to DOK for some reason when the numba-based version was devised. I think the switch was based on a perceived necessity at the time, so I afaict, it would be great to adopt this speedup

martinfleis commented 2 years ago

I'll make a PR.

pysal / tobler

DOK sparse matrix format is a bottleneck #159