Closed martinfleis closed 2 years ago
Interesting. I'm almost positive we started with CSR in the original implementation, then moved over to DOK for some reason when the numba-based version was devised. I think the switch was based on a perceived necessity at the time, so afaict it would be great to adopt this speedup.
I'll make a PR.
I was struggling to do a large interpolation with categorical variables and realised that the DOK sparse matrix format we use in `_area_tables_binning` is the culprit. When I convert it to CSR (or almost anything else), I get a speedup of several orders of magnitude in large joins. Check the gist with the benchmark code: https://gist.github.com/martinfleis/7f68536a0fe6f3f30f1b834484d9cbab
Results:
The issue is more significant with categorical variables due to row-wise masking, but even without it there's still a huge difference:
With large geodataframes, I wasn't able to use the original DOK-based join even when I left it overnight. With CSR, the same task is done in under a second.
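To illustrate the pattern, here is a minimal sketch of the row-wise masking with the table converted to CSR once up front. It is not the actual tobler code: the sizes, the random fill, the `labels` array and the `area_per_category` helper are all made up for the example. DOK is kept only for the entry-by-entry construction.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_source, n_target = 200, 50  # hypothetical sizes

# DOK is convenient for filling entries one at a time.
table = sparse.dok_matrix((n_source, n_target), dtype=np.float64)
rows = rng.integers(0, n_source, size=500)
cols = rng.integers(0, n_target, size=500)
for i, j in zip(rows, cols):
    table[i, j] = rng.random()  # stand-in for an intersection area

# Convert once, before any row-wise work.
table_csr = table.tocsr()

# Hypothetical categorical column: one label per source row.
labels = rng.choice(np.array(["a", "b", "c"]), size=n_source)


def area_per_category(csr_table, labels):
    """Sum the area each category contributes to every target column."""
    out = {}
    for value in np.unique(labels):
        mask = labels == value  # boolean row mask
        out[str(value)] = np.asarray(csr_table[mask].sum(axis=0)).ravel()
    return out


result = area_per_category(table_csr, labels)
```

CSR stores each row contiguously, so selecting a subset of rows is cheap; that is the operation repeated once per category here.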
So, my question: did I miss anything, and is there a good reason to use DOK? Can I switch it to CSR instead?
I also realised that we should move the `weights` creation behind `if intensive_variables` etc. as we don't always need them.
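Roughly what I mean, as a sketch only: `interpolate_sketch` and `row_standardize` are hypothetical names, the table is assumed to already be CSR, and the exact normalisation each branch needs in the real function may differ. The point is just that `weights` is built inside the branch that uses it.

```python
import numpy as np
from scipy import sparse


def row_standardize(table):
    """Scale each row of a sparse table to sum to 1 (empty rows stay 0)."""
    row_sums = np.asarray(table.sum(axis=1)).ravel()
    inv = 1.0 / (row_sums + (row_sums == 0))  # avoid division by zero
    return sparse.diags(inv) @ table


def interpolate_sketch(table, extensive_variables=None, categorical_variables=None):
    """Hypothetical driver: weights are created only where they are used."""
    out = {}
    if extensive_variables:
        weights = row_standardize(table)  # built here, not at the top
        for name, values in extensive_variables.items():
            out[name] = weights.T @ np.asarray(values, dtype=float)
    if categorical_variables:
        # This branch works on the raw table and never needs weights.
        for name, labels in categorical_variables.items():
            labels = np.asarray(labels)
            for value in np.unique(labels):
                mask = labels == value
                out[f"{name}_{value}"] = np.asarray(table[mask].sum(axis=0)).ravel()
    return out
```

A categorical-only call then skips the row standardisation entirely.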