pysal / segregation

Segregation Measurement, Inferential Statistics, and Decomposition Analysis
https://pysal.org/segregation/
BSD 3-Clause "New" or "Revised" License

add KLDivergence class #192

Closed knaaptime closed 2 years ago

knaaptime commented 2 years ago

this PR adds the enhancements I requested over in #185. Thanks a ton for putting this together @noahbouchier!!

Aside from the formatting, there is one substantive change: instead of returning the matrix of Origin: [statistic, distance], it now returns a dataframe of the KL divergence coefficients (the sum of the KL profile values for each unit).

The data aren't exactly the same, so no thorough testing just yet, but a quick look suggests we've got the same visual results (left) as the paper (right). One thing to note here: we will want to add the option to normalize score values as they do in the original paper. Shouldn't be too hard, but it will require us to write another function to determine the scaling factor.
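For reference, the per-unit coefficient described above can be sketched as follows. This is a minimal illustration with hypothetical helper names (`kl_divergence`, `kl_coefficient`), not the PR's actual implementation:

```python
import math

def kl_divergence(q, r, eps=1e-12):
    """D(q || r) between two discrete distributions given as lists of shares."""
    return sum(qi * math.log((qi + eps) / (ri + eps)) for qi, ri in zip(q, r))

def kl_coefficient(profile_divergences):
    """Collapse one unit's KL profile into a single coefficient by summing."""
    return sum(profile_divergences)

# toy example: a unit's local composition vs. the citywide composition
local_shares = [0.7, 0.3]
city_shares = [0.5, 0.5]
print(round(kl_divergence(local_shares, city_shares), 4))  # → 0.0823
```

A divergence of zero means the unit's composition matches the citywide composition exactly; summing across the profile gives the unnormalized coefficient discussed above.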

[image: side-by-side comparison of our results (left) and the original paper's (right)]

codecov[bot] commented 2 years ago

Codecov Report

Merging #192 (4f03ea5) into master (deae25c) will decrease coverage by 0.07%. The diff coverage is 71.69%.


@@            Coverage Diff             @@
##           master     #192      +/-   ##
==========================================
- Coverage   62.43%   62.35%   -0.08%     
==========================================
  Files         119      123       +4     
  Lines        4784     4928     +144     
==========================================
+ Hits         2987     3073      +86     
- Misses       1797     1855      +58     
| Impacted Files | Coverage Δ |
|---|---|
| segregation/batch/batch_compute.py | 90.36% <ø> (ø) |
| segregation/dynamics/__init__.py | 100.00% <ø> (ø) |
| segregation/local/local_multi_diversity.py | 95.23% <0.00%> (ø) |
| segregation/singlegroup/dissim.py | 96.29% <ø> (ø) |
| segregation/singlegroup/modified_dissim.py | 67.27% <26.31%> (-23.64%) :arrow_down: |
| segregation/singlegroup/modified_gini.py | 69.23% <26.31%> (-25.90%) :arrow_down: |
| segregation/network/network.py | 64.93% <60.71%> (+39.35%) :arrow_up: |
| segregation/dynamics/divergence_profile.py | 82.35% <82.35%> (ø) |
| segregation/_base.py | 82.47% <85.71%> (ø) |
| segregation/local/local_distortion.py | 94.73% <94.73%> (ø) |
| ... and 14 more | |


knaaptime commented 2 years ago

we will probably need to think through the terminology a bit:

I think the _kl_divergence_profile @noahbouchier wrote gets us 90% of the way to each of those measures; it's just a matter of how we summarize and reshape the output for these different measures
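The summarize/reshape step might look something like this. The column names and values here are purely illustrative, assuming the profile comes back in long form with one row per (unit, scope) pair:

```python
import pandas as pd

# hypothetical long-form profile output: one row per (unit, nearest-neighbor ring)
profile = pd.DataFrame({
    "unit": [0, 0, 0, 1, 1, 1],
    "population_covered": [0.33, 0.66, 1.0] * 2,
    "divergence": [0.90, 0.50, 0.10, 0.60, 0.30, 0.05],
})

# a local measure: one coefficient per unit (sum the profile values)
local_scores = profile.groupby("unit")["divergence"].sum()

# a global measure: aggregate the local coefficients across the study area
global_score = local_scores.sum()
```

Different measures would then be different aggregations of the same long-form table, which is why the profile function gets most of the way there.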

ljwolf commented 2 years ago

Would it be cool to keep a more scikit-oriented API with the metric argument? This was the intent behind specifying the metric argument as we did in GSoC. We can extend the scikit-style metric argument to include a special "network" option, too, and treat it as a special case, the same way scikit already handles metric='precomputed'...
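A rough sketch of that dispatch pattern, with hypothetical names (this is not scikit-learn's or this package's actual code, just the convention being proposed):

```python
import math

def pairwise_distances(coords, metric="euclidean", precomputed=None):
    """Hypothetical scikit-style `metric` dispatcher (illustrative only)."""
    if metric == "precomputed":
        # caller supplies a ready-made distance matrix
        return precomputed
    if metric == "network":
        # special case: distances would come from a network shortest-path query
        raise NotImplementedError("network distances are a special case")
    if metric == "euclidean":
        return [[math.dist(a, b) for b in coords] for a in coords]
    raise ValueError(f"unsupported metric: {metric!r}")

points = [(0.0, 0.0), (3.0, 4.0)]
print(pairwise_distances(points)[0][1])  # → 5.0
```

The appeal of this design is that every profile function takes the same `metric` argument, and "precomputed" and "network" slot in as named special cases rather than separate code paths in the public API.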

Let me know what you think! Happy to implement.

knaaptime commented 2 years ago

Aside from the change to allow precomputed distance matrices, I've finished all the pieces I described above. We've now got

The important remaining work is to write a function that simulates the "maximum possible segregation" given the study data. Until we have that, the GlobalDistortion index isn't scaled, so it doesn't mean much

knaaptime commented 2 years ago

@jGaboardi if you've got a spare minute, any idea what's going on with these Windows tests? They seem to be stalling about a quarter of the way through for no apparent reason.

jGaboardi commented 2 years ago

It may be a similar issue to what's going on in spaghetti --> pysal/spaghetti#666. (But also, maybe not...)

knaaptime commented 2 years ago

OK, so after some more thought, I think the only remaining question here is how we standardize the API. A "standard" segregation profile works by increasing the threshold in a distance-based weights object (regardless of how many neighbors), then applying distance decay and recomputing the segregation statistic. The KL profile works by increasing the k in a KNN weights object (regardless of distance), then applying distance decay to the neighbors and recalculating the statistic. In both cases we're just expanding the spatial scope, so is it worth abstracting out the shared logic that both functions can consume?
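That shared logic could be factored into something like the skeleton below. This is a hypothetical abstraction, not the package's API; the two profiles would differ only in what `scopes` holds (distance thresholds vs. k values) and in how `build_weights` expands the neighborhood:

```python
def segregation_profile(statistic, scopes, build_weights, decay=None):
    """Sketch of a shared profile skeleton (hypothetical, not the real API).

    At each step the spatial scope is expanded (distance-band W or KNN W),
    decay is optionally applied, and the statistic is recomputed.
    """
    results = {}
    for scope in scopes:
        neighborhood = build_weights(scope)
        if decay is not None:
            neighborhood = decay(neighborhood, scope)
        results[scope] = statistic(neighborhood)
    return results

# toy usage: "weights" are just neighbor lists, and the statistic counts them
toy = segregation_profile(len, [1, 2, 3], lambda k: list(range(k)))
print(toy)  # → {1: 1, 2: 2, 3: 3}
```

Even if the concrete weights construction stays separate, a skeleton like this would force both profiles to agree on argument names and return shape.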

Even if the two functions don't share code, it would be good to standardize the arguments. The current segregation profile already requires a distances argument, which is a list of threshold distances passed to a libpysal.W (or a pandana.Network aggregation query). In the approach described above, distances would be either the matrix of pairwise distances between observations if metric=precomputed, or a pandana.Network if metric=network (which would just create the distance matrix internally, so you'd also need to provide the pandana.Network object)

I think it's confusing for one *profile function to take distances as a list of thresholds and another to take distances as a pairwise distance matrix, so what about:

sjsrey commented 2 years ago

Does this close #142 or is there more to do there?

knaaptime commented 2 years ago

I'd say this resolves #142, though we do still need to implement the scaling factor computation