pysal / tobler

Spatial interpolation, Dasymetric Mapping, & Change of Support
https://pysal.org/tobler
BSD 3-Clause "New" or "Revised" License

Add documentation on the effects of duplicates in the source geometries #182

Open darribas opened 11 months ago

darribas commented 11 months ago

I don't think this is necessarily a bug, but it is something that caught me off guard until I thought it through, and could trip up other users, so maybe the solution is adding a bit of documentation.

In areal interpolation (not sure about other cases), if the source geometries have duplicates or overlaps, the results are wrong. At least for categoricals (I'm not sure what would happen to intensive/extensive, but I think something similar), some shares add up to more than 1. My sense is that this happens because more than one source geometry covers the same patch of land, so that patch gets counted more than once. Again, this is what the method is meant to do and, arguably, a strange case (it's unusual to have overlapping/duplicate source geometries), but maybe it's worth adding a line to the source_df documentation?

https://github.com/pysal/tobler/blob/df0cbc6821cdf8c84c7f8b3dcfd1d60eebbb252e/tobler/area_weighted/area_interpolate.py#L221
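To make the double counting concrete, here is a minimal sketch in plain Python. It is not tobler's code: it uses axis-aligned rectangles instead of real geometries, and it assumes a categorical share is simply the intersected area divided by the target's area. The helper names (`overlap_area`, `categorical_shares`) and the category labels are made up for illustration.

```python
# Rectangles are (xmin, ymin, xmax, ymax); they stand in for real polygons.

def area(r):
    """Area of an axis-aligned rectangle (0 if it is empty or inverted)."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter)


def categorical_shares(sources, target):
    """Assumed rule: share of a category = intersected area / target area."""
    shares = {}
    for rect, category in sources:
        share = overlap_area(rect, target) / area(target)
        shares[category] = shares.get(category, 0.0) + share
    return shares


target = (0, 0, 2, 1)
clean = [((0, 0, 1, 1), "urban"), ((1, 0, 2, 1), "rural")]
# Same layer, but the "urban" polygon appears twice by mistake.
with_duplicate = clean + [((0, 0, 1, 1), "urban")]

clean_shares = categorical_shares(clean, target)
dup_shares = categorical_shares(with_duplicate, target)

print(clean_shares, sum(clean_shares.values()))  # shares sum to 1.0
print(dup_shares, sum(dup_shares.values()))      # shares sum to 1.5
```

With the duplicate in place, the "urban" share alone reaches 1.0 and the shares no longer sum to 1, which is the symptom described above.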

What do you think?

knaaptime commented 10 months ago

> In areal interpolation (not sure about other cases), if the source geometries have duplicates or overlaps, the results are wrong.

Not quite. The validity depends on the question. If you've got data on, say, overlapping school districts (some private, some public) and you're sending average test scores to a smaller geometry, then the target geometry gets the area-weighted average of the overlapping polys that cover it (which is what you want in this case). If that small poly is covered entirely by two different overlapping schools, one private and one public, then the target gets 50/50 shares.
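The 50/50 case can be checked with a small sketch. It is an illustration in plain Python, not tobler's implementation: it assumes each source's weight is its intersected area divided by the total intersected area inside the target, and the rectangles, scores, and function names are invented for the example.

```python
# Rectangles are (xmin, ymin, xmax, ymax); they stand in for real polygons.

def area(r):
    """Area of an axis-aligned rectangle (0 if it is empty or inverted)."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter)


def intensive_estimate(sources, target):
    """Area-weighted average; weights are normalised within the target."""
    pieces = [(overlap_area(rect, target), value) for rect, value in sources]
    total = sum(a for a, _ in pieces)  # assumes the target is covered at all
    return sum(a * v for a, v in pieces) / total


target = (0, 0, 1, 1)
public_district = ((-1, -1, 3, 3), 60.0)  # average score 60, covers the target
private_district = ((0, 0, 2, 2), 80.0)   # average score 80, also covers it

estimate = intensive_estimate([public_district, private_district], target)
print(estimate)  # 70.0, i.e. a 50/50 blend of the two districts
```

Both districts cover the target completely, so each contributes half of the weight regardless of how large the districts are outside the target.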

If you've got an extensive variable with overlapping sources (and those overlaps are conceptually valid in the source data), then the overlapping sum is correct.
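A sketch of the extensive case, again in plain Python rather than tobler itself. It assumes each source hands the target a fraction of its total equal to the share of the source's own area that falls inside the target; the layers and counts are hypothetical.

```python
# Rectangles are (xmin, ymin, xmax, ymax); they stand in for real polygons.

def area(r):
    """Area of an axis-aligned rectangle (0 if it is empty or inverted)."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter)


def extensive_estimate(sources, target):
    """Sum of each source's total, scaled by the share of it inside the target."""
    return sum(v * overlap_area(r, target) / area(r) for r, v in sources)


target = (0, 0, 1, 1)
public_pupils = ((0, 0, 2, 1), 100.0)  # half of this district lies in the target
private_pupils = ((0, 0, 1, 2), 40.0)  # half of this district lies in the target

# Two genuinely different populations that overlap in space: summing is valid.
valid_sum = extensive_estimate([public_pupils, private_pupils], target)

# The same district listed twice by mistake: its pupils are counted twice.
inflated_sum = extensive_estimate(
    [public_pupils, public_pupils, private_pupils], target
)

print(valid_sum)     # 70.0
print(inflated_sum)  # 120.0
```

The arithmetic is identical in both calls; whether the result is right depends only on whether the overlap means something in the source data.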

Non-planar geometries are something that can obviously surface a lot in interpolation problems, so I've thought a few times about including some sort of check, but ultimately non-planarity is also a basic data check and something the user needs to understand about their data, so I've landed on the idea that folks should use https://github.com/sjsrey/geoplanar when they need to check their data.
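For a rough idea of what such a pre-check looks for, here is a minimal sketch in plain Python. It does not use or reproduce geoplanar's API; `find_overlaps` is a hypothetical helper that works on axis-aligned rectangles and flags every pair sharing positive area, so polygons that only touch along an edge are not reported.

```python
from itertools import combinations

# Rectangles are (xmin, ymin, xmax, ymax); they stand in for real polygons.

def area(r):
    """Area of an axis-aligned rectangle (0 if it is empty or inverted)."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter)


def find_overlaps(rects):
    """Return index pairs (i, j), i < j, whose rectangles share positive area."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(rects), 2)
        if overlap_area(a, b) > 0
    ]


sources = [
    (0, 0, 1, 1),
    (1, 0, 2, 1),          # touches the first along an edge only
    (0, 0, 1, 1),          # exact duplicate of the first
    (0.5, 0.5, 1.5, 1.5),  # partially overlaps all of the others
]

overlaps = find_overlaps(sources)
print(overlaps)  # [(0, 2), (0, 3), (1, 3), (2, 3)]
```

The pairwise loop is quadratic in the number of geometries, which is fine for a sanity check on a small layer; for real data a spatial index or a dedicated tool is the better route.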