Closed johnziebro closed 2 years ago
In attempting to create a reproducable example I ran into another issue.
ValueError: Must have equal len keys and value when setting with an iterable
import requests
import geopandas as gpd
from tobler.dasymetric import extract_raster_features, masked_area_interpolate
"""
# commented out as uneeded for example since data is loaded from gist
def miles_radius_by_coords(x_lon:float, y_lat:float, miles:float, gdf:gpd.geodataframe.GeoDataFrame, reset_index:bool=False) -> List[gpd.geodataframe.GeoDataFrame]:
#Provided a point in epsg:4326, buffers by miles and return intersecting polygons of provided gdf.
gdf_crs = gdf.crs
gdf_columns = gdf.columns.tolist()
window = gpd.GeoDataFrame(geometry=gpd.points_from_xy(x=[x_lon], y=[y_lat], crs="epsg:4326").to_crs("esri:102008").buffer(miles * 1609.344))
gdf = gdf.to_crs("esri:102008").sjoin(window, how="inner", predicate="intersects")[gdf_columns].to_crs(gdf_crs)
return gdf.reset_index(drop=True) if reset_index else gdf
# determine nlcd crs
mask = "./nlcd_2019_land_cover_l48_20210604.img"
nlcd_crs = extract_raster_features(target, mask, pixel_values=[21, 22, 23, 24], collapse_values=True).crs
# subset geodataframes by 2 mile radius from location, setting crs to match mask's
miles = 2
coords = (-84.558735, 39.230207)
source = miles_radius_by_coords(*coords, miles, score_geo, reset_index=True).to_crs(nlcd_crs)
target = miles_radius_by_coords(*coords, miles, gdf_select_zctas, reset_index=True).to_crs(nlcd_crs)"""
# designate geotiff path
mask = "./nlcd_2019_land_cover_l48_20210604.img"
# determine nlcd crs
mask = "./nlcd_2019_land_cover_l48_20210604.img"
nlcd_crs = extract_raster_features(target, mask, pixel_values=[21, 22, 23, 24], collapse_values=True).crs
# load data from gist json features
source_url = "https://gist.githubusercontent.com/johnziebro/d1bc6c59f55a8ba5bf6ead4cd3d2d5ee/raw/3e8370228056f05bd5515ea93f1fdaf8af62f778/source"
target_url = "https://gist.githubusercontent.com/johnziebro/0531432faa14d51073674d471ef99d95/raw/a80a8b2a61fffa8a56f2d4fb1721e2278a3dc41c/target"
source = gpd.GeoDataFrame.from_features(requests.get(source_url).json(), crs=nlcd_crs)
target = gpd.GeoDataFrame.from_features(requests.get(target_url).json(), crs=nlcd_crs)
# sanity check
display(source, source.isna().sum(), source.crs, target, target.isna().sum(), target.crs)
source.plot()
target.plot()
Results display as expected
# define intensive and extensive variables
# intensive: do not depend on sample size, ex: percentages
# extensive: depend on sample size: ex: population
intensive = [
"ALAND_SMI_S",
"POP_DENSITY_SMI_S",
"SCORE",
"SCORE_POP",
"SCORE_VEH",
"SCORE_HOME",
"SCORE_FIN",
]
extensive = ["ALAND", "AWATER", "ALAND_SMI", "POP_DENSITY_SMI"]
# interpolate
results = masked_area_interpolate(raster=mask,
source_df=source,
target_df=target,
intensive_variables=intensive,
extensive_variables=extensive)
results
thanks for raising this. Will need to look into the first issue a bit, but my first guess would be possibly something strange with the image file you're using. The second issue is related to your CRS (there is none on your input files). Updating your example to read a version of NLCD stored in an s3 bucket (different year though) I get results. In your original code, the display(target.crs)
was showing None, but it was easy to miss with the other results being printed. A quick look at your data looks like maybe its cincinnati? so taking a stab at a crs of 3735 I get the expected results
import requests
import geopandas as gpd
from tobler.dasymetric import extract_raster_features, masked_area_interpolate
# designate geotiff path
mask = "https://spatial-ucr.s3.amazonaws.com/nlcd/landcover/nlcd_landcover_2016.tif"
# load data from gist json features
target_url = "https://gist.githubusercontent.com/johnziebro/0531432faa14d51073674d471ef99d95/raw/a80a8b2a61fffa8a56f2d4fb1721e2278a3dc41c/target"
source_url = "https://gist.githubusercontent.com/johnziebro/d1bc6c59f55a8ba5bf6ead4cd3d2d5ee/raw/3e8370228056f05bd5515ea93f1fdaf8af62f778/source"
source = gpd.GeoDataFrame.from_features(requests.get(source_url).json())
target = gpd.GeoDataFrame.from_features(requests.get(target_url).json())
# sanity check
display(target.crs, source.crs) # both show None
source.plot()
target.plot()
source.crs = 3735
target.crs = 3735
# extensive: depend on sample size: ex: population
intensive = [
"ALAND_SMI_S",
"POP_DENSITY_SMI_S",
"SCORE",
"SCORE_POP",
"SCORE_VEH",
"SCORE_HOME",
"SCORE_FIN",
]
extensive = ["ALAND", "AWATER", "ALAND_SMI", "POP_DENSITY_SMI"]
# interpolate
results = masked_area_interpolate(raster=mask,
source_df=source,
target_df=target,
intensive_variables=intensive,
extensive_variables=extensive)
results
print(results)
which produces
ALAND AWATER ALAND_SMI POP_DENSITY_SMI ALAND_SMI_S \
0 1.171083e+07 1167.410256 4.521578 25933.182073 0.005023
1 3.640912e+07 382619.918149 14.057638 27836.933899 0.008014
2 1.188643e+07 12190.330131 4.589375 23381.182295 0.005216
3 1.279708e+07 5011.341463 4.940982 15570.400365 0.008926
POP_DENSITY_SMI_S SCORE SCORE_POP SCORE_VEH SCORE_HOME SCORE_FIN \
0 0.154595 0.453222 0.298343 0.162974 0.259685 0.319223
1 0.097719 0.432718 0.254347 0.202154 0.267997 0.311501
2 0.157234 0.460360 0.299620 0.214973 0.263867 0.313727
3 0.110122 0.509490 0.271063 0.237904 0.281837 0.354868
geometry
0 POLYGON ((975962.588 1856884.038, 975962.846 1...
1 POLYGON ((974574.041 1860301.828, 974571.198 1...
2 POLYGON ((973182.221 1854514.093, 973199.497 1...
3 POLYGON ((969001.099 1867908.293, 969006.635 1...
results.plot('SCORE_FIN')
also in your earlier example
# commented out as uneeded for example since data is loaded from gist
def miles_radius_by_coords(x_lon:float, y_lat:float, miles:float, gdf:gpd.geodataframe.GeoDataFrame, reset_index:bool=False) -> List[gpd.geodataframe.GeoDataFrame]:
#Provided a point in epsg:4326, buffers by miles and return intersecting polygons of provided gdf.
gdf_crs = gdf.crs
gdf_columns = gdf.columns.tolist()
window = gpd.GeoDataFrame(geometry=gpd.points_from_xy(x=[x_lon], y=[y_lat], crs="epsg:4236").to_crs("esri:102008").buffer(miles * 1609.344))
gdf = gdf.to_crs("esri:102008").sjoin(window, how="inner", predicate="intersects")[gdf_columns].to_crs(gdf_crs)
return gdf.reset_index(drop=True) if reset_index else gdf
you're using epsg 4236 but you probably want 4326? that's probably causing issues too
@knaaptime my apologies, in writing the reproducible example I introduced these crs errors which I do not believe are in my local example. Probably too much screentime in the last few days! Yes, 4326 was intended. In creating the geojson and downloading the example data from the gist, I had assumed the json had retained the crs.
The developed land file was download from https://www.mrlc.gov/data?f%5B0%5D=category%3AUrban%20Imperviousness. extract_raster_features ran successfully and generate a plot of developed areas.
I will update the CRS references above in case anyone comes acorss this later and also run the interpolation with the developed land file you sited and and try to identify exact geometry in my source and targets that may be creating the original issue.
cool, let us know how it goes. I'm with you on the screentime--and have made that accidental swap on too many occasions myself ;)
fwiw, if i download the latest version of nlcd, which looks like it's the same you're using, and provided a path to the local file:
import requests
import geopandas as gpd
from tobler.dasymetric import extract_raster_features, masked_area_interpolate
# designate geotiff path
mask = "/Users/knaaptime/Downloads/nlcd_2019_land_cover_l48_20210604/nlcd_2019_land_cover_l48_20210604.img"
# load data from gist json features
target_url = "https://gist.githubusercontent.com/johnziebro/0531432faa14d51073674d471ef99d95/raw/a80a8b2a61fffa8a56f2d4fb1721e2278a3dc41c/target"
source_url = "https://gist.githubusercontent.com/johnziebro/d1bc6c59f55a8ba5bf6ead4cd3d2d5ee/raw/3e8370228056f05bd5515ea93f1fdaf8af62f778/source"
source = gpd.GeoDataFrame.from_features(requests.get(source_url).json())
target = gpd.GeoDataFrame.from_features(requests.get(target_url).json())
source.crs = 3735
target.crs = 3735
# extensive: depend on sample size: ex: population
intensive = [
"ALAND_SMI_S",
"POP_DENSITY_SMI_S",
"SCORE",
"SCORE_POP",
"SCORE_VEH",
"SCORE_HOME",
"SCORE_FIN",
]
extensive = ["ALAND", "AWATER", "ALAND_SMI", "POP_DENSITY_SMI"]
# interpolate
results = masked_area_interpolate(raster=mask,
source_df=source,
target_df=target,
intensive_variables=intensive,
extensive_variables=extensive)
results
print(results)
I still get the results I posted above (so still not able to reproduce the error locally). Let me know what you discover in your tests
I've updated the reproducible example below. While it does not encounter the original error there are many warnings regarding nan values. What is strange is that isnan().sum() on the the source and target gdfs show no nans present. This is possibly coming from the mask?
import json
import requests
import geopandas as gpd
# type alias
GDF = gpd.geodataframe.GeoDataFrame
# constants
GPS = "epsg:4326" # WGS 84
DISTANCE = "esri:102005" # USA Contiguous Equidistant Conic, degrees
AREA = "esri:102008" # North_America_Albers_Equal_Area_Conic, meters
METERS_PER_MILE = 1609.344
# util functions
def miles_radius_by_coords(
x_lon: float, y_lat: float, miles: float, gdf: GDF, reset_index: bool = True
) -> GDF:
"""Provided a point in epsg:4326, buffers by miles and return intersecting polygons of provided gdf."""
gdf_crs = gdf.crs
gdf_columns = gdf.columns.tolist()
window = gpd.GeoDataFrame(
geometry=gpd.points_from_xy(x=[x_lon], y=[y_lat], crs=GPS)
.to_crs(DISTANCE)
.buffer(miles * METERS_PER_MILE)
)
gdf = (
gdf.to_crs(DISTANCE)
.sjoin(window, how="inner", predicate="intersects")[gdf_columns]
.to_crs(gdf_crs)
)
return gdf.reset_index(drop=True) if reset_index else gdf
def get_mask_window(source_gdf: GDF, target_gdf: GDF) -> GDF:
"""Provides a geodataframe consisting of unified geometry of the provided source and target."""
geometry = gpd.GeoSeries(pd.concat([source_gdf.geometry, target_gdf.geometry]))
return gpd.GeoDataFrame(geometry=geometry, crs=source_gdf.crs)
# designate geotiff representing developed land
mask = "/media/REDACT/36782A484452D4B2/large_downloads/nlcd_2019_land_cover_l48_20210604/nlcd_2019_land_cover_l48_20210604.img"
# capture polygons intersecting a 2 mile radius from provided coordinates
miles = 2
coords = (-84.558735, 39.230207)
source = miles_radius_by_coords(*coords, miles, score_geo).to_crs(AREA)
target = miles_radius_by_coords(*coords, miles, gdf_select_zctas).to_crs(AREA)
# export for gists
# source_json = source.to_json()
# target_json = target.to_json()
# sub gdfs from github for reproducable example
source_url = "https://gist.githubusercontent.com/johnziebro/603e81d5874d2cbb726baf19eea436a7/raw/296072368b254929bffe2aa7194bc37e3404974c/source_tobler.json"
target_url = "https://gist.githubusercontent.com/johnziebro/cd2129602cc775d82c8afa5b88ef8c1b/raw/68a647e1187cb45e48027e4f6c8eb822ead4eafa/target_tobler.json"
source = gpd.GeoDataFrame.from_features(requests.get(source_url).json(), crs=AREA)
target = gpd.GeoDataFrame.from_features(requests.get(target_url).json(), crs=AREA)
# extract mask features for sanity check
mask_window = get_mask_window(source, target)
urban = extract_raster_features(
mask_window, mask, pixel_values=[21, 22, 23, 24], collapse_values=True
).to_crs(AREA)
# info
display(
pd.DataFrame(source.isna().sum()).T,
source.shape,
source.crs.to_string(),
pd.DataFrame(target.isna().sum()).T,
target.shape,
target.crs.to_string(),
)
# plot
source.plot()
target.plot()
urban.plot()
# define intensive and extensive variables
# intensive: do not depend on sample size, ex: percentages
# extensive: depend on sample size: ex: population
intensive = [
"ALAND_SMI_S",
"POP_DENSITY_SMI_S",
"SCORE",
"SCORE_POP",
"SCORE_VEH",
"SCORE_HOME",
"SCORE_FIN",
]
extensive = ["ALAND", "AWATER", "ALAND_SMI", "POP_DENSITY_SMI"]
# interpolate
results = masked_area_interpolate(
raster=mask,
source_df=source,
target_df=target,
intensive_variables=intensive,
extensive_variables=extensive,
)
# include ZCTA column
results["ZCTA"] = target["ZCTA5CE10"]
# provide a representative point
results["REP_POINT"] = results["geometry"].apply(
lambda x: x.representative_point().coords[:][0]
)
# crs to 4326 for lat long representation
results.to_crs(4326, inplace=True)
# transform crs for familiar lon, lat
results.to_crs(GPS, inplace=True)
# figure
fig, ax = plt.subplots(figsize=(20, 11))
plt.tight_layout()
# score
results.plot(
ax=ax,
column="SCORE",
cmap="RdYlGn",
legend=True,
edgecolor="none",
alpha=0.5,
legend_kwds={"label": "Score"},
)
# contextily tiles for context
ctx.add_basemap(ax=ax, crs=results.crs.to_string(), source=ctx.providers.Stamen.Toner)
# label geometry
results.apply(
lambda x: ax.annotate(
text=x["ZCTA"],
xy=x.geometry.representative_point().coords[0],
ha="center",
size=12,
bbox=dict(boxstyle="round", fc="w")
),
axis=1,
)
# Title and Axis labels
title = f"ZCTA Score\nWithin {miles} mi of {coords[0]}, {coords[1]} - {results.crs.to_string()}"
ax.set_title(title, fontsize=18,)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
plt.show()
Running the same code, but with a radius of 10 miles instead of 2 results in the following error when running masked_area_interpolate(). Note that the gist data was not used and instead data extracted from the complete gdfs.
miles = 10
One other item of note with this setup is that the geotiff is on a secondary drive. I wonder if there could be an IO issue when extracting a large featureset from the image. Though the fact that extract_features() can run successfully and produce a plot seems to rule out it being a rasterio IO issue, see image below.
I will retry with the AWS file provided in the comments above and also move the geotif to the local directory to see if that changes anythong.
Traceback
Same ValueError as above when using the compressed nlcd_landcover_2016.tif mask loaded from a local directory. Features can be extracted and plotted, however, masked_area_interpolate() fails with
ValueError: Must have equal len keys and value when setting with an iterable
interesting, thanks! I don't know where CBR: result (after common-bits addition) is INVALID: Self-intersection at or near point...
comes from, but I think I know what's going on with ValueError: Must have equal len keys and value when setting with an iterable
two quick things:
but neither of those are leading to the error you're seeing
tl;dr: Here's a function that should get you there (and i'd like to merge this into tobler as a replacement for the current version shortly)
import geopandas as gpd
from tobler.area_weighted import area_interpolate
from tobler.dasymetric import extract_raster_features
def masked_area_interpolate(
source_df,
target_df,
raster,
codes=[21, 22, 23, 24],
extensive_variables=None,
intensive_variables=None,
categorical_variables=None,
allocate_total=True,
nodata=255,
n_jobs=-1,
):
source_df = source_df.copy()
assert not any(
source_df.index.duplicated()
), "The index of the source_df cannot contain duplicates."
# create a vector mask from the raster data
raster_mask = extract_raster_features(
source_df, raster, codes, nodata, n_jobs, collapse_values=True
)
idx_name = source_df.index.name if source_df.index.name else "index"
source_df[idx_name] = source_df.index
# clip source_df by its mask (overlay/dissolve is faster than gpd.clip here)
source_df = gpd.overlay(
source_df, raster_mask.to_crs(source_df.crs), how="intersection"
).dissolve(idx_name)
# continue with standard areal interpolation using the clipped source
interpolation = area_interpolate(
source_df,
target_df.copy(),
extensive_variables=extensive_variables,
intensive_variables=intensive_variables,
n_jobs=n_jobs,
categorical_variables=categorical_variables,
allocate_total=allocate_total,
)
return interpolation
(in my quick tests that is faster than the old version of the function anyway)
background: we wrote the old version of masked_raster_interpolate
in the beginning of the library, before pygeos existed, and it operates basically at the pixel-level. Later, we came up with a major performance enhancement by creating an intersection table ahead of time, and using that to speed up the area weighting process. In the transition to the faster version, we focused on the area_weighted
function, but i think we goofed the logic for raster masking a bit... I think what's happening in the raster tabling process is that we essentially use the raster to search for any polygons in source_df
that don't contain any of the specified pixel values, then exclude those polys from the analysis. That can lead to a situation where the tables are misaligned, so you get a mismatch in the length of the iterables.
Operating at the raster level was always hacky and a method of last-resort, so now that the tools are in place, the better path forward is to operate at the vector level. The function above does that instead.
I always find that building the tooling takes the most effort, but once you have the tools, the "house" is so much easier to build. I like the addition of the categoricals param, I don't think it was in the old version. I was going to address categoricals by copying the required Series from the target gdf to the result gdf, but this is good.
Good call on distance with the radius selection function. I've updated the comment above to use the appropriate CRS.
That works beautifully on the 10 mile radius!
I can confirm the new masked_raster_interpolate function works on my complete unfiltered geodataframes of Census Tracts and ZCTAs spanning about 10 counties with no errors in a reasonable amount of time.
Updating issue title to "masked_raster_interpolate: ValueError: Must have equal len keys and value when setting with an iterable" since it ended up being the actual problem which was solved.
Thank you very much for your work on Tobler and on this issue, I really enjoy working with it! Closing as resolved.
Transferred from comments in https://github.com/pysal/tobler/issues/136 per @martinfleis request.
I am encountering an error using masked_area_interpolate() between ~700 US Census tracts to ~200 US Census ZCTAs with nlcd_2019_land_cover as the mask.
Verified features can be extracted successfully from the NLCD geotiff using extract_raster_features().
Example:
ensure crs matches NLCD, leveraging developed_area from previous example
nlcd_crs = developed_area.crs.to_string() source.to_crs(nlcd_crs, inplace=True) target.to_crs(nlcd_crs, inplace=True)
attempt possible solution supplied at https://github.com/pysal/tobler/issues/136#issuecomment-786244084
source.geometry = source.buffer(0) target.geometry = target.buffer(0)
define intensive and extensive variables
intensive: do not depend on sample size, ex: percentages
extensive: depend on sample size: ex: population
intensive = [ "ALAND_SMI_SCALED", "POP_DENSITY_SMI_SCALED", ] extensive = ["ALAND", "POP_DENSITY_SMI"]
masked interpolation
results = masked_area_interpolate(raster=mask, source_df=source, target_df=target, intensive_variables=intensive, extensive_variables=extensive)