Transferred from comments in https://github.com/pysal/tobler/issues/136 per @martinfleis request.

I am encountering an error using masked_area_interpolate() between ~700 US Census tracts to ~200 US Census ZCTAs with nlcd_2019_land_cover as the mask.

CBR: result (after common-bits addition) is INVALID: Self-intersection at or near point...

Verified features can be extracted successfully from the NLCD geotiff using extract_raster_features().

# filepath to geotiff
nlcd_2019_filepath = "./nlcd_2019_land_cover_l48_20210604.img"

# extraction
developed_area = extract_raster_features(
   gdf_select_zctas, nlcd_2019_filepath, pixel_values=[21, 22, 23, 24]
)

display(developed_area.crs, developed_area.info())

<class 'geopandas.geodataframe.GeoDataFrame'> RangeIndex: 1341645 entries, 0 to 1341644 Data columns (total 2 columns): # Column Non-Null Count Dtype
--- ------ -------------- -----
0 value 1341645 non-null float64 1 geometry 1341645 non-null geometry dtypes: float64(1), geometry(1) memory usage: 20.5 MB <Derived Projected CRS: +proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2= ...> Name: unknown Axis Info [cartesian]: - E[east]: Easting (metre) - N[north]: Northing (metre) Area of Use: - undefined Coordinate Operation: - name: unknown - method: Albers Equal Area Datum: World Geodetic System 1984 - Ellipsoid: WGS 84 - Prime Meridian: Greenwich

Example:

# select tracts associated with an office
source = gdf_tracts[~gdf_tracts["OFFICE_ID"].isna()].copy()
target = gdf_zctas[~gdf_zctas["OFFICE_ID"].isna()].copy()
display(source.head(3), target.head(3))

GeoDataFrames display as expected
display(source.is_valid.value_counts(), target.is_valid.value_counts())
True 1111 dtype: int64 True 336 dtype: int64
# designate geotiff representing developed land
mask = "./nlcd_2019_land_cover_l48_20210604.img"

ensure crs matches NLCD, leveraging developed_area from previous example

nlcd_crs = developed_area.crs.to_string() source.to_crs(nlcd_crs, inplace=True) target.to_crs(nlcd_crs, inplace=True)

attempt possible solution supplied at https://github.com/pysal/tobler/issues/136#issuecomment-786244084

source.geometry = source.buffer(0) target.geometry = target.buffer(0)

define intensive and extensive variables

intensive: do not depend on sample size, ex: percentages

extensive: depend on sample size: ex: population

intensive = [ "ALAND_SMI_SCALED", "POP_DENSITY_SMI_SCALED", ] extensive = ["ALAND", "POP_DENSITY_SMI"]

masked interpolation

results = masked_area_interpolate(raster=mask, source_df=source, target_df=target, intensive_variables=intensive, extensive_variables=extensive)



> **CBR: result (after common-bits addition) is INVALID: Self-intersection at or near point** 950195.55049310706 -41030.093982310347 (950195.5504931070609 -41030.093982310347201)
> <A>
> GEOMETRYCOLLECTION (LINESTRING (950043.6612304252339527 -34960.6609232175906072....

Error message is extensive with similar entries as above.

> SYSTEM INFO
\-----------
python     : 3.8.12 (default, Oct 10 2021, 10:19:55)  [GCC 10.3.0]
executable : REDACTED/.venv/bin/python
machine    : Linux-5.13.0-28-generic-x86_64-with-glibc2.32

> GEOS, GDAL, PROJ INFO
\---------------------
GEOS       : 3.9.0
GEOS lib   : /usr/lib/x86_64-linux-gnu/libgeos_c.so
GDAL       : 3.4.1
GDAL data dir: REDACTED/.venv/lib/python3.8/site-packages/fiona/gdal_data
PROJ       : 8.2.0
PROJ data dir: REDACTED/.venv/lib/python3.8/site-packages/pyproj/proj_dir/share/proj

> PYTHON DEPENDENCIES
\-------------------
geopandas  : 0.10.0
pandas     : 1.4.1
fiona      : 1.8.21
numpy      : 1.22.2
shapely    : 1.8.0
rtree      : 0.9.7
pyproj     : 3.3.0
matplotlib : 3.5.1
mapclassify: 2.4.3
geopy      : 2.2.0
psycopg2   : None
geoalchemy2: None
pyarrow    : 6.0.1
pygeos     : 0.10.2

Per comments in (#136)[https://github.com/pysal/tobler/issues/136] as an alternative:
> Another thing to try would be to see if you can follow [this notebook](https://github.com/pysal/tobler/blob/master/notebooks/binary_dasymetric.ipynb) using your data (though you can omit the OSM step if you want) which performs essentially the same task as masked_area_interpolate but with a different approach

This notebook is useful, thank you. Will review and report on any solution.

Update: iterclip() takes anywhere from ~20 to 60 seconds for each row making it prohibitive for large numbers of geometries. (Also, if you have a subset selected from a geodataframe, make sure you have a consecutive index for iterclip to work; see reset_index())

In attempting to create a reproducable example I ran into another issue.

ValueError: Must have equal len keys and value when setting with an iterable

import requests
import geopandas as gpd
from tobler.dasymetric import extract_raster_features, masked_area_interpolate

"""
# commented out as uneeded for example since data is loaded from gist
def miles_radius_by_coords(x_lon:float, y_lat:float, miles:float, gdf:gpd.geodataframe.GeoDataFrame, reset_index:bool=False) -> List[gpd.geodataframe.GeoDataFrame]:
    #Provided a point in epsg:4326, buffers by miles and return intersecting polygons of provided gdf.
    gdf_crs = gdf.crs
    gdf_columns = gdf.columns.tolist()
    window = gpd.GeoDataFrame(geometry=gpd.points_from_xy(x=[x_lon], y=[y_lat], crs="epsg:4326").to_crs("esri:102008").buffer(miles * 1609.344))
    gdf = gdf.to_crs("esri:102008").sjoin(window, how="inner", predicate="intersects")[gdf_columns].to_crs(gdf_crs)
    return gdf.reset_index(drop=True) if reset_index else gdf

# determine nlcd crs
mask = "./nlcd_2019_land_cover_l48_20210604.img"
nlcd_crs = extract_raster_features(target, mask, pixel_values=[21, 22, 23, 24], collapse_values=True).crs

# subset geodataframes by 2 mile radius from location, setting crs to match mask's
miles = 2
coords = (-84.558735, 39.230207)
source = miles_radius_by_coords(*coords, miles, score_geo, reset_index=True).to_crs(nlcd_crs)
target = miles_radius_by_coords(*coords, miles, gdf_select_zctas, reset_index=True).to_crs(nlcd_crs)"""

# designate geotiff path
mask = "./nlcd_2019_land_cover_l48_20210604.img"

# determine nlcd crs
mask = "./nlcd_2019_land_cover_l48_20210604.img"
nlcd_crs = extract_raster_features(target, mask, pixel_values=[21, 22, 23, 24], collapse_values=True).crs

# load data from gist json features
source_url = "https://gist.githubusercontent.com/johnziebro/d1bc6c59f55a8ba5bf6ead4cd3d2d5ee/raw/3e8370228056f05bd5515ea93f1fdaf8af62f778/source"
target_url = "https://gist.githubusercontent.com/johnziebro/0531432faa14d51073674d471ef99d95/raw/a80a8b2a61fffa8a56f2d4fb1721e2278a3dc41c/target"
source = gpd.GeoDataFrame.from_features(requests.get(source_url).json(), crs=nlcd_crs)
target = gpd.GeoDataFrame.from_features(requests.get(target_url).json(), crs=nlcd_crs)

# sanity check
display(source, source.isna().sum(), source.crs, target, target.isna().sum(), target.crs)
source.plot()
target.plot()

Results display as expected

# define intensive and extensive variables
# intensive: do not depend on sample size, ex: percentages
# extensive: depend on sample size: ex: population
intensive = [
    "ALAND_SMI_S",
    "POP_DENSITY_SMI_S",
    "SCORE",
    "SCORE_POP",
    "SCORE_VEH",
    "SCORE_HOME",
    "SCORE_FIN",
]
extensive = ["ALAND", "AWATER", "ALAND_SMI", "POP_DENSITY_SMI"]

# interpolate
results = masked_area_interpolate(raster=mask,
                                  source_df=source,
                                  target_df=target,
                                  intensive_variables=intensive,
                                  extensive_variables=extensive)

results

Traceback

--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Input In [199], in 13 extensive = ["ALAND", "AWATER", "ALAND_SMI", "POP_DENSITY_SMI"] 15 # interpolate ---> 16 results = masked_area_interpolate(raster=mask, 17 source_df=source, 18 target_df=target, 19 intensive_variables=intensive, 20 extensive_variables=extensive) 22 results File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/tobler/dasymetric/masked_area_interpolate.py:57, in masked_area_interpolate(source_df, target_df, raster, codes, force_crs_match, extensive_variables, intensive_variables, allocate_total, tables) 52 raise IOError( 53 "You must pass the path to a raster that can be read with rasterio" 54 ) 56 if not tables: ---> 57 tables = _area_tables_raster( 58 source_df, 59 target_df.copy(), 60 raster_path=raster, 61 codes=codes, 62 force_crs_match=force_crs_match, 63 ) 65 # In area_interpolate, the resulting variable has same length as target_df 66 interpolation = _slow_area_interpolate( 67 source_df, 68 target_df.copy(), (...) 72 tables=tables, 73 ) File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/tobler/area_weighted/area_interpolate.py:614, in _area_tables_raster(source_df, target_df, raster_path, codes, force_crs_match) 611 source_df.loc[:, "_left"] = _left # create temporary index for union 612 target_df.loc[:, "_right"] = _right # create temporary index for union --> 614 res_union_pre = gpd.overlay(source_df, target_df, how="union") 616 # Establishing a CRS for the generated union 617 warnings.warn( 618 "The CRS for the generated union will be set to be the same as source_df." 619 ) File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/geopandas/tools/overlay.py:354, in overlay(df1, df2, how, keep_geom_type, make_valid) 348 # level_0 created with above reset_index operation 349 # and represents the original geometry collections 350 # TODO avoiding dissolve to call unary_union in this case could further 351 # improve performance (we only need to collect geometries in their 352 # respective Multi version) 353 dissolved = exploded.dissolve(by="level_0") --> 354 result.loc[is_collection, geom_col] = dissolved[geom_col].values 355 else: 356 num_dropped_collection = 0 File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:716, in _LocationIndexer.__setitem__(self, key, value) 713 self._has_valid_setitem_indexer(key) 715 iloc = self if self.name == "iloc" else self.obj.iloc --> 716 iloc._setitem_with_indexer(indexer, value, self.name) File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1688, in _iLocIndexer._setitem_with_indexer(self, indexer, value, name) 1685 # align and set the values 1686 if take_split_path: 1687 # We have to operate column-wise -> 1688 self._setitem_with_indexer_split_path(indexer, value, name) 1689 else: 1690 self._setitem_single_block(indexer, value, name) File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1743, in _iLocIndexer._setitem_with_indexer_split_path(self, indexer, value, name) 1738 if len(value) == 1 and not is_integer(info_axis): 1739 # This is a case like df.iloc[:3, [1]] = [0] 1740 # where we treat as df.iloc[:3, 1] = 0 1741 return self._setitem_with_indexer((pi, info_axis[0]), value[0]) -> 1743 raise ValueError( 1744 "Must have equal len keys and value " 1745 "when setting with an iterable" 1746 ) 1748 elif lplane_indexer == 0 and len(value) == len(self.obj.index): 1749 # We get here in one case via .loc with a all-False mask 1750 pass ValueError: Must have equal len keys and value when setting with an iterable

thanks for raising this. Will need to look into the first issue a bit, but my first guess would be possibly something strange with the image file you're using. The second issue is related to your CRS (there is none on your input files). Updating your example to read a version of NLCD stored in an s3 bucket (different year though) I get results. In your original code, the display(target.crs) was showing None, but it was easy to miss with the other results being printed. A quick look at your data looks like maybe its cincinnati? so taking a stab at a crs of 3735 I get the expected results

import requests
import geopandas as gpd
from tobler.dasymetric import extract_raster_features, masked_area_interpolate

# designate geotiff path
mask = "https://spatial-ucr.s3.amazonaws.com/nlcd/landcover/nlcd_landcover_2016.tif"

# load data from gist json features
target_url = "https://gist.githubusercontent.com/johnziebro/0531432faa14d51073674d471ef99d95/raw/a80a8b2a61fffa8a56f2d4fb1721e2278a3dc41c/target"
source_url = "https://gist.githubusercontent.com/johnziebro/d1bc6c59f55a8ba5bf6ead4cd3d2d5ee/raw/3e8370228056f05bd5515ea93f1fdaf8af62f778/source"
source = gpd.GeoDataFrame.from_features(requests.get(source_url).json())
target = gpd.GeoDataFrame.from_features(requests.get(target_url).json())

# sanity check
display(target.crs, source.crs)  # both show None

source.plot()
target.plot()

source.crs = 3735
target.crs = 3735

# extensive: depend on sample size: ex: population
intensive = [
    "ALAND_SMI_S",
    "POP_DENSITY_SMI_S",
    "SCORE",
    "SCORE_POP",
    "SCORE_VEH",
    "SCORE_HOME",
    "SCORE_FIN",
]
extensive = ["ALAND", "AWATER", "ALAND_SMI", "POP_DENSITY_SMI"]

# interpolate
results = masked_area_interpolate(raster=mask,
                                  source_df=source,
                                  target_df=target,
                                  intensive_variables=intensive,
                                  extensive_variables=extensive)

results

print(results)

which produces

          ALAND         AWATER  ALAND_SMI  POP_DENSITY_SMI  ALAND_SMI_S  \
0  1.171083e+07    1167.410256   4.521578     25933.182073     0.005023   
1  3.640912e+07  382619.918149  14.057638     27836.933899     0.008014   
2  1.188643e+07   12190.330131   4.589375     23381.182295     0.005216   
3  1.279708e+07    5011.341463   4.940982     15570.400365     0.008926   

   POP_DENSITY_SMI_S     SCORE  SCORE_POP  SCORE_VEH  SCORE_HOME  SCORE_FIN  \
0           0.154595  0.453222   0.298343   0.162974    0.259685   0.319223   
1           0.097719  0.432718   0.254347   0.202154    0.267997   0.311501   
2           0.157234  0.460360   0.299620   0.214973    0.263867   0.313727   
3           0.110122  0.509490   0.271063   0.237904    0.281837   0.354868   

                                            geometry  
0  POLYGON ((975962.588 1856884.038, 975962.846 1...  
1  POLYGON ((974574.041 1860301.828, 974571.198 1...  
2  POLYGON ((973182.221 1854514.093, 973199.497 1...  
3  POLYGON ((969001.099 1867908.293, 969006.635 1...

results.plot('SCORE_FIN')

also in your earlier example

# commented out as uneeded for example since data is loaded from gist
def miles_radius_by_coords(x_lon:float, y_lat:float, miles:float, gdf:gpd.geodataframe.GeoDataFrame, reset_index:bool=False) -> List[gpd.geodataframe.GeoDataFrame]:
    #Provided a point in epsg:4326, buffers by miles and return intersecting polygons of provided gdf.
    gdf_crs = gdf.crs
    gdf_columns = gdf.columns.tolist()
    window = gpd.GeoDataFrame(geometry=gpd.points_from_xy(x=[x_lon], y=[y_lat], crs="epsg:4236").to_crs("esri:102008").buffer(miles * 1609.344))
    gdf = gdf.to_crs("esri:102008").sjoin(window, how="inner", predicate="intersects")[gdf_columns].to_crs(gdf_crs)
    return gdf.reset_index(drop=True) if reset_index else gdf

you're using epsg 4236 but you probably want 4326? that's probably causing issues too

@knaaptime my apologies, in writing the reproducible example I introduced these crs errors which I do not believe are in my local example. Probably too much screentime in the last few days! Yes, 4326 was intended. In creating the geojson and downloading the example data from the gist, I had assumed the json had retained the crs.

The developed land file was download from https://www.mrlc.gov/data?f%5B0%5D=category%3AUrban%20Imperviousness. extract_raster_features ran successfully and generate a plot of developed areas.

I will update the CRS references above in case anyone comes acorss this later and also run the interpolation with the developed land file you sited and and try to identify exact geometry in my source and targets that may be creating the original issue.

download

cool, let us know how it goes. I'm with you on the screentime--and have made that accidental swap on too many occasions myself ;)

fwiw, if i download the latest version of nlcd, which looks like it's the same you're using, and provided a path to the local file:

import requests
import geopandas as gpd
from tobler.dasymetric import extract_raster_features, masked_area_interpolate

# designate geotiff path
mask = "/Users/knaaptime/Downloads/nlcd_2019_land_cover_l48_20210604/nlcd_2019_land_cover_l48_20210604.img"

# load data from gist json features
target_url = "https://gist.githubusercontent.com/johnziebro/0531432faa14d51073674d471ef99d95/raw/a80a8b2a61fffa8a56f2d4fb1721e2278a3dc41c/target"
source_url = "https://gist.githubusercontent.com/johnziebro/d1bc6c59f55a8ba5bf6ead4cd3d2d5ee/raw/3e8370228056f05bd5515ea93f1fdaf8af62f778/source"
source = gpd.GeoDataFrame.from_features(requests.get(source_url).json())
target = gpd.GeoDataFrame.from_features(requests.get(target_url).json())

source.crs = 3735
target.crs = 3735

# extensive: depend on sample size: ex: population
intensive = [
    "ALAND_SMI_S",
    "POP_DENSITY_SMI_S",
    "SCORE",
    "SCORE_POP",
    "SCORE_VEH",
    "SCORE_HOME",
    "SCORE_FIN",
]
extensive = ["ALAND", "AWATER", "ALAND_SMI", "POP_DENSITY_SMI"]

# interpolate
results = masked_area_interpolate(raster=mask,
                                  source_df=source,
                                  target_df=target,
                                  intensive_variables=intensive,
                                  extensive_variables=extensive)

results

print(results)

I still get the results I posted above (so still not able to reproduce the error locally). Let me know what you discover in your tests

I've updated the reproducible example below. While it does not encounter the original error there are many warnings regarding nan values. What is strange is that isnan().sum() on the the source and target gdfs show no nans present. This is possibly coming from the mask?

import json
import requests
import geopandas as gpd

# type alias
GDF = gpd.geodataframe.GeoDataFrame

# constants
GPS = "epsg:4326"  # WGS 84
DISTANCE = "esri:102005"  # USA Contiguous Equidistant Conic, degrees
AREA = "esri:102008"  # North_America_Albers_Equal_Area_Conic, meters
METERS_PER_MILE = 1609.344

# util functions
def miles_radius_by_coords(
    x_lon: float, y_lat: float, miles: float, gdf: GDF, reset_index: bool = True
) -> GDF:
    """Provided a point in epsg:4326, buffers by miles and return intersecting polygons of provided gdf."""
    gdf_crs = gdf.crs
    gdf_columns = gdf.columns.tolist()
    window = gpd.GeoDataFrame(
        geometry=gpd.points_from_xy(x=[x_lon], y=[y_lat], crs=GPS)
        .to_crs(DISTANCE)
        .buffer(miles * METERS_PER_MILE)
    )
    gdf = (
        gdf.to_crs(DISTANCE)
        .sjoin(window, how="inner", predicate="intersects")[gdf_columns]
        .to_crs(gdf_crs)
    )
    return gdf.reset_index(drop=True) if reset_index else gdf

def get_mask_window(source_gdf: GDF, target_gdf: GDF) -> GDF:
    """Provides a geodataframe consisting of unified geometry of the provided source and target."""
    geometry = gpd.GeoSeries(pd.concat([source_gdf.geometry, target_gdf.geometry]))
    return gpd.GeoDataFrame(geometry=geometry, crs=source_gdf.crs)

# designate geotiff representing developed land
mask = "/media/REDACT/36782A484452D4B2/large_downloads/nlcd_2019_land_cover_l48_20210604/nlcd_2019_land_cover_l48_20210604.img"

# capture polygons intersecting a 2 mile radius from provided coordinates
miles = 2
coords = (-84.558735, 39.230207)
source = miles_radius_by_coords(*coords, miles, score_geo).to_crs(AREA)
target = miles_radius_by_coords(*coords, miles, gdf_select_zctas).to_crs(AREA)

# export for gists
# source_json = source.to_json()
# target_json = target.to_json()

# sub gdfs from github for reproducable example
source_url = "https://gist.githubusercontent.com/johnziebro/603e81d5874d2cbb726baf19eea436a7/raw/296072368b254929bffe2aa7194bc37e3404974c/source_tobler.json"
target_url = "https://gist.githubusercontent.com/johnziebro/cd2129602cc775d82c8afa5b88ef8c1b/raw/68a647e1187cb45e48027e4f6c8eb822ead4eafa/target_tobler.json"
source = gpd.GeoDataFrame.from_features(requests.get(source_url).json(), crs=AREA)
target = gpd.GeoDataFrame.from_features(requests.get(target_url).json(), crs=AREA)

# extract mask features for sanity check
mask_window = get_mask_window(source, target)
urban = extract_raster_features(
    mask_window, mask, pixel_values=[21, 22, 23, 24], collapse_values=True
).to_crs(AREA)

# info
display(
    pd.DataFrame(source.isna().sum()).T,
    source.shape,
    source.crs.to_string(),
    pd.DataFrame(target.isna().sum()).T,
    target.shape,
    target.crs.to_string(),
)

# plot
source.plot()
target.plot()
urban.plot()

# define intensive and extensive variables
# intensive: do not depend on sample size, ex: percentages
# extensive: depend on sample size: ex: population
intensive = [
    "ALAND_SMI_S",
    "POP_DENSITY_SMI_S",
    "SCORE",
    "SCORE_POP",
    "SCORE_VEH",
    "SCORE_HOME",
    "SCORE_FIN",
]
extensive = ["ALAND", "AWATER", "ALAND_SMI", "POP_DENSITY_SMI"]

# interpolate
results = masked_area_interpolate(
    raster=mask,
    source_df=source,
    target_df=target,
    intensive_variables=intensive,
    extensive_variables=extensive,
)

# include ZCTA column
results["ZCTA"] = target["ZCTA5CE10"]

# provide a representative point
results["REP_POINT"] = results["geometry"].apply(
    lambda x: x.representative_point().coords[:][0]
)

# crs to 4326 for lat long representation
results.to_crs(4326, inplace=True)

Many, many "Setting nodata" UserWarnings

/home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/tobler/area_weighted/area_interpolate.py:614: UserWarning: `keep_geom_type=True` in overlay resulted in 205 dropped geometries of different geometry types than df1 has. Set `keep_geom_type=False` to retain all geometries res_union_pre = gpd.overlay(source_df, target_df, how="union") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/tobler/area_weighted/area_interpolate.py:617: UserWarning: The CRS for the generated union will be set to be the same as source_df. warnings.warn( /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly") /home/REDACT/Projects/REDACT/.venv/lib/python3.8/site-packages/rasterstats/io.py:313: UserWarning: Setting nodata to -999; specify nodata explicitly warnings.warn("Setting nodata to -999; specify nodata explicitly")

# transform crs for familiar lon, lat
results.to_crs(GPS, inplace=True)

# figure
fig, ax = plt.subplots(figsize=(20, 11))
plt.tight_layout()

# score
results.plot(
    ax=ax,
    column="SCORE",
    cmap="RdYlGn",
    legend=True,
    edgecolor="none",
    alpha=0.5,
    legend_kwds={"label": "Score"},
)

# contextily tiles for context
ctx.add_basemap(ax=ax, crs=results.crs.to_string(), source=ctx.providers.Stamen.Toner)

# label geometry
results.apply(
    lambda x: ax.annotate(
        text=x["ZCTA"],
        xy=x.geometry.representative_point().coords[0],
        ha="center",
        size=12,
        bbox=dict(boxstyle="round", fc="w")
    ),
    axis=1,
)

# Title and Axis labels
title = f"ZCTA Score\nWithin {miles} mi of {coords[0]}, {coords[1]} - {results.crs.to_string()}"
ax.set_title(title, fontsize=18,)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")

plt.show()

Running the same code, but with a radius of 10 miles instead of 2 results in the following error when running masked_area_interpolate(). Note that the gist data was not used and instead data extracted from the complete gdfs. miles = 10

One other item of note with this setup is that the geotiff is on a secondary drive. I wonder if there could be an IO issue when extracting a large featureset from the image. Though the fact that extract_features() can run successfully and produce a plot seems to rule out it being a rasterio IO issue, see image below.

I will retry with the AWS file provided in the comments above and also move the geotif to the local directory to see if that changes anythong.

Traceback

ValueError: Must have equal len keys and value when setting with an iterable

--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Input In [358], in 13 extensive = ["ALAND", "AWATER", "ALAND_SMI", "POP_DENSITY_SMI"] 15 # interpolate ---> 16 results = masked_area_interpolate( 17 raster=mask, 18 source_df=source, 19 target_df=target, 20 intensive_variables=intensive, 21 extensive_variables=extensive, 22 ) 24 # include ZCTA column 25 results["ZCTA"] = target["ZCTA5CE10"] File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/tobler/dasymetric/masked_area_interpolate.py:57, in masked_area_interpolate(source_df, target_df, raster, codes, force_crs_match, extensive_variables, intensive_variables, allocate_total, tables) 52 raise IOError( 53 "You must pass the path to a raster that can be read with rasterio" 54 ) 56 if not tables: ---> 57 tables = _area_tables_raster( 58 source_df, 59 target_df.copy(), 60 raster_path=raster, 61 codes=codes, 62 force_crs_match=force_crs_match, 63 ) 65 # In area_interpolate, the resulting variable has same length as target_df 66 interpolation = _slow_area_interpolate( 67 source_df, 68 target_df.copy(), (...) 72 tables=tables, 73 ) File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/tobler/area_weighted/area_interpolate.py:614, in _area_tables_raster(source_df, target_df, raster_path, codes, force_crs_match) 611 source_df.loc[:, "_left"] = _left # create temporary index for union 612 target_df.loc[:, "_right"] = _right # create temporary index for union --> 614 res_union_pre = gpd.overlay(source_df, target_df, how="union") 616 # Establishing a CRS for the generated union 617 warnings.warn( 618 "The CRS for the generated union will be set to be the same as source_df." 619 ) File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/geopandas/tools/overlay.py:354, in overlay(df1, df2, how, keep_geom_type, make_valid) 348 # level_0 created with above reset_index operation 349 # and represents the original geometry collections 350 # TODO avoiding dissolve to call unary_union in this case could further 351 # improve performance (we only need to collect geometries in their 352 # respective Multi version) 353 dissolved = exploded.dissolve(by="level_0") --> 354 result.loc[is_collection, geom_col] = dissolved[geom_col].values 355 else: 356 num_dropped_collection = 0 File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:716, in _LocationIndexer.__setitem__(self, key, value) 713 self._has_valid_setitem_indexer(key) 715 iloc = self if self.name == "iloc" else self.obj.iloc --> 716 iloc._setitem_with_indexer(indexer, value, self.name) File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1688, in _iLocIndexer._setitem_with_indexer(self, indexer, value, name) 1685 # align and set the values 1686 if take_split_path: 1687 # We have to operate column-wise -> 1688 self._setitem_with_indexer_split_path(indexer, value, name) 1689 else: 1690 self._setitem_single_block(indexer, value, name) File ~/Projects/REDACT/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1743, in _iLocIndexer._setitem_with_indexer_split_path(self, indexer, value, name) 1738 if len(value) == 1 and not is_integer(info_axis): 1739 # This is a case like df.iloc[:3, [1]] = [0] 1740 # where we treat as df.iloc[:3, 1] = 0 1741 return self._setitem_with_indexer((pi, info_axis[0]), value[0]) -> 1743 raise ValueError( 1744 "Must have equal len keys and value " 1745 "when setting with an iterable" 1746 ) 1748 elif lplane_indexer == 0 and len(value) == len(self.obj.index): 1749 # We get here in one case via .loc with a all-False mask 1750 pass ValueError: Must have equal len keys and value when setting with an iterable

Same ValueError as above when using the compressed nlcd_landcover_2016.tif mask loaded from a local directory. Features can be extracted and plotted, however, masked_area_interpolate() fails with

ValueError: Must have equal len keys and value when setting with an iterable

interesting, thanks! I don't know where CBR: result (after common-bits addition) is INVALID: Self-intersection at or near point... comes from, but I think I know what's going on with ValueError: Must have equal len keys and value when setting with an iterable

two quick things:

The NaN warnings pop up because in the old function we internally convert the raster to a vector, so every pixel needs a population value
I'm no expert on coordinate systems, but since you're doing distance calculations i think you want 102005 instead of 102008

but neither of those are leading to the error you're seeing

tl;dr: Here's a function that should get you there (and i'd like to merge this into tobler as a replacement for the current version shortly)

import geopandas as gpd
from tobler.area_weighted import area_interpolate
from tobler.dasymetric import extract_raster_features

def masked_area_interpolate(
    source_df,
    target_df,
    raster,
    codes=[21, 22, 23, 24],
    extensive_variables=None,
    intensive_variables=None,
    categorical_variables=None,
    allocate_total=True,
    nodata=255,
    n_jobs=-1,
):
    source_df = source_df.copy()
    assert not any(
        source_df.index.duplicated()
    ), "The index of the source_df cannot contain duplicates."

    #  create a vector mask from the raster data
    raster_mask = extract_raster_features(
        source_df, raster, codes, nodata, n_jobs, collapse_values=True
    )
    idx_name = source_df.index.name if source_df.index.name else "index"
    source_df[idx_name] = source_df.index

    # clip source_df by its mask (overlay/dissolve is faster than gpd.clip here)
    source_df = gpd.overlay(
        source_df, raster_mask.to_crs(source_df.crs), how="intersection"
    ).dissolve(idx_name)

    # continue with standard areal interpolation using the clipped source
    interpolation = area_interpolate(
        source_df,
        target_df.copy(),
        extensive_variables=extensive_variables,
        intensive_variables=intensive_variables,
        n_jobs=n_jobs,
        categorical_variables=categorical_variables,
        allocate_total=allocate_total,
    )
    return interpolation

(in my quick tests that is faster than the old version of the function anyway)

background: we wrote the old version of masked_raster_interpolate in the beginning of the library, before pygeos existed, and it operates basically at the pixel-level. Later, we came up with a major performance enhancement by creating an intersection table ahead of time, and using that to speed up the area weighting process. In the transition to the faster version, we focused on the area_weighted function, but i think we goofed the logic for raster masking a bit... I think what's happening in the raster tabling process is that we essentially use the raster to search for any polygons in source_df that don't contain any of the specified pixel values, then exclude those polys from the analysis. That can lead to a situation where the tables are misaligned, so you get a mismatch in the length of the iterables.

Operating at the raster level was always hacky and a method of last-resort, so now that the tools are in place, the better path forward is to operate at the vector level. The function above does that instead.

I always find that building the tooling takes the most effort, but once you have the tools, the "house" is so much easier to build. I like the addition of the categoricals param, I don't think it was in the old version. I was going to address categoricals by copying the required Series from the target gdf to the result gdf, but this is good.

Good call on distance with the radius selection function. I've updated the comment above to use the appropriate CRS.

That works beautifully on the 10 mile radius!

I can confirm the new masked_raster_interpolate function works on my complete unfiltered geodataframes of Census Tracts and ZCTAs spanning about 10 counties with no errors in a reasonable amount of time.

Updating issue title to "masked_raster_interpolate: ValueError: Must have equal len keys and value when setting with an iterable" since it ended up being the actual problem which was solved.

Thank you very much for your work on Tobler and on this issue, I really enjoy working with it! Closing as resolved.

pysal / tobler

masked_raster_interpolate: ValueError: Must have equal len keys and value when setting with an iterable #156

ensure crs matches NLCD, leveraging developed_area from previous example

attempt possible solution supplied at https://github.com/pysal/tobler/issues/136#issuecomment-786244084

define intensive and extensive variables

intensive: do not depend on sample size, ex: percentages

extensive: depend on sample size: ex: population

masked interpolation